Source
```yaml
id: wikipedia-top10-python-pandas
namespace: company.team
description: Analyze top 10 Wikipedia pages

tasks:
  - id: query
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      SELECT DATETIME(datehour) AS date, title, views
      FROM `bigquery-public-data.wikipedia.pageviews_2024`
      WHERE DATE(datehour) = CURRENT_DATE() AND wiki = 'en'
      ORDER BY datehour DESC, views DESC
      LIMIT 10
    store: true
    projectId: test-project
    serviceAccount: "{{ secret('GCP_SERVICE_ACCOUNT_JSON') }}"

  - id: write_csv
    type: io.kestra.plugin.serdes.csv.IonToCsv
    from: "{{ outputs.query.uri }}"

  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    inputFiles:
      data.csv: "{{ outputs.write_csv.uri }}"
    script: |
      import pandas as pd
      from kestra import Kestra

      df = pd.read_csv("data.csv")
      print(df.head(10))  # preview the queried rows in the task logs
      views = df["views"].max()
      Kestra.outputs({"views": int(views)})
```
About this blueprint
Tags: Python, GCP
This flow will do the following:
- Use the `bigquery.Query` task to query the top 10 Wikipedia pages for the current day.
- Use `IonToCsv` to store the query results in a CSV file.
- Use the `python.Script` task to read the CSV file and find the maximum number of views with pandas.
- Use Kestra `outputs` to track the maximum number of views over time (see the sketch after this list).
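Because the script publishes `views` via `Kestra.outputs`, downstream tasks can reference it with the `outputs.<task_id>.vars.<key>` expression, and a schedule trigger can re-run the flow daily so the metric accumulates over time. A minimal sketch, assuming the core `Log` task and `Schedule` trigger types; the `log_views` and `daily` ids and the cron expression are illustrative:

```yaml
tasks:
  # ...the three tasks above, then:
  - id: log_views
    type: io.kestra.plugin.core.log.Log
    message: "Max views today: {{ outputs.pandas.vars.views }}"

triggers:
  - id: daily
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 6 * * *"  # run every morning at 06:00
```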
The Python script runs in a Docker container based on the public image `ghcr.io/kestra-io/pydata:latest`.
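If you'd rather not depend on that image, the same task can start from a plain Python base image and install its dependencies at runtime. A sketch, assuming the script task's `beforeCommands` property; the base image tag is illustrative:

```yaml
  - id: pandas
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: python:3.11-slim
    beforeCommands:
      - pip install pandas kestra  # packages the pydata image ships preinstalled
    inputFiles:
      data.csv: "{{ outputs.write_csv.uri }}"
    script: |
      # same script as in the flow above
```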
The BigQuery task exposes (by default) a variety of metrics, such as:
- `total.bytes.billed`
- `total.partitions.processed`
- number of rows processed
- query duration
You can view those metrics on the Execution page in the Metrics tab.