Choosing a Beam Runner

All tools use Apache Beam pipelines. By default, pipelines run locally using the DirectRunner. You can optionally run the pipelines on Google Cloud Dataflow by selecting the DataflowRunner.

When working with GCP, it's recommended that you set the project ID up front with the command:

gcloud config set project <your-id>
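To confirm the active project, or to export it for reuse in the examples below (the $PROJECT variable name is just illustrative), you can run:

gcloud config get-value project
export PROJECT=$(gcloud config get-value project)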

Direct Runner options:

  • --direct_num_workers: The number of workers to use. We recommend 2 for local development.

Example run:

weather-mv -i gs://netcdf_file.nc \
  -o $PROJECT.$DATASET_ID.$TABLE_ID \
  -t gs://$BUCKET/tmp  \
  --direct_num_workers 2

For a full list of how to configure the direct runner, please review the Apache Beam DirectRunner documentation.
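If local runs become a bottleneck, Beam's Python SDK also exposes a --direct_running_mode option (in_memory, multi_threading, or multi_processing). Assuming the tool forwards standard Beam pipeline options, a multi-process run might look like:

weather-mv -i gs://netcdf_file.nc \
  -o $PROJECT.$DATASET_ID.$TABLE_ID \
  -t gs://$BUCKET/tmp \
  --direct_num_workers 2 \
  --direct_running_mode multi_processing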

Dataflow options:

  • --runner: The PipelineRunner to use. This field can be either DirectRunner or DataflowRunner. Default: DirectRunner (local mode)
  • --project: The project ID for your Google Cloud Project. This is required if you want to run your pipeline using the Dataflow managed service (i.e. DataflowRunner).
  • --temp_location: Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --region: Specifies a regional endpoint for deploying your Dataflow jobs. Default: us-central1.
  • --job_name: The name of the Dataflow job being executed as it appears in Dataflow's jobs list and job details.
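The example below assumes that $PROJECT, $REGION, and $BUCKET have already been exported, along the lines of:

export PROJECT=<your-project-id>
export REGION=<your-region>
export BUCKET=<your-bucket-name>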

Example run:

weather-dl configs/seasonal_forecast_example_config.cfg \
  --runner DataflowRunner \
  --project $PROJECT \
  --region $REGION \
  --temp_location gs://$BUCKET/tmp/

For a full list of how to configure the Dataflow pipeline, please review the pipeline options table in the Apache Beam Dataflow runner documentation.

Monitoring

When running on Dataflow, you can monitor jobs through the Dataflow UI in the Google Cloud Console, or via Dataflow's gcloud CLI commands:

For example, to list your Dataflow jobs, run:

gcloud dataflow jobs list
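If you want to narrow the output, the command also accepts --region and --status flags, for example (assuming $REGION is set):

gcloud dataflow jobs list --region=$REGION --status=active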

To view details about a particular Dataflow job, run:

gcloud dataflow jobs describe $JOBID
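The $JOBID used here and in the commands below comes from the jobs list above; one way to capture the most recent job's ID is a sketch like this (assuming the job resource exposes an id field to gcloud's --format):

JOBID=$(gcloud dataflow jobs list --limit=1 --format="value(id)")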

In addition, Dataflow provides a series of Beta CLI commands.

These can be used to keep track of job metrics, like so:

JOBID=<enter job id here>
gcloud beta dataflow metrics list $JOBID --source=user

You can even view logs via the beta commands:

gcloud beta dataflow logs list $JOBID
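If the log output is noisy, the logs command also supports an --importance flag to filter by severity; the exact accepted values are worth confirming with gcloud beta dataflow logs list --help, but a filtered call might look like:

gcloud beta dataflow logs list $JOBID --importance=warning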