Beam Datasets

Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.

We have already created Beam pipelines for some of the larger datasets, such as wikipedia and wiki40b. You can load these normally with [load_dataset]. If you want to run your own Beam pipeline with Dataflow, follow these steps:

  1. Specify the dataset and configuration you want to process:
DATASET_NAME=your_dataset_name  # ex: wikipedia
CONFIG_NAME=your_config_name    # ex: 20220301.en
  2. Input your Google Cloud Platform information:
PROJECT=your_project
BUCKET=your_bucket
REGION=your_region
  3. Specify your Python requirements:
echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt
  4. Run the pipeline:
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_info \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"region=$REGION,requirements_file=/tmp/beam_requirements.txt"

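The steps above can also be collected into a single script. The sketch below is a dry run under stated assumptions: the values `wikipedia`, `my-project`, `my-bucket`, and `us-central1` are illustrative placeholders, not defaults, and the variable names `BEAM_OPTIONS` and `CMD` are introduced here only for readability. It assembles the comma-separated option string piece by piece and prints the resulting command instead of launching a job, so the values can be checked before anything is submitted.

```shell
# Placeholder values for illustration only; replace them with your own.
DATASET_NAME=wikipedia
CONFIG_NAME=20220301.en
PROJECT=my-project
BUCKET=my-bucket
REGION=us-central1
REQUIREMENTS=/tmp/beam_requirements.txt

# Write the Python requirements for the workers, one package per line.
printf '%s\n' datasets apache_beam > "$REQUIREMENTS"

# Build the comma-separated Beam option string piece by piece.
BEAM_OPTIONS="runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen"
BEAM_OPTIONS="$BEAM_OPTIONS,staging_location=gs://$BUCKET/binaries"
BEAM_OPTIONS="$BEAM_OPTIONS,temp_location=gs://$BUCKET/temp"
BEAM_OPTIONS="$BEAM_OPTIONS,region=$REGION,requirements_file=$REQUIREMENTS"

# Dry run: print the command rather than executing it.
CMD="datasets-cli run_beam datasets/$DATASET_NAME --name $CONFIG_NAME --save_info"
CMD="$CMD --cache_dir gs://$BUCKET/cache/datasets --beam_pipeline_options=$BEAM_OPTIONS"
echo "$CMD"
```

Once the printed command looks right, it can be run as shown in step 4. Keeping the options in one variable makes it easier to spot a missing comma or a stray space than in a string split across continuation lines.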
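When you run your pipeline, you can adjust these parameters to change the runner (for example Flink or Spark, through the `runner` entry in `--beam_pipeline_options`), the output location (for example an S3 bucket or HDFS, through `--cache_dir`), and the number of workers.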
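As a rough illustration of swapping runners, the sketch below maps a short runner name to an option string that could be passed to `--beam_pipeline_options`. The helper `beam_options_for` is hypothetical, and the endpoints (`localhost:8081`, `spark://localhost:7077`) and worker counts are placeholders. The option names (`flink_master`, `spark_master_url`, `num_workers`, `max_num_workers`) are taken from Apache Beam's Python pipeline options and should be checked against the Beam version you install.

```shell
# Hypothetical helper: print a Beam option string for the chosen runner.
# Endpoints and worker counts below are placeholders, not recommendations.
beam_options_for() {
  case "$1" in
    dataflow) echo "runner=DataflowRunner,num_workers=4,max_num_workers=16" ;;
    flink)    echo "runner=FlinkRunner,flink_master=localhost:8081" ;;
    spark)    echo "runner=SparkRunner,spark_master_url=spark://localhost:7077" ;;
    direct)   echo "runner=DirectRunner" ;;
    *)        echo "unknown runner: $1" >&2; return 1 ;;
  esac
}

# Example: select the Flink options and show them.
BEAM_OPTIONS=$(beam_options_for flink)
echo "$BEAM_OPTIONS"
```

The resulting string takes the place of the Dataflow options in the command above. The output location is changed separately through `--cache_dir`.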