
Example of Orchestrating Notebook Executions on Dataproc Serverless via Cloud Composer #1006

Closed
wants to merge 5 commits

Conversation

kristin-kim (Contributor)


This PR adds an example of orchestrating notebook runs on Dataproc Serverless via Cloud Composer using a specific Airflow operator, DataprocCreateBatchOperator(). It contains a wrapper file for notebook execution via a PySpark job, Composer DAGs and sample input resources, and a sample Spark notebook with the datasets used in the sample Spark session.

The typical customer scenario for this example is to 1) migrate and stage Spark notebooks from a legacy data lake to GCS, and then 2) set up orchestration to deploy the staged notebooks on Dataproc Serverless as Spark batch jobs.

@pull-request-size bot added the size/XXL label (denotes a PR that changes 1000+ lines) on Mar 21, 2023
@kristin-kim kristin-kim marked this pull request as ready for review March 21, 2023 19:25
@NiloFreitas (Member) left a comment


Nice work @kristin-kim :)

" .appName(\"Spark Session for Electric Vehicle Population\") \\\n",
" .getOrCreate()\n",
"\n",
"gcs_bucket = \"kristin-0105\"\n",

Don't hard-code personal buckets. Try to think of another way when developing this Spark code. You could, for example, have the notebook read a YAML file with the values of these parameters, and keep them in this repo as <generic_placeholders>
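As an illustration of that suggestion (not code from this PR), here is a minimal sketch assuming a hypothetical params.yaml committed alongside the notebook with placeholder values:

```python
import yaml

# Hypothetical params.yaml kept in the repo, e.g.:
#   gcs_bucket: <your-gcs-bucket>
#   dataset_path: notebooks/datasets/electric_vehicle_population.csv
with open("params.yaml") as f:
    params = yaml.safe_load(f)

# The notebook then builds its paths from the loaded values instead of a personal bucket.
gcs_bucket = params["gcs_bucket"]
dataset_uri = f"gs://{gcs_bucket}/{params['dataset_path']}"
```

Since the notebook is already executed through papermill, another option would be to tag a parameters cell and inject these values at execution time (e.g. `papermill ... -p gcs_bucket <your-gcs-bucket>`).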

" .appName(\"Spark Session for Electric Vehicle Population\") \\\n",
" .getOrCreate()\n",
"\n",
"gcs_bucket = \"kristin-0105\"\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't hard code personal buckets. Try to think of another way when developing these Spark codes. You could for example, from the notebook, read a yaml file with the value of these parameters. And they are kept in this repo as <generic_placeholders>

"metadata": {},
"outputs": [],
"source": [
"# !spark-shell"

You don't need this, do you?


phs_region = Variable.get('phs_region')
phs = Variable.get('phs')

# Arguments to pass to Cloud Dataproc job.

Actually, these arguments are passed to your papermill wrapper, right? Via Dataproc, because you are running the Python file using Dataproc.
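For readers of this thread, a hedged sketch of how those arguments flow from the DAG through the Dataproc Serverless batch to the wrapper; bucket names, IDs, and argument order are placeholders rather than the exact values in this PR:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

# Hypothetical Airflow Variable; the PR reads similar values (phs, phs_region) the same way.
gcs_bucket = Variable.get("gcs_bucket")

with DAG(
    dag_id="run_notebooks_dataproc",
    start_date=datetime(2023, 3, 1),
    schedule_interval=None,
) as dag:
    # The PySpark batch runs wrapper_papermill.py; everything in "args" is received
    # by that wrapper (via sys.argv / argparse), not interpreted by Dataproc itself.
    run_notebook = DataprocCreateBatchOperator(
        task_id="run_notebook_on_dataproc_serverless",
        project_id="<project-id>",
        region="<region>",
        batch_id="<batch-id>",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": f"gs://{gcs_bucket}/composer_input/wrapper_papermill.py",
                "args": [
                    f"gs://{gcs_bucket}/notebooks/jupyter/spark_notebook.ipynb",
                    f"gs://{gcs_bucket}/notebooks/jupyter/output/spark_notebook_output.ipynb",
                ],
            },
        },
    )
```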

# gcp-dataproc_serverless-running-notebooks

## Objective
Orchestrator to run Notebooks on Dataproc Serverless via Cloud Composer

Orchestrate data pipelines built using PySpark and Jupyter Notebooks from Airflow/Cloud Composer, leveraging Dataproc Serverless

## File Details
### composer_input
* **wrapper_papermill.py**: runs a papermill execution of the input notebook and writes the output file to the assigned location
* **serverless_airflow.py**: orchestrates the workflow

Change filename to something like: dag_run_notebooks_dataproc.py
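For context on the wrapper_papermill.py described above, a rough sketch of what such a papermill wrapper typically looks like; the argument names are illustrative and may differ from the file in this PR, and gs:// paths assume gcsfs is available in the batch runtime:

```python
import argparse

import papermill as pm


def main() -> None:
    # The DAG passes these values via the "args" of the Dataproc Serverless PySpark batch.
    parser = argparse.ArgumentParser(description="Run a notebook with papermill.")
    parser.add_argument("input_notebook", help="gs:// path of the notebook to execute")
    parser.add_argument("output_notebook", help="gs:// path for the executed copy")
    args = parser.parse_args()

    # papermill executes the notebook cell by cell and writes the executed copy
    # (with outputs) to the assigned location.
    pm.execute_notebook(args.input_notebook, args.output_notebook)


if __name__ == "__main__":
    main()
```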


1. Make sure to modify the GCS paths for the datasets in the notebook

2. Create [Persistent History Server](https://cloud.google.com/dataproc/docs/concepts/jobs/history-server)

Wouldn't this be optional? Why did you set it as a necessary step? Removing this step and pointing to the documentation if the user wants to use PHS would increase simplicity and possibly adoption.
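If PHS becomes optional, one hedged sketch of wiring it in only when the corresponding Airflow Variables are set (field names follow the Dataproc Batches API; project, bucket, and notebook paths are placeholders):

```python
from airflow.models import Variable

# Optional PHS: attach a Spark History Server to the batch only if configured.
phs = Variable.get("phs", default_var=None)
phs_region = Variable.get("phs_region", default_var=None)

batch = {
    "pyspark_batch": {
        "main_python_file_uri": "gs://<bucket>/composer_input/wrapper_papermill.py",
        "args": [
            "gs://<bucket>/notebooks/jupyter/spark_notebook.ipynb",
            "gs://<bucket>/notebooks/jupyter/output/spark_notebook_output.ipynb",
        ],
    },
}
if phs and phs_region:
    batch["environment_config"] = {
        "peripherals_config": {
            "spark_history_server_config": {
                "dataproc_cluster": f"projects/<project-id>/regions/{phs_region}/clusters/{phs}",
            }
        }
    }
```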

├── notebooks
│ ├── datasets/ electric_vehicle_population.csv
│ ├── jupyter/ spark_notebook.ipynb
│ ├── jupyter/output spark_notebook_outbook.ipynb

Remove this if you accept the other comment.

3. Find the DAGs folder in the Composer environment and add serverless_airflow.py (the DAG file) to it in order to trigger DAG execution:
(Screenshot: DAG folder in the Cloud Composer console)

4. Have all the files available in a GCS bucket, except the DAG file, which should go into your Composer DAGs folder

  • Manually, or via Continuous Integration, copy the notebook source code files to the appropriate GCS bucket

  • Manually, or via Continuous Integration, copy the Airflow DAG files to your DAGs folder of your Cloud Composer environment GCS bucket
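A minimal sketch of that copy step using the google-cloud-storage client, with placeholder bucket and path names (a gsutil cp in a CI pipeline would do the same job):

```python
from google.cloud import storage

client = storage.Client(project="<project-id>")

# Notebook sources and the papermill wrapper go to the working GCS bucket...
work_bucket = client.bucket("<your-gcs-bucket>")
for local_path, remote_path in [
    ("composer_input/wrapper_papermill.py", "composer_input/wrapper_papermill.py"),
    ("notebooks/jupyter/spark_notebook.ipynb", "notebooks/jupyter/spark_notebook.ipynb"),
]:
    work_bucket.blob(remote_path).upload_from_filename(local_path)

# ...while the DAG file goes to the dags/ folder of the Composer environment bucket.
dags_bucket = client.bucket("<composer-environment-bucket>")
dags_bucket.blob("dags/dag_run_notebooks_dataproc.py").upload_from_filename(
    "composer_input/serverless_airflow.py"
)
```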

@agold-rh (Contributor)

@kristin-kim If you can address the review questions, I can try to get this merged.

@agold-rh (Contributor)

Closed as stale. Please re-open if I'm wrong.

@agold-rh agold-rh closed this May 17, 2024