Apache Spark on Alibaba Cloud MaxCompute


ODPS is the former name of the service now known as MaxCompute. This document uses MaxCompute for consistency, but some technical parts still use ODPS.

Spark on MaxCompute is a computing service provided by Alibaba Cloud. It is compatible with open-source Spark and provides a Spark computing framework based on unified computing resources and a dataset permission system, which allows you to submit and run Spark jobs using your preferred development method. Spark on MaxCompute can fulfill diverse data processing and analysis needs. [1]

In this repo, some common Spark operations are implemented, such as RDDs, DataFrames, Spark SQL, and MLlib, as well as integrations with MaxCompute-specific services: MaxCompute itself (a big data platform, not unlike BigQuery or Redshift), OSS (object storage), and DataWorks (a unified orchestrator, comparable to Airflow).
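
To give a flavor of what these examples look like, here is a minimal sketch of a Spark SQL job against a MaxCompute table. The table name mc_demo_table is a made-up placeholder, and the sketch assumes the odps catalog has been configured as described in the setup steps below.

import org.apache.spark.sql.SparkSession

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    // With spark.sql.catalogImplementation=odps, Spark SQL resolves
    // table names directly against the MaxCompute project.
    val spark = SparkSession
      .builder()
      .appName("SparkSqlDemo")
      .getOrCreate()

    // "mc_demo_table" is a placeholder; substitute a table in your project.
    val df = spark.sql("SELECT * FROM mc_demo_table LIMIT 10")
    df.show()

    spark.stop()
  }
}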

Local Environment Setup

To develop Spark on MaxCompute projects, a local development environment needs to be set up. The easiest way to do this is via MaxCompute Studio, an IntelliJ IDEA plugin. However, those without IntelliJ IDEA can still set up their own development environment. The following sections show how.

1. Requirements

  • Download the Spark on MaxCompute client, extract it, and note the extracted path
  • Install Java 1.8
  • Install Apache Maven

2. Set environment variables

macOS and Linux users can add the following variables to the ~/.bash_profile file.

export SPARK_HOME=/path/to/extracted/spark/client/from/step/1/above
export PATH=$SPARK_HOME/bin:$PATH

Make sure to run source ~/.bash_profile from your terminal to load the environment variables.
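
On newer macOS versions, where zsh is the default shell, add the variables to ~/.zshrc and run source ~/.zshrc instead.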

3. Configure the spark-defaults.conf file

Go to the $SPARK_HOME/conf directory. There, a spark-defaults.conf.template file can be found. Copy this file and rename the copy to spark-defaults.conf. The Spark on MaxCompute client is configured in this file.

# spark-defaults.conf
# Enter the MaxCompute project name and account information.
# Note: comments must be on their own lines; a trailing comment would
# become part of the property value.

# MaxCompute project name
spark.hadoop.odps.project.name = XXX
# Alibaba Cloud account access ID
spark.hadoop.odps.access.id = XXX
# Alibaba Cloud account access key
spark.hadoop.odps.access.key = XXX

# Retain the following default settings.
# Find the correct endpoints for your MaxCompute project region at:
# https://www.alibabacloud.com/help/doc-detail/34951.htm
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api
# Generally the same region as above
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api
spark.sql.catalogImplementation = odps
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper

spark.hadoop.odps.cupid.webproxy.endpoint = http://service.cn.maxcompute.aliyun-inc.com/api
spark.hadoop.odps.moye.trackurl.host = http://jobview.odps.aliyun.com

For some features, additional configuration might be needed. Refer to this documentation for more detail.
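
For quick local experiments or unit tests, the same keys can also be set programmatically when building the Spark session instead of editing spark-defaults.conf. This is only a sketch; the XXX values are placeholders for your own project name and credentials, and spark-defaults.conf remains the usual place to configure submitted jobs.

import org.apache.spark.sql.SparkSession

// A sketch: the same odps keys from spark-defaults.conf, set in code.
val spark = SparkSession
  .builder()
  .appName("ConfiguredInCode")
  .config("spark.hadoop.odps.project.name", "XXX")
  .config("spark.hadoop.odps.access.id", "XXX")
  .config("spark.hadoop.odps.access.key", "XXX")
  .config("spark.hadoop.odps.end.point", "http://service.cn.maxcompute.aliyun.com/api")
  .config("spark.sql.catalogImplementation", "odps")
  .getOrCreate()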

4. Clone this repository

The simplest way is to run:

git clone https://github.com/iahsanujunda/maxcompute-spark.git

from the terminal.

Building Package

Thanks to Maven, the whole build process is streamlined. Simply run:

mvn clean package

This will resolve all dependencies, run the tests, and package an executable .jar.
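
During iterative development, the tests can be skipped with the standard Maven flag, i.e. mvn clean package -DskipTests; run the full build before actually submitting jobs.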

Running Spark Programs on Maxcompute

From the local development environment, Spark on MaxCompute provides two run modes: local mode and cluster mode.

Local Mode

In this mode, the Spark on MaxCompute client runs on the local machine but makes use of the Tunnel service to read and write data in MaxCompute. Take note of the local[N] part: N indicates the number of CPU cores the client will use.

To execute, run:

$SPARK_HOME/bin/spark-submit --master local[4] \
--class com.aliyun.odps.spark.examples.SparkPi \
${path to project directory}/target/maxcompute-spark-1.0-SNAPSHOT.jar
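
For reference, SparkPi is the classic Monte Carlo Pi estimator from the Spark examples. A minimal sketch of what such an entry point might look like is shown below; the actual class in this repo may differ in detail.

package com.aliyun.odps.spark.examples

import org.apache.spark.sql.SparkSession

import scala.math.random

object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkPi")
      .getOrCreate()

    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices

    // Sample points uniformly in the unit square and count how many fall
    // inside the unit circle; that fraction approximates Pi/4.
    val count = spark.sparkContext
      .parallelize(1 to n, slices)
      .map { _ =>
        val x = random * 2 - 1
        val y = random * 2 - 1
        if (x * x + y * y <= 1) 1 else 0
      }
      .reduce(_ + _)

    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}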

Cluster Mode

In cluster mode, the Spark program runs on the MaxCompute clusters. Note that this means resource files need to be uploaded to the MaxCompute clusters, so this mode may take longer to launch than local mode, depending on the internet connection. However, it reflects the actual environment the code will face in production.

To execute, run:

$SPARK_HOME/bin/spark-submit --master yarn-cluster \
--class com.aliyun.odps.spark.examples.SparkPi \
${path to project directory}/target/maxcompute-spark-1.0-SNAPSHOT.jar

Reference

  1. Spark on MaxCompute Overview
