Apache Spark on Alibaba Cloud MaxCompute


ODPS is the former name of the service now known as MaxCompute. This document uses MaxCompute for consistency, but some technical parts still use ODPS.

Spark on MaxCompute is a computing service provided by Alibaba Cloud. It is compatible with open-source Spark and provides a Spark computing framework based on unified computing resources and a dataset permission system, which allows you to submit and run Spark jobs using your preferred development method. Spark on MaxCompute can fulfill diverse data processing and analysis needs. [1]

In this repo, some common Spark operations are implemented, such as RDDs, DataFrames, Spark SQL, and MLlib, as well as integrations with MaxCompute-specific services: MaxCompute itself (a big data platform, not unlike BigQuery or Redshift), OSS (object storage), and DataWorks (a unified orchestrator, comparable to Airflow).
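
To give a flavor of what these examples look like, here is a minimal sketch of a Spark SQL job against a MaxCompute table. The table name mc_demo_table is a made-up placeholder, and the sketch assumes the odps catalog has been configured as described in the setup steps below.

import org.apache.spark.sql.SparkSession

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    // With spark.sql.catalogImplementation=odps, Spark SQL resolves
    // table names directly against the MaxCompute project.
    val spark = SparkSession
      .builder()
      .appName("SparkSqlDemo")
      .getOrCreate()

    // "mc_demo_table" is a placeholder; substitute a table in your project.
    val df = spark.sql("SELECT * FROM mc_demo_table LIMIT 10")
    df.show()

    spark.stop()
  }
}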

Local Environment Setup

To develop Spark on MaxCompute projects, a local development environment needs to be set up. The easiest way to do this is via MaxCompute Studio, an IntelliJ IDEA plugin. However, those without IntelliJ IDEA can still set up their own development environment. The following sections show how.

1. Requirements

  • Download the Spark on MaxCompute client, extract it, and note the extracted path
  • Install Java 1.8
  • Install Apache Maven

2. Set environment variables

macOS and Linux users can add the following variables to the ~/.bash_profile file.

export SPARK_HOME=/path/to/extracted/spark/client/from/step/1/above
export PATH=$SPARK_HOME/bin:$PATH

Make sure to run source ~/.bash_profile from your terminal to load the environment variables.
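
On newer macOS versions, where zsh is the default shell, add the variables to ~/.zshrc and run source ~/.zshrc instead.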

3. Configure the spark-defaults.conf file

Go to the $SPARK_HOME/conf directory. There, a spark-defaults.conf.template file can be found. Copy this file and rename the copy to spark-defaults.conf. The Spark on MaxCompute client is configured in this file.

# spark-defaults.conf
# Enter the MaxCompute project name and account information.
# Note: comments must be on their own lines; a trailing comment would
# become part of the property value.

# MaxCompute project name
spark.hadoop.odps.project.name = XXX
# Alibaba Cloud account access ID
spark.hadoop.odps.access.id = XXX
# Alibaba Cloud account access key
spark.hadoop.odps.access.key = XXX

# Retain the following default settings.
# Find the correct endpoints for your MaxCompute project region at:
# https://www.alibabacloud.com/help/doc-detail/34951.htm
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api
# Generally the same region as above
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api
spark.sql.catalogImplementation = odps
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper

spark.hadoop.odps.cupid.webproxy.endpoint = http://service.cn.maxcompute.aliyun-inc.com/api
spark.hadoop.odps.moye.trackurl.host = http://jobview.odps.aliyun.com

For some features, additional configuration might be needed. Refer to this documentation for more detail.
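
For quick local experiments or unit tests, the same keys can also be set programmatically when building the Spark session instead of editing spark-defaults.conf. This is only a sketch; the XXX values are placeholders for your own project name and credentials, and spark-defaults.conf remains the usual place to configure submitted jobs.

import org.apache.spark.sql.SparkSession

// A sketch: the same odps keys from spark-defaults.conf, set in code.
val spark = SparkSession
  .builder()
  .appName("ConfiguredInCode")
  .config("spark.hadoop.odps.project.name", "XXX")
  .config("spark.hadoop.odps.access.id", "XXX")
  .config("spark.hadoop.odps.access.key", "XXX")
  .config("spark.hadoop.odps.end.point", "http://service.cn.maxcompute.aliyun.com/api")
  .config("spark.sql.catalogImplementation", "odps")
  .getOrCreate()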

4. Clone this repository

The simplest way is to run:

git clone https://github.com/iahsanujunda/maxcompute-spark.git

from the terminal.

Building Package

Thanks to Maven, the whole build process is streamlined. Simply run:

mvn clean package

This will resolve all dependencies, run the tests, and package an executable .jar.
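
During iterative development, the tests can be skipped with the standard Maven flag, i.e. mvn clean package -DskipTests; run the full build before actually submitting jobs.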

Running Spark Programs on Maxcompute

From the local development environment, Spark on MaxCompute provides two run modes: local mode and cluster mode.

Local Mode

In this mode, the Spark on MaxCompute client runs on the local machine but makes use of the Tunnel service to read and write data in MaxCompute. Take note of the local[N] part: N indicates the number of CPU cores the client will use.

To execute, run:

$SPARK_HOME/bin/spark-submit --master local[4] \
--class com.aliyun.odps.spark.examples.SparkPi \
${path to project directory}/target/maxcompute-spark-1.0-SNAPSHOT.jar
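
For reference, SparkPi is the classic Monte Carlo Pi estimator from the Spark examples. A minimal sketch of what such an entry point might look like is shown below; the actual class in this repo may differ in detail.

package com.aliyun.odps.spark.examples

import org.apache.spark.sql.SparkSession

import scala.math.random

object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkPi")
      .getOrCreate()

    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices

    // Sample points uniformly in the unit square and count how many fall
    // inside the unit circle; that fraction approximates Pi/4.
    val count = spark.sparkContext
      .parallelize(1 to n, slices)
      .map { _ =>
        val x = random * 2 - 1
        val y = random * 2 - 1
        if (x * x + y * y <= 1) 1 else 0
      }
      .reduce(_ + _)

    println(s"Pi is roughly ${4.0 * count / n}")
    spark.stop()
  }
}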

Cluster Mode

In cluster mode, the Spark program runs on the MaxCompute clusters. Note that this means resource files need to be uploaded to the MaxCompute clusters, so this mode may take longer to launch than local mode, depending on the internet connection. However, it reflects the actual environment the code will face in production.

To execute, run:

$SPARK_HOME/bin/spark-submit --master yarn-cluster \
--class com.aliyun.odps.spark.examples.SparkPi \
${path to project directory}/target/maxcompute-spark-1.0-SNAPSHOT.jar

Reference

  1. Spark on MaxCompute Overview
