cloudml-samples/reddit_tft at master · eran-levy/cloudml-samples

Reddit Sample

Multiple years' worth of Reddit Comments are publicly available in Google Cloud BigQuery. We will use a subset of the data and some SQL manipulation to create training data for predicting the score of a Reddit thread.

The Reddit sample demonstrates the capability of both linear and deep models on a Reddit Dataset.

Prerequisites

Follow the Google Cloud ML setup instructions before trying the sample. More information is available in the Cloud ML documentation.
Make sure your Google Cloud project has sufficient quota.

Install Dependencies

Install dependencies by running pip install -r requirements.txt

Sample Overview

This sample consists of two parts:

Data Pre-Processing

The data pre-processing step reads data from Google Cloud BigQuery and converts it to TFRecords format.

Model Training

The model training step takes the pre-processed TFRecords data and trains either a linear classifier using the Stochastic Dual Coordinate Ascent (SDCA) optimizer or a deep neural network classifier.

Data Format

The dataset above is available in BigQuery and needs to be converted to TFRecords format for the sample code to work. Run the data through the pre-processing step before you proceed to training.

Pre-Processing Step

The pre-processing step can be run either locally or on the cloud, depending on the size of the input data.

First, separate your input into training and evaluation sets. We can train on one month's worth of data (December 2015, approximately 20GB) and then evaluate on a month's worth of "future" data (January 2016). Finally, we can issue predictions for data even "further in the future" (February 2016).

We use the appropriate table names as the input flags --training_data, --eval_data and --predict_data respectively.
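As a rough illustration of the time-based split described above, the three table names can be derived from the first month. This is only a sketch: the helper `monthly_tables` is hypothetical and not part of the sample, and it assumes the `fh-bigquery.reddit_comments.YYYY_MM` naming that appears in the commands in this README.

```python
def monthly_tables(year, month, count=3, prefix="fh-bigquery.reddit_comments"):
    """Return `count` consecutive monthly table names starting at year/month.

    Hypothetical helper: assumes tables are named <prefix>.YYYY_MM.
    """
    tables = []
    for _ in range(count):
        tables.append("{}.{}_{:02d}".format(prefix, year, month))
        # Advance one month, rolling over into the next year after December.
        month += 1
        if month > 12:
            month = 1
            year += 1
    return tables


# Train on December 2015, evaluate on January 2016, predict on February 2016.
training_data, eval_data, predict_data = monthly_tables(2015, 12)
```

The three values correspond to the --training_data, --eval_data and --predict_data flags, in that order.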

Cloud Run

To run pre-processing on the cloud, run the commands below.

PROJECT=$(gcloud config list project --format "value(core.project)")
BUCKET="gs://${PROJECT}-ml"

GCS_PATH="${BUCKET}/${USER}/reddit_comments"

PREPROCESS_OUTPUT="${GCS_PATH}/reddit_$(date +%Y%m%d_%H%M%S)"
python preprocess.py --training_data fh-bigquery.reddit_comments.2015_12 \
                     --eval_data fh-bigquery.reddit_comments.2016_01 \
                     --predict_data fh-bigquery.reddit_comments.2016_02 \
                     --output_dir "${PREPROCESS_OUTPUT}" \
                     --project_id "${PROJECT}" \
                     --cloud
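The shell variables above build a timestamped output directory under a per-user path in the bucket. If you drive the sample from a Python script instead, the same path layout could be sketched as follows. The function name `preprocess_output_dir` and the example project and user values are made up for illustration.

```python
from datetime import datetime


def preprocess_output_dir(project, user, now=None):
    """Mirror the shell variables: BUCKET -> GCS_PATH -> PREPROCESS_OUTPUT."""
    now = now or datetime.now()
    bucket = "gs://{}-ml".format(project)
    gcs_path = "{}/{}/reddit_comments".format(bucket, user)
    # Same timestamp layout as `date +%Y%m%d_%H%M%S` in the shell commands.
    return "{}/reddit_{}".format(gcs_path, now.strftime("%Y%m%d_%H%M%S"))


# Example with made-up project and user names and a fixed timestamp.
example = preprocess_output_dir("my-project", "alice", datetime(2017, 3, 1, 9, 30, 0))
```

Because the timestamp is part of the path, each pre-processing run writes to a fresh directory, and the training commands need the exact value used for that run.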

Models

The sample implements a linear model trained with SDCA, as well as a deep neural network model. The code can be run either locally or on the cloud.

Cloud Run

Help options

  python -m trainer.task -h

Train

To train the linear model (with crosses):

JOB_ID="reddit_comments_linear_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training "$JOB_ID" \
  --stream-logs \
  --module-name trainer.task \
  --package-path trainer \
  --staging-bucket "$BUCKET" \
  --region us-central1 \
  --config config-small.yaml \
  -- \
  --model_type linear \
  --l2_regularization 3000 \
  --eval_steps 1000 \
  --output_path "${GCS_PATH}/model/${JOB_ID}" \
  --raw_metadata_path "${PREPROCESS_OUTPUT}/raw_metadata" \
  --transformed_metadata_path "${PREPROCESS_OUTPUT}/transformed_metadata" \
  --transform_savedmodel "${PREPROCESS_OUTPUT}/transform_fn" \
  --eval_data_paths "${PREPROCESS_OUTPUT}/features_eval*" \
  --train_data_paths "${PREPROCESS_OUTPUT}/features_train*"
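The --train_data_paths and --eval_data_paths values end in a wildcard, so each one selects every file whose name starts with the given prefix. The sketch below shows that matching behaviour with Python's `fnmatch` on a local list of names. The shard names are hypothetical; the actual file names depend on what preprocess.py writes.

```python
import fnmatch

# Hypothetical shard names; the real names depend on what preprocess.py writes.
files = [
    "features_train-00000-of-00002.tfrecord.gz",
    "features_train-00001-of-00002.tfrecord.gz",
    "features_eval-00000-of-00001.tfrecord.gz",
    "transformed_metadata",
]

# Each pattern keeps only the names that start with its prefix.
train_files = fnmatch.filter(files, "features_train*")
eval_files = fnmatch.filter(files, "features_eval*")
```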

To train the deep model:

JOB_ID="reddit_comments_deep_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training "$JOB_ID" \
  --stream-logs \
  --module-name trainer.task \
  --package-path trainer \
  --staging-bucket "$BUCKET" \
  --region us-central1 \
  --config config-small.yaml \
  -- \
  --model_type deep \
  --hidden_units 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 \
  --batch_size 512 \
  --eval_steps 250 \
  --output_path "${GCS_PATH}/model/${JOB_ID}" \
  --raw_metadata_path "${PREPROCESS_OUTPUT}/raw_metadata" \
  --transformed_metadata_path "${PREPROCESS_OUTPUT}/transformed_metadata" \
  --transform_savedmodel "${PREPROCESS_OUTPUT}/transform_fn" \
  --eval_data_paths "${PREPROCESS_OUTPUT}/features_eval*" \
  --train_data_paths "${PREPROCESS_OUTPUT}/features_train*"
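Everything after the bare `--` in the commands above is passed to the trainer module rather than to gcloud. The sketch below shows one way such flags could be parsed with `argparse`. It is not the actual trainer.task parser: the flag names are taken from the commands above, while the types and the empty default for --hidden_units are assumptions.

```python
import argparse


def build_parser():
    """Illustrative subset of the trainer flags; not the sample's real parser."""
    parser = argparse.ArgumentParser(description="Illustrative trainer flags")
    parser.add_argument("--model_type", choices=["linear", "deep"], required=True)
    # nargs="+" collects the space-separated layer sizes into a list of ints.
    parser.add_argument("--hidden_units", nargs="+", type=int, default=[])
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--eval_steps", type=int)
    parser.add_argument("--l2_regularization", type=float)
    return parser


# Parse the deep-model flags from the command above: eleven layers of 1062 units.
args = build_parser().parse_args(
    ["--model_type", "deep", "--hidden_units"] + ["1062"] * 11
    + ["--batch_size", "512", "--eval_steps", "250"]
)
```

With this layout, the number of values given to --hidden_units sets the number of hidden layers, and each value sets that layer's width.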
