Kaggle's StumbleUpon Challenge on Dataiku!

Brief Descriptions

StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests. While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as "ephemeral" or "evergreen". The ratings we get from our community give us strong signals that a page may no longer be relevant - but what if we could make this distinction ahead of time? A high quality prediction of "ephemeral" or "evergreen" would greatly improve a recommendation system like ours.

Many people know evergreen content when they see it, but can an algorithm make the same determination without human intuition? Your mission is to build a classifier which will evaluate a large set of URLs and label them as either evergreen or ephemeral. Can you out-class(ify) StumbleUpon? As an added incentive to the prize, a strong performance in this competition may lead to a career-launching internship at one of the best places to work in San Francisco.

Installation

On this repository, you may find my personal projects related to Machine Learning, EDA, Python Jupyter Notebook and couple of Visualization based on the Dataiku Platform exported standard files. Most of the datasets I've been working with, downloaded from Kaggle. Installation pretty straight forward. Simply download the whole set as a single project and as a ZIP files, everything have been flattened out with plain text files, and no SQL dump was involved, so there wouldn't be any missing system dependencies issue. Simply imported the downloaded Zip file to your working project.

Data Dictionary

There are two components to the data provided for this challenge:

The first component is two files: train.tsv and test.tsv. Each is a tab-delimited text file containing the fields outlined below for 10,566 urls total. Fields for which no data is available are indicated with a question mark.

train.tsv is the training set and contains 7,395 urls. Binary evergreen labels (either evergreen (1) or non-evergreen (0)) are provided for this set.
test.tsv is the test/evaluation set and contains 3,171 urls.

The second component is raw_content.zip, a zip file containing the raw content for each url, as seen by StumbleUpon's crawler. Each url's raw content is stored in a tab-delimited text file, named with the urlid as indicated in train.tsv and test.tsv.

The following table includes field descriptions for train.tsv and test.tsv: By default, the initial dataset coming from [Kaggle's](https://www.kaggle.com/c/stumbleupon/data challenge page would give you the above dataset features at hand. But we'll try to optimize them to something much more Machine Learning friendly looking dataset.

Jupyter Notebooks

Disclaimer

And please remember, as this is only a weekend pet project, which I'm doing them for my personal interest only.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
analysis-data/tOmNkvg5/7BkKoxNz		analysis-data/tOmNkvg5/7BkKoxNz
discussions		discussions
images		images
ipython_notebooks		ipython_notebooks
project_config		project_config
saved_models/dsSd13Tb/versions/initial		saved_models/dsSd13Tb/versions/initial
timelines		timelines
README.md		README.md
_config.yml		_config.yml
export-manifest.json		export-manifest.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

analysis-data/tOmNkvg5/7BkKoxNz

analysis-data/tOmNkvg5/7BkKoxNz

discussions

discussions

images

images

ipython_notebooks

ipython_notebooks

project_config

project_config

saved_models/dsSd13Tb/versions/initial

saved_models/dsSd13Tb/versions/initial

timelines

timelines

README.md

README.md

_config.yml

_config.yml

export-manifest.json

export-manifest.json

Repository files navigation

Kaggle's StumbleUpon Challenge on Dataiku!

Brief Descriptions

Installation

Data Dictionary

Jupyter Notebooks

Disclaimer

About

Releases

Packages

Languages

leonism/Dataiku-StumbleUpon

Folders and files

Latest commit

History

Repository files navigation

Kaggle's StumbleUpon Challenge on Dataiku!

Brief Descriptions

Installation

Data Dictionary

Jupyter Notebooks

Disclaimer

About

Topics

Resources

Stars

Watchers

Forks

Languages