interviewstreet/spark-stratifier


spark-stratifier

Start with Why

When we first started working with Spark at HackerRank, we realized that the outcome sets within our dataset varied considerably in size. This led to inconsistent model cross validation and training. With stratified sampling, we were able to eliminate these inconsistencies and improve overall model predictions. The goal of spark-stratifier is to provide a tool to stratify datasets for cross validation in PySpark. Its StratifiedCrossValidator class extends Spark's existing CrossValidator class.

Currently, the stratified cross validator works with binary classification problems using labels 0 and 1.
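
To illustrate what stratifying folds means for 0/1 labels, here is a minimal pure-Python sketch. It is not the library's implementation (which operates on Spark DataFrames); the function name stratified_folds and its parameters are hypothetical and exist only for this example. It groups row indices by label and deals each group evenly across the folds, so every fold keeps roughly the same class ratio as the full dataset.

```python
import random
from collections import defaultdict


def stratified_folds(labels, num_folds, seed=0):
    """Assign row indices to folds so each fold keeps a similar label ratio.

    Hypothetical helper for illustration only; not part of spark-stratifier.
    """
    rng = random.Random(seed)

    # Group row indices by their label (0 or 1).
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(num_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets an even share of this class.
        for position, idx in enumerate(indices):
            folds[position % num_folds].append(idx)
    return folds


# An imbalanced dataset: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, num_folds=5)

# Count the positives that landed in each fold.
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)  # [2, 2, 2, 2, 2]
```

With a plain random split, a fold could end up with very few or no positives, which is the kind of inconsistency stratification avoids.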

Read more at engineering.hackerrank.com

Requirements

This tool is pure Python; its only requirements are numpy and pyspark.

Installation

$ pip install spark-stratifier

Example

Use it exactly as you would Spark's CrossValidator, except that your data will be stratified.

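from spark_stratifier import StratifiedCrossValidator

# pipeline, paramGrid, and evaluator are assumed to be defined already,
# the same way you would define them for Spark's CrossValidator.
scv = StratifiedCrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=8
)

# matrix is assumed to be your training DataFrame with binary labels (0 and 1).
model = scv.fit(matrix)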

Contributing

If you want to write some code and contribute to this project, go ahead and open a pull request. We hope this tool is useful for the community, and we'd love to hear how it helps solve your problems!
