GSoC 2016 project proposal: Adding fused types to Cython files


# Scikit-Learn

Student Information

About me

My name is Devashish Deshpande. I'm a third-year undergraduate at BITS Pilani, Goa campus, currently pursuing a BE (Hons.) in computer science. I am very interested in machine learning and data science and have completed several projects and courses in these areas, including a course on data mining and an online course on Big Data and Machine Learning offered by BerkeleyX. I am currently taking a course on Information Retrieval (where we use sklearn extensively) as well as the popular MOOC offered by Andrew Ng. Apart from these, I have done projects on signal analysis and natural language processing, which can be found on my GitHub profile. All of these projects involved working with Python, C, and C++. I have also done development work in Java and C# to build applications for different organisations.

Experience with scikit-learn

I have been contributing to this organisation since December 2015 and have contributed a few patches and bug fixes. I have also reviewed PRs from fellow contributors whenever possible. My contributions so far:

  1. https://github.com/scikit-learn/scikit-learn/pull/6221 - (merged) - LabelBinarizer single label case now works for sparse and dense case.
  2. https://github.com/scikit-learn/scikit-learn/pull/6176 - [MRG] - Added rebasing and squashing section to contributing.rst.
  3. https://github.com/scikit-learn/scikit-learn/pull/6404 - (merged) - Added memory workaround to DBSCAN doc.
  4. https://github.com/scikit-learn/scikit-learn/pull/6229 - (merged) - LabelEncoder now works for 0-D arrays.
  5. https://github.com/scikit-learn/scikit-learn/pull/6213 - (merged) - Added link to profile in whats_new.rst.
  6. https://github.com/scikit-learn/scikit-learn/pull/6182 - (merged) - SKF raises error if all n_labels for individual classes < n_folds.
  7. https://github.com/scikit-learn/scikit-learn/pull/6173 - (merged) - FeatureHasher now accepts string values.
  8. https://github.com/scikit-learn/scikit-learn/pull/6153 - (merged) - Doc: Fix redundant doc for f_regression.
  9. https://github.com/scikit-learn/scikit-learn/pull/6139 - (merged) - Fixed doc in lfw.py.
  10. https://github.com/scikit-learn/scikit-learn/pull/6135 - (merged and backported to 0.17.X by @jakirkham) - FIX #5608 in randomized_svd flip sign.
  11. https://github.com/scikit-learn/scikit-learn/pull/6134 - (merged) - FIX #6132 broken link.
  12. https://github.com/scikit-learn/scikit-learn/pull/6041 - (merged) - Fix in plot_out_of_core_classification.py.

Project Info

  • Proposal title: Adding fused types to Cython files across scikit-learn.

  • Proposal Abstract: Currently, in many places, even when the user supplies np.float32 data, it is internally converted to np.float64; the float64 copy alone takes twice the memory of the original data, and the user is never told about it. This extra memory usage has even caused MemoryError when large datasets were used. The problem can be tackled by using Cython fused types internally, so that float32 input is processed in float32 throughout.

  • Motivation: I have worked with Cython files in a previous PR and thus have a decent knowledge of Cython. I feel this project can substantially improve the memory footprint of many modules. I see it as an opportunity to contribute something bigger to scikit-learn, an organisation I intend to keep contributing to in the future. The project requires analysing and working on many different parts of the codebase, which will make me much more comfortable with it while also improving my programming skills.

  • Theory: A fused type is a single type definition that stands for several concrete types, so one function can accept and operate on multiple data types instead of handling each separately. Rather than writing a separate code path for np.float32 values (or upcasting them), Cython generates a specialization of the function for each constituent type, so np.float32 and np.float64 values are both handled without any conversion to np.float64. The same applies to int32 and int64 data. Fused types thus allow generic programming, similar to C++ templates or Java and C# generics. A minimal sketch follows.
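
To make this concrete, here is a minimal sketch of a fused type in Cython; the names floating and fused_mean are illustrative, not taken from the scikit-learn codebase:

```cython
# One definition that stands for both C float (np.float32)
# and C double (np.float64).
ctypedef fused floating:
    float
    double

def fused_mean(floating[:] data):
    # Cython compiles one specialization per constituent type;
    # the input is read in its own precision, never upcast.
    cdef double total = 0
    cdef Py_ssize_t i
    for i in range(data.shape[0]):
        total += data[i]
    return total / data.shape[0]
```

Calling fused_mean with a float32 array dispatches to the float specialization, and with a float64 array to the double one; no converted copy of the input is ever made.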

  • Affected modules:

    1. Cluster: This is where fused types could help a lot. Clustering is used for many varied purposes, such as pattern recognition and image analysis, and can involve very large datasets. The whole dataset is needed at once to fit a clustering model, so to avoid memory errors on large datasets, adding fused types here is essential. Hierarchical and, especially, K-Means clustering are widely used.
      • _hierarchical.pyx
      • _k_means.pyx
    2. Decomposition: With over 500 citations, online learning for Latent Dirichlet Allocation is a popular topic-modelling technique. Functions such as those that calculate the mean difference and the Dirichlet expectation for a single sample still convert arrays to np.float64_t data. Fused types could be helpful here.
      • _online_lda.pyx
    3. Ensemble [DOUBT HERE: float64 is used only while storing the output — is float32 really needed?]: Gradient boosting is one of the most widely used techniques in machine learning. Functions such as those that predict the output of a regression tree, or that add an estimator's predictions to out, store the output as float64 internally. This could be optimized by using fused types.
      • _gradient_boosting.pyx
    4. Linear model: This module contains the code for coordinate descent, stochastic gradient descent and stochastic average gradient, which minimize the loss functions of very popular algorithms such as linear regression and logistic regression. The coordinate descent code for elastic net (sparse/dense and single/multi-task) converts all inputs to float64 arrays internally. Since this is one of the most widely used modules, adding fused types to it is a high-priority task. See issue 5464 and issue 5776.
      • cd_fast.pyx
      • sag_fast.pyx
      • sgd_fast.pyx
    5. Neighbors: The dtypes defined in typedefs.pxd and typedefs.pyx are imported into several .pyx files. The binary tree base class is defined in binary_tree.pxi, which the ball_tree and kd_tree files extend. This is also a high-priority module because of the widely used distance metrics it computes.
      • typedefs.pxd, typedefs.pyx
      • binary_tree.pxi
      • dist_metrics.pyx
      • ball_tree.pyx
      • kd_tree.pyx
    6. SVM: The libsvm wrappers for the dense and sparse cases use np.float64_t dtypes internally, and the liblinear wrapper suffers from the same problem. Fused types could help here.
      • libsvm.pyx, libsvm_sparse.pyx
      • liblinear.pyx
    7. Utils: sparsefuncs_fast.pyx provides functions for many operations on sparse matrices. Fused types can be added to it before GSoC starts, as a way to get more familiar with Cython fused types (see the sketch after this list).
      • sparsefuncs_fast.pyx
    8. Fused type support can also be added to _isotonic.pyx before GSoC.
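
As a concrete illustration of the intended pattern, here is a hypothetical sketch of a fused-type row-sum over CSR data, in the spirit of sparsefuncs_fast.pyx (the function name and signature are my own, not the real ones from that file):

```cython
import numpy as np

ctypedef fused floating:
    float
    double

def csr_row_sums(floating[:] data, int[:] indptr):
    cdef Py_ssize_t n_rows = indptr.shape[0] - 1
    # Allocate the output in the input's precision: no silent upcast
    # to float64. `floating is float` is resolved at compile time,
    # once per specialization.
    if floating is float:
        out = np.zeros(n_rows, dtype=np.float32)
    else:
        out = np.zeros(n_rows, dtype=np.float64)
    cdef floating[:] out_view = out
    cdef Py_ssize_t i, j
    for i in range(n_rows):
        for j in range(indptr[i], indptr[i + 1]):
            out_view[i] += data[j]
    return out
```

For a scipy.sparse.csr_matrix X with float32 data, csr_row_sums(X.data, X.indptr) would return a float32 array (assuming the index arrays are int32, which SciPy uses unless the matrix is very large).
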
  • Project Timeline: Modules will be worked on in decreasing order of priority. Please note that this timeline is tentative: some work might finish earlier than expected, or a bit later if other issues come up along the way. In any case, I will try hard to stick to the deadlines. The timeline is as follows:

    • Today - 21st April: Continue contributing. Start work on issue #6524.
    • Community Bonding Period (22nd April - 22nd May): I will try to start coding in this period itself. Working on small files first will help me get comfortable with how Cython fused types are used with numpy arrays and how numpy arrays are handled in Cython. I will start work on:
      • sparsefuncs_fast.pyx, _isotonic.pyx:
        1. Start by introducing fused types into the main file.
        2. Write extensive tests to demonstrate dtype consistency (a sketch of such a test follows this item).
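
The test files are plain Python, so a generic dtype-consistency check could look something like the sketch below; the helper name check_dtype_consistency and the tolerance are my own choices, not existing scikit-learn code:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

def check_dtype_consistency(func, X64):
    # The function should return results in the dtype it was given,
    # and the two precisions should agree to float32 accuracy.
    X32 = X64.astype(np.float32)
    out64 = func(X64)
    out32 = func(X32)
    assert out64.dtype == np.float64
    assert out32.dtype == np.float32
    assert_array_almost_equal(out32, out64, decimal=4)
```
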
    • Week 1-3 (23rd May - 12th June): I will start with the linear_model module, since adding fused types to it is a high-priority task. I will add fused types and tests for the following files in this order:
      • cd_fast.pyx. Tests added to test_coordinate_descent.py, test_sparse_coordinate_descent.py.
      • sgd_fast.pyx. Tests added to test_sgd.py.
      • sag_fast.pyx. Tests added to test_sag.py.
    • Week 4, 5 (13th June - 26th June): I will start work on the neighbors module. Note that changes will have to be made to typedefs.pyx and typedefs.pxd as well, to keep the code consistent. Fused types and tests will be added in the following order:
      • Fused types added to typedefs.pyx and typedefs.pxd.
      • Since the above files are used as literal includes to eliminate code duplication, changes might have to be made to binary_tree.pxi, dist_metrics.pyx, ball_tree.pyx and kd_tree.pyx to keep the code consistent.
      • Tests will be added to the respective test files to check dtype consistency.
    • 27th June: Mid-term evaluation.
    • Week 6 (27th June - 3rd July): Work will be continued on the neighbors module.
    • Week 7, 8 (4th July - 17th July): I will pick up the cluster module at this stage. _hierarchical.pyx and _k_means.pyx will benefit the most from fused types in this module.
      • Changes made to _hierarchical.pyx. Tests added to test_hierarchical.py.
      • Changes made to _k_means.pyx. Tests added to test_k_means.py.
    • Week 9-11 (18th July - 7th August): Work will begin on adding fused types to svm module.
      • Fused types added to liblinear.pyx.
      • Fused types added to libsvm.pyx, libsvm_sparse.pyx.
      • Test suite added to test_sparse.py, test_svm.py.
    • Week 12 (8th August - 15th August): I will start work on the decomposition and ensemble modules at this stage.
      • Fused types added to _online_lda.pyx. Tests added to test_online_lda.py.
      • [If ensemble is not to be worked on, linear_model and neighbors could be extended a bit.] Changes made to _gradient_boosting.pyx. Tests added to test_gradient_boosting.py.
    • Week 13 (16th August - 23rd August): Improve documentation and tests. Check the final code submission. Add examples and maybe demonstrations.
    • Week 14 - np.inf: Continue contributing.
  • Other commitments: I have exams from 1st May - 12th May. In this period my contribution rate might drop, but I will pick up as soon as my exams are over. Other than that, I have no commitments and will be able to put in at least the required minimum of 40 hours a week. If for some reason my code does not get merged into master, I will continue working beyond the summer to ensure that it becomes mergeable.
