Past sprints

PyconFR - Paris, 13th-14th Sep 2012

We are organizing a sprint before the PyconFR 2012 conference.

People & tasks

  • Nelle Varoquaux - Isotonic regression
  • Olivier Grisel (working with Gaël Varoquaux on joblib parallelism)
  • Alexandre Gramfort
  • Fabian Pedregosa (things I could work on: implement ranking algorithms (RankSVM, IntervalRank), help with the isotonic regression and group lasso pull requests)
  • Bertrand Thirion
  • Gaël Varoquaux (working with Olivier Grisel on joblib parallelism)
  • Alexandre Abraham
  • Virgile Fritsch
  • Nicolas Le Roux (providing machine learning expertise for RBM and DBN coding)

Location

La Villette, Paris, on the 13th & 14th of September, from 10:00 until 18:00. The sprint will take place in the 'Carrefour Numérique', floor -1 of the 'Cité des Sciences': http://www.pycon.fr/2012/venue/

Tasks

Top priorities are merging pull requests, fixing easyfix issues, and improving documentation consistency.

In addition to the tasks listed below, it is useful to consider any issue in this list: https://github.com/scikit-learn/scikit-learn/issues

Easy

  • Improve test coverage: run 'make test-coverage' after installing the coverage module, find low-hanging fruit to improve coverage, and add tests. Try to test the logic, and not simply aim to increase the number of lines covered. A sketch of the kind of test we mean follows this list.
  • Finish estimator summary PR: https://github.com/scikit-learn/scikit-learn/pull/804
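
For the test coverage item, here is a minimal sketch of the kind of test that exercises logic rather than just line counts. It assumes the current StandardScaler name; adapt it to whatever estimator you pick:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.preprocessing import StandardScaler


def test_standard_scaler_inverse_roundtrip():
    # Test the logic (scaling is actually undone), not just the line count.
    rng = np.random.RandomState(0)
    X = rng.randn(20, 3) * 5 + 2
    scaler = StandardScaler().fit(X)
    X_scaled = scaler.transform(X)
    # scaled data has zero mean and unit variance per feature
    assert_array_almost_equal(X_scaled.mean(axis=0), np.zeros(3))
    assert_array_almost_equal(X_scaled.std(axis=0), np.ones(3))
    # and the transform is invertible
    assert_array_almost_equal(scaler.inverse_transform(X_scaled), X)
```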

Branch merging

Improving and merging existing pull requests is the number one priority: https://github.com/scikit-learn/scikit-learn/pulls

There is a lot of very good code lying there; it often just needs a small amount of polishing.

Not requiring expertise in machine learning

  • Affinity propagation using sparse matrices: the affinity propagation algorithm (scikits.learn.cluster.affinity_propagation_) should be able to work on sparse input affinity matrices without converting them to dense. A good implementation should make this efficient on very large data.
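
For reference, here is a dense-input usage sketch of the current function; the task is to let S below be a scipy.sparse matrix without densifying it. The toy data and the negative-squared-distance affinity are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import affinity_propagation
from sklearn.metrics.pairwise import euclidean_distances

# Toy data: two well-separated blobs.
rng = np.random.RandomState(0)
X = np.r_[rng.randn(10, 2), rng.randn(10, 2) + 10]

# Affinity propagation takes a (dense) similarity matrix; the negative
# squared euclidean distance is the usual choice.
S = -euclidean_distances(X, squared=True)
cluster_centers_indices, labels = affinity_propagation(S)
print(len(cluster_centers_indices), "clusters found")

# The task: accept a scipy.sparse S here (missing entries meaning "no
# similarity") without densifying, so very large problems fit in memory.
```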

Machine learning tasks

  • Improve the documentation: you understand some aspects of machine learning, so you can help make the scikit rock without writing a line of code: http://scikit-learn.org/dev/developers/index.html#documentation. See also Documentation-related issues in the issue tracker.
  • Text feature extraction (refactoring / API simplification) + hashing vectorizer: Olivier Grisel
  • Nearest Neighbors Classification/Regression: allowing more flexible Bayesian priors (currently only a flat prior is used); implementing alternative distance metrics: Jake Vanderplas (see the sketch after this list)
  • Group Lasso: Continue with pull request https://github.com/scikit-learn/scikit-learn/pull/947. Participants: @fabianp
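
For the nearest-neighbors item, a minimal sketch of what alternative metric support looks like from the user side; the metric keyword shown here is an assumption about the target API rather than a description of what exists today:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# What "alternative distance metrics" would buy from the user side:
# pick a metric other than euclidean when querying neighbors.
iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
clf.fit(iris.data, iris.target)
print(clf.score(iris.data, iris.target))
```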

K-means improvements

Participants: @mblondel

  • Code clean up
  • Speed improvements: don't reallocate clusters, track clusters that didn't change, use the triangle inequality
  • L1 distance: use the L1 distance in the E step and the median (instead of the mean) in the M step
  • Fuzzy K-means: k-means with fuzzy cluster membership (not the same as GMM)
  • Move argmin and average operators to pairwise module (for L1/L2)
  • Support chunk size argument in argmin operator
  • Merge @ogrisel's branch
  • Add a score function (opposite of the kmeans objective)
  • Sparse matrices
  • fit_transform
  • more output options in transform (hard, soft, dense)
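
A minimal sketch of the L1 variant described above (k-medians); kmedians is a hypothetical helper, not an existing scikit-learn function:

```python
import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

def kmedians(X, n_clusters, n_iter=100, random_state=0):
    """Lloyd-style iterations with the L1 distance: the E step assigns each
    point to the nearest center in L1, the M step takes the coordinate-wise
    median of each cluster (the L1 analogue of the mean)."""
    rng = np.random.RandomState(random_state)
    centers = X[rng.permutation(len(X))[:n_clusters]].astype(float)
    for _ in range(n_iter):
        labels = manhattan_distances(X, centers).argmin(axis=1)   # E step
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):              # keep old center if a cluster empties
                new_centers[k] = np.median(members, axis=0)       # M step
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```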

Semisupervised learning

Participants: @larsmans

  • EM algorithm for Naive Bayes (there is a pull request lingering); a sketch of the idea follows this list
  • Fix utility code to handle partially labeled data sets
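
A minimal sketch of the EM idea in the Nigam et al. style, assuming dense count matrices; em_naive_bayes is a hypothetical helper, not the code from the lingering pull request:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X, y, X_unl, n_iter=10):
    """Semi-supervised Naive Bayes: fit on the labeled data, then alternate
    an E step (soft-label the unlabeled points) with an M step (refit with
    the soft labels encoded as sample weights)."""
    clf = MultinomialNB().fit(X, y)
    n_unl, classes = len(X_unl), clf.classes_
    # Each unlabeled point appears once per class, weighted by its posterior.
    X_all = np.vstack([X] + [X_unl] * len(classes))
    y_all = np.concatenate([y] + [np.full(n_unl, c, dtype=y.dtype)
                                  for c in classes])
    for _ in range(n_iter):
        resp = clf.predict_proba(X_unl)                           # E step
        w = np.concatenate([np.ones(len(y))] +
                           [resp[:, i] for i in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w)  # M step
    return clf
```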

More ambitious/long term tasks

  • Patch liblinear to support warm restarts + LogisticRegressionCV (see the sketch after this list).
    Comment (by Fabian): I tried this, take a look here: liblinear fork
  • Locality Sensitive Hashing, talk to Brian Holt
  • Fused Lasso
  • Group Lasso, talk to Alex Gramfort (by email), or Fabian Pedregosa
  • Manifold learning: improve MDS (talk to Nelle Varoquaux), t-SNE (talk to DWF)
  • Sparse matrix support in dictionary learning module
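
For the LogisticRegressionCV item, this is the workaround the feature would replace: a grid search that cold-starts liblinear for every C. With warm restarts, the solution for one C could seed the next, making the whole path much cheaper. The import path shown is the recent one; older versions use sklearn.grid_search:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Every C on the grid is currently fit from scratch; warm restarts in
# liblinear would let the whole regularization path share work.
iris = load_iris()
grid = GridSearchCV(LogisticRegression(),
                    param_grid={'C': np.logspace(-3, 3, 13)}, cv=5)
grid.fit(iris.data, iris.target)
print(grid.best_params_)
```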

EuroSciPy - Brussels, Mon Aug 27 - Tue Aug 28, 2012

People & tasks

  • Jaques Grobler
  • Fabian Pedregosa. I'll be working on improving test coverage and implementing Group Lasso. Also, I can introduce newcomers to the scikit-learn workflow.

Places

  • Université libre de Bruxelles

Contributors might find the coding guidelines useful.

Granada, 19th-21st Dec 2011

We are organizing a coding sprint after the NIPS 2011 conference.

People, tasks and funding

For this sprint, we are trying to gather funding for contributors to fly in. Please list your name and who is funding your trip.

Places

Contributors might find the coding guidelines useful.

Tasks

Top priorities are merging pull requests, fixing easyfix issues, and improving documentation consistency.

In addition to the tasks listed below, it is useful to consider any issue in this list: https://github.com/scikit-learn/scikit-learn/issues

  • Merge in randomized linear models (branch 'randomized_lasso' on GaelVaroquaux's GitHub; Gael Varoquaux and Alex Gramfort are working on this)

Easy

  • Improve test coverage: run 'make test-coverage' after installing the coverage module, find low-hanging fruit to improve coverage, and add tests. Try to test the logic, and not simply aim to increase the number of lines covered.
  • Py3k support: first test joblib on Python 3, then scikit-learn. Both generate sources that are Python 3 compatible, but these have not been tested.

Branch merging

Improving and merging existing pull requests is the number one priority: https://github.com/scikit-learn/scikit-learn/pulls

There is a lot of very good code lying there; it often just needs a small amount of polishing.

Not requiring expertise in machine learning

  • Rationalize images in documentation: we have 56 MB of images generated in the documentation (doc/_build/html/_images). First, we should save JPGs instead of PNGs: this shrinks the directory to 45 MB (not a huge gain, granted). Second, the same file is often saved several times. I need to understand what is going on and fix that.
  • Affinity propagation using sparse matrices: the affinity propagation algorithm (scikits.learn.cluster.affinity_propagation_) should be able to work on sparse input affinity matrices without converting them to dense. A good implementation should make this efficient on very large data.

Machine learning tasks

  • Improve the documentation: you understand some aspects of machine learning, so you can help make the scikit rock without writing a line of code: http://scikit-learn.org/dev/developers/index.html#documentation. See also Documentation-related issues in the issue tracker.
  • Text feature extraction (refactoring / API simplification) + hashing vectorizer: Olivier Grisel
  • Nearest Neighbors Classification/Regression: allowing more flexible Bayesian priors (currently only a flat prior is used); implementing alternative distance metrics: Jake Vanderplas

K-means improvements

Participants: @mblondel

  • Code clean up
  • Speed improvements: don't reallocate clusters, track clusters that didn't change, use the triangle inequality
  • L1 distance: use the L1 distance in the E step and the median (instead of the mean) in the M step
  • Fuzzy K-means: k-means with fuzzy cluster membership (not the same as GMM)
  • Move argmin and average operators to pairwise module (for L1/L2)
  • Support a chunk size argument in the argmin operator (see the sketch after this list)
  • Merge @ogrisel's branch
  • Add a score function (opposite of the kmeans objective)
  • Sparse matrices
  • fit_transform
  • more output options in transform (hard, soft, dense)
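
A minimal sketch of the chunked argmin from the list above; the helper below is a standalone illustration, not the scikit-learn implementation:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def chunked_argmin(X, Y, metric='euclidean', chunk_size=500):
    """Row-wise argmin of the X-to-Y distance matrix, computed chunk by
    chunk so the full len(X) x len(Y) matrix never sits in memory at once."""
    out = np.empty(len(X), dtype=np.intp)
    for start in range(0, len(X), chunk_size):
        stop = min(start + chunk_size, len(X))
        D = pairwise_distances(X[start:stop], Y, metric=metric)  # small block
        out[start:stop] = D.argmin(axis=1)
    return out
```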

Random projections

Participants: @mblondel

  • Merge random SVD PR
  • Merge sparse RP PR
  • Cython utils for fast and memory-efficient projection
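
For the last item, a minimal sketch of a memory-efficient sparse projection, assuming the Achlioptas/Li ±1 scheme and a scipy version that has scipy.sparse.random; sparse_random_matrix here is a hypothetical helper name:

```python
import numpy as np
import scipy.sparse as sp

def sparse_random_matrix(n_features, n_components, density=0.01, random_state=0):
    """Sparse random projection matrix: entries are 0 with probability
    1 - density and +/- 1/sqrt(density * n_components) otherwise, so that
    squared norms are preserved in expectation."""
    rng = np.random.RandomState(random_state)
    R = sp.random(n_features, n_components, density=density, random_state=rng,
                  data_rvs=lambda n: rng.choice([-1.0, 1.0], size=n))
    return R / np.sqrt(density * n_components)

# Project: X (n_samples, n_features) -> X * R (n_samples, n_components),
# never materializing anything dense.
X = sp.rand(100, 10000, density=0.001, random_state=0)
X_proj = X * sparse_random_matrix(10000, 500)
print(X_proj.shape)
```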

Kernel Approximation

Participants: @amueller

  • Move to random projection module

Dictionary Learning

Participants: @vene

  • Fix (document) alpha scaling
  • Merge SparseCoder pull request
  • Merge KMeansCoder pull request
  • Begin work on supervised image classification

Semisupervised learning

Participants: @larsmans

  • EM algorithm for Naive Bayes
  • Fix utility code to handle partially labeled data sets

More ambitious/long term tasks

  • Patch liblinear to support warm restarts + LogisticRegressionCV.
    Comment (by Fabian): I tried this, take a look here: liblinear fork
  • Decision Tree (support boosted trees, loss matrix, multivariate regression)
  • Ensemble classifiers
    Comment (by Gilles): I plan to review @pprett's PR on Gradient Boosted Trees. I also want to implement parallel tree construction and prediction in the current implementation of forests of trees.
  • Locality Sensitive Hashing, talk to Brian Holt
  • Fused Lasso
  • Group Lasso, talk to Alex Gramfort (by email)
  • Manifold learning: MDS, t-SNE (talk to DWF)
  • Bayesian classification (e.g. RVM)
  • Sparse matrix support in dictionary learning module

Accommodation

Some of us are planning to stay at a guest house in Granada to reduce hotel costs. If you are interested, add your name and arrival and departure dates below:

Name                  From     To
Olivier Grisel        Dec. 11  Dec. 21
Gael Varoquaux        Dec. 11  Dec. 21
David Warde-Farley    Dec. 18  Dec. 21
Alex Gramfort         Dec. 11  Dec. 21
Jake Vanderplas       Dec. 15  Dec. 22
Bertrand Thirion      Dec. 12  Dec. 20
Gilles Louppe         Dec. 18  Dec. 21
Mathieu Blondel       Dec. 18  Dec. 22
Lars Buitinck         Dec. 18  Dec. 22
Vlad Niculae          Dec. 18  Dec. 22
Andreas Mueller       Dec. 11  Dec. 22
Nicolás Della Penna   Dec. 18  Dec. 22
(add your name here)

23, 24 August 2011

We are organizing a coding sprint on the days before EuroSciPy 2011.

People and tasks

  • Olivier Grisel: review code (esp. related to Vlad's GSoC), doc improvements, maybe work on finalizing Power Iteration Clustering or the text feature extraction
  • Gael Varoquaux: merging pull requests
  • Vlad Niculae: merge remaining DictionaryLearning code, doc improvements, maybe work on SGD matrix fact. w/ someone?
  • Satra Ghosh: work on the ensemble/tree/random forest (only on the 24th)
  • Brian Holt: tree and random forest code, improve test coverage, doc improvements
  • Bertrand Thirion: reviewing GMM and related stuff or manifold learning (probably 24th only).
  • Ralf Gommers: work on joblib (only 24th, from ~12.00)
  • Vincent Michel: work on bi-clustering, doc improvements, code review.
  • Mathieu Blondel: multi-class reductions (only 24th, GMT+9)
  • Fabian Pedregosa: strong rules for coordinate descent, group lasso or related stuff, Py3k support.
  • Alexandre Gramfort: reviewing commits and sending negative comments to harass Fabian while he is away, because he kind of likes that
  • Jean Kossaifi
  • Virgile Fritsch (only 24th): working on issues (pairwise distances, incompatibility with scipy 0.8, ...) and pull requests merging.

Places

  • In Paris: at ENS, in the physics department (24 rue Lhomond), probably in some classrooms on the 3rd floor.

Scipy 2011 sprinting: July 15-16

Location: at the SciPy conference (Austin)

People and tasks

  • Gael Varoquaux: review code, merge
  • Marcel Caraciolo: review code, easyfix issues.
  • David Warde-Farley: review

1st April 2011

Places

People present

Please add skills/interests or planned tasks to facilitate the sprint organization and the pairing of people on tasks. To share knowledge as much as possible, it would be ideal to have pair programming, with two people of different skills on each task.

At Logilab, Paris (from 9:00 to 19:00):

  • Gaël Varoquaux: task: code review, pair programming on specific task where needed.
  • Julien Miotte
  • Feth Arezki: could help with coding (w/ the logger?), LaTeX. Interested in learning about scikit.
  • Nelle Varoquaux: task: minibatch k-means
  • Fabian Pedregosa
  • Vincent Michel: task: code review, pair programming. Features: Ward's clustering.
  • Luis Belmar-Letelier
  • Thouis Jones: task: BallTree cython wrapper, documentation, whatever.

At MIT, Boston:

  • Alexandre Gramfort: task: code review and pair programming
  • Demian Wassermann: task: Gaussian Processes with sparse data
  • Satra Ghosh: task: Ensemble Learning, random forests
  • Nico Pinto
  • Pietro Berkes

On IRC (from around 9am Brasília time, GMT-3):

  • Alexandre Passos: task: dirichlet process mixture of gaussian models (In progress)
  • Vlad Niculae: task: matrix factorization (In progress)
  • Marcel Caraciolo: task: help with docs and bug fixes (beginner in the project).

Paris coding Sprint, 8-9 Sept. 2010

Place:

INRIA research center in Saclay-Ile de France, also in channel #scikit-learn, on irc.freenode.org. Room to be determined.

Some ideas:

  • extend the tutorial with feature selection, cross-validation, etc.
  • design a sphinx template for the main web page: a tentative design is at http://www.flickr.com/photos/fseoane/4573612893/, but it was not translated into a sphinx template
  • Group lasso with coordinate descent in GLM module
  • Covariance estimators (Ledoit-Wolf) -> Regularized LDA
  • Add transform in LDA
  • PCA with fit + transform
  • preprocessing routines (center, standardize) with fit/transform (see the sketch after this list)
  • K-means with Pybrain heuristic
  • Make Pipeline object work for real
  • FastICA
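
Several of these items (PCA, preprocessing, Pipeline) hinge on the same fit/transform convention, so here is a minimal sketch of it; Standardizer is a hypothetical name, and non-constant features are assumed:

```python
import numpy as np

class Standardizer(object):
    """Minimal sketch of the fit/transform convention: fit() learns the
    centering/scaling parameters, transform() applies them, and
    fit_transform() chains the two, so the object drops into a Pipeline."""

    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self                      # fit returns self by convention

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

X = np.random.RandomState(0).randn(10, 3) * 4 + 1
Xs = Standardizer().fit_transform(X)
print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6))
```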

Anything you can think of, such as:

  • Spectral Clustering + manifold learning (MDS/PCA, Isomap, Diffusion maps, tSNE)
  • Canonical Correlation Analysis
  • Kernel PCA
  • Gaussian Process regression

0.4 Coding Sprint, 16 & 17 June 2010

Place:

channel #scikit-learn, on irc.freenode.org. If you do not have an IRC client or are behind a firewall, check out http://webchat.freenode.net/

Some ideas:

  • adapt the plotting features from the em module into the gmm module.
  • incorporate more datasets: the diabetes dataset from the lars R package, featured datasets from http://archive.ics.uci.edu/ml/datasets.html, etc.
  • anything from the issue tracker.
  • extend the tutorial with feature selection, cross-validation, etc.
  • profile and improve the performance of the gmm module.
  • submit some new classifier
  • refactor the ann module (artificial neural networks) to conform to the API in the rest of the modules, or submit a new ann module.
  • make it compatible with Python 3 (shouldn't be hard now that there's a numpy Python 3 release)
  • design a sphinx template for the main web page: a tentative design is at http://www.flickr.com/photos/fseoane/4573612893/, but it was not translated into a sphinx template
  • anything you can think of.

Documentation Week, 14-18 March 2010

Place:

channel #learn, on irc.freenode.org. If you do not have an IRC client or are behind a firewall, check out http://webchat.freenode.net/

Possible Tasks:

  • Document our design choices (methods in each class, convention for estimated parameters, etc.). Most of this is in ApiDiscussion.
  • Documentation for neural networks (nonexistent)
  • Examples. We currently only have a few of them. Expand and integrate them into the web page.
  • Write a Tutorial.
  • Write a FAQ.
  • Documentation and Examples for Support Vector Machines. What's on the web is totally outdated. Integrate the documentation from gumpy, see ticket:27 (assigned: Fabian Pedregosa)
  • Review documentation.
  • Customize the sphinx generated html.
  • Create some cool images/logos for the web page.
  • Create some benchmark plots.

Code sprint in Paris, 3 March 2010

Terminated; see http://fseoane.net/blog/2010/scikitslearn-coding-spring-in-paris/

Participants

  • Alexandre Gramfort
  • Olivier Grisel
  • Vincent Michel
  • Fabian Pedregosa
  • Bertrand Thirion
  • Gaël Varoquaux

Goals

Implement a few targeted functionalities for penalized regressions.

Target functionalities

  1. GLMnet (see the coordinate-descent sketch below)
  2. Bayesian Regression (Ridge, ARD)
  3. Univariate feature selection function

Edouard: Most of the things we need are already in datamind; the main issue is to cut the dependency on FFF (nipy)
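
For the GLMnet target, a minimal sketch of the coordinate-descent update at its core, for the plain lasso case; lasso_cd is a hypothetical helper and nonzero feature columns are assumed:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate descent for min_w 1/(2n) ||y - Xw||^2 + alpha ||w||_1,
    the update at the heart of GLMnet."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y - X.dot(w)              # equals y at the start
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_features):
            # partial residual: add back the contribution of coordinate j
            residual += X[:, j] * w[j]
            rho = X[:, j].dot(residual) / n_samples
            w[j] = soft_threshold(rho, alpha) / (col_sq[j] / n_samples)
            residual -= X[:, j] * w[j]
    return w
```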

Extras, if time permits:

  1. LARS

Proposed workflow

Pair programming:

  1. GLMNet (AG, OG)
  2. Bayesian regression (FP, VM)
  3. Feature selection (BT, GV)
  4. LARS: Whoever is finished first.

Place in the repository

  1. I think GLMNet goes well in scikits.learn.glm.

Edouard: The GLM term is confusing: in GLMnet the "G" means "generalized", but in neuroimaging people read GLM as the "general" linear model, which is in fact a linear model

  2. Bayesian regression: scikits.learn.bayes. It's short and explicit.

Edouard: Again, the term Bayes might not lead to a clear organization of algorithms.

  3. Feature selection: featsel? selection? I'm not sure about this one.

AG: maybe univ?

Edouard: Maybe it is too early to decide the structure of the repository during your coding sprint. I think this organization should follow the discussions we had with Fabian, Gael and Bertrand. Below I have tried to synthesize those discussions; however, it's just a proposition and many things are missing:

If there's code that we want to share and it does not fit into any of these schemes, it's ok to put it into sandbox/ (it does not yet exist)
