chkoar edited this page Jul 5, 2016 · 4 revisions

Under-sampling methods.

UnderSampler

UnderSampler is an object that under-samples the majority class(es) at random with replacement.

Parameters:

  • ratio : Controls the number of new samples to draw. The number of new samples is given by int(ratio * num_minority_samples).
  • random_state : Seed for random number generation.
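The effect of ratio can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function name random_under_sample is hypothetical, and it assumes that every non-minority class is drawn down to int(ratio * num_minority_samples) samples, with replacement, while the minority class is kept as-is.

```python
import random
from collections import Counter

def random_under_sample(X, y, ratio=1.0, random_state=None):
    """Hypothetical sketch: keep the minority class, redraw every other class."""
    rng = random.Random(random_state)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    # Number of samples to draw from each non-minority class.
    n_draw = int(ratio * counts[minority])

    X_out = [x for x, label in zip(X, y) if label == minority]
    y_out = [minority] * counts[minority]
    for label in counts:
        if label == minority:
            continue
        pool = [x for x, l in zip(X, y) if l == label]
        # random.choices draws with replacement, so duplicates are possible.
        X_out.extend(rng.choices(pool, k=n_draw))
        y_out.extend([label] * n_draw)
    return X_out, y_out

X = [[i] for i in range(12)]
y = [0] * 9 + [1] * 3
X_res, y_res = random_under_sample(X, y, ratio=2.0, random_state=0)
```

With 3 minority samples and ratio=2.0, the majority class is reduced from 9 samples to int(2.0 * 3) = 6.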

Methods:

  • fit : Find the target statistics to determine the minority class, and the number of samples in each class.
  • sample : Returns the resampled version of the original data set (X, y) passed to fit.
  • fit_sample : Automatically performs both fit and sample.
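The three methods form a simple call pattern that can be sketched with a stand-in class. SketchUnderSampler is hypothetical and only mirrors the interface described above; the trailing-underscore attribute names and fit returning self are assumptions, not confirmed details of the library.

```python
import random
from collections import Counter

class SketchUnderSampler:
    """Hypothetical stand-in showing the fit / sample / fit_sample pattern."""

    def __init__(self, ratio=1.0, random_state=None):
        self.ratio = ratio
        self.random_state = random_state

    def fit(self, X, y):
        # Record the target statistics: per-class counts and the minority class.
        self.X_, self.y_ = list(X), list(y)
        self.stats_ = Counter(self.y_)
        self.min_class_ = min(self.stats_, key=self.stats_.get)
        return self  # assumption: returning self allows chaining

    def sample(self):
        # Resample the data set that was passed to fit.
        rng = random.Random(self.random_state)
        n_draw = int(self.ratio * self.stats_[self.min_class_])
        X_out, y_out = [], []
        for label in self.stats_:
            pool = [x for x, l in zip(self.X_, self.y_) if l == label]
            if label != self.min_class_:
                pool = rng.choices(pool, k=n_draw)  # with replacement
            X_out.extend(pool)
            y_out.extend([label] * len(pool))
        return X_out, y_out

    def fit_sample(self, X, y):
        # Convenience wrapper: fit, then sample.
        return self.fit(X, y).sample()

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
sampler = SketchUnderSampler(ratio=1.0, random_state=42)
X_a, y_a = sampler.fit(X, y).sample()
X_b, y_b = SketchUnderSampler(ratio=1.0, random_state=42).fit_sample(X, y)
```

Calling fit followed by sample gives the same result as a single fit_sample call when the seed is the same.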

TomekLinks

TomekLinks is an object that identifies all Tomek links between the majority and minority classes and eliminates the link element that belongs to the majority class.
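A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below illustrates the idea with a brute-force search in plain Python; the function names are hypothetical, Euclidean distance is an assumption, and this is not the library's implementation.

```python
from collections import Counter

def _sq_dist(a, b):
    # Squared Euclidean distance (assumed metric for this sketch).
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_neighbour(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: _sq_dist(X[i], X[j]))

def remove_tomek_links(X, y):
    """Drop the majority-class member of every majority/minority Tomek link."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    nn = [nearest_neighbour(i, X) for i in range(len(X))]
    drop = set()
    for i, j in enumerate(nn):
        # Mutual nearest neighbours, one minority and one non-minority sample.
        if nn[j] == i and y[j] == minority and y[i] != minority:
            drop.add(i)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[0.0, 0.0], [1.0, 0.0], [2.6, 0.0], [4.0, 0.0], [4.3, 0.0], [8.0, 0.0]]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = remove_tomek_links(X, y)
```

Here the samples at 4.0 (majority) and 4.3 (minority) are mutual nearest neighbours, so only the majority sample at 4.0 is removed.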

Parameters:

Methods:

  • fit : Find the target statistics to determine the minority class, and the number of samples in each class.
  • sample : Returns the resampled version of the original data set (X, y) passed to fit.
  • fit_sample : Automatically performs both fit and sample.

ClusterCentroids

ClusterCentroids is an object that under-samples the majority class by replacing clusters of samples with the cluster centroids found by a KMeans algorithm.

(Experimental) A KMeans algorithm is fitted to the data, with the number of clusters N determined by the level of under-sampling. The majority samples are then completely replaced by the set of cluster centroids from KMeans.

Parameters:

  • kargs : Dictionary of parameters to pass to the scikit-learn KMeans object.
  • ratio : Controls the number of new samples to draw. The number of new samples is given by int(ratio * num_minority_samples).
  • random_state : Seed for random number generation.
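The centroid replacement can be sketched as follows. To stay self-contained, this example swaps in a small hand-written Lloyd-style k-means instead of the scikit-learn KMeans object, so it does not accept kargs. All function names are hypothetical, and it assumes N = int(ratio * num_minority_samples) is no larger than the number of majority samples.

```python
import random
from collections import Counter

def _sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def _kmeans_centroids(points, k, rng, n_iter=20):
    """Tiny Lloyd-style k-means; a stand-in for scikit-learn's KMeans."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def cluster_centroid_under_sample(X, y, ratio=1.0, random_state=None):
    """Hypothetical sketch: replace each non-minority class by N centroids."""
    rng = random.Random(random_state)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_clusters = int(ratio * counts[minority])  # N in the description above
    X_out = [x for x, label in zip(X, y) if label == minority]
    y_out = [minority] * counts[minority]
    for label in counts:
        if label == minority:
            continue
        pool = [x for x, l in zip(X, y) if l == label]
        X_out.extend(_kmeans_centroids(pool, n_clusters, rng))
        y_out.extend([label] * n_clusters)
    return X_out, y_out

X = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11],
     [5, 5], [5, 6]]
y = [0] * 8 + [1] * 2
X_res, y_res = cluster_centroid_under_sample(X, y, ratio=1.0, random_state=0)
```

Unlike random under-sampling, the returned majority points are synthetic centroids and are generally not rows of the original X.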

Methods:

  • fit : Find the target statistics to determine the minority class, and the number of samples in each class.
  • sample : Returns the resampled version of the original data set (X, y) passed to fit.
  • fit_sample : Automatically performs both fit and sample.