fernando edited this page Aug 17, 2014 · 8 revisions

Over-sampling methods.

OverSampler

OverSampler is an object that over-samples the minority class at random with replacement.

Parameters:

  • ratio : Controls the number of new samples to draw. The number of new samples is given by int(ratio * num_minority_samples).
  • random_state : Seed for random number generation.
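
As a rough illustration of the behaviour described above, random over-sampling with replacement can be sketched in a few lines of plain Python. The function name `random_over_sample` and its `minority_label` argument are hypothetical and are not part of OverSampler; only `ratio` and `random_state` come from the parameter list.

```python
import random


def random_over_sample(X, y, minority_label, ratio, random_state=None):
    """Append int(ratio * n_minority) minority rows drawn with replacement.

    Hypothetical helper for illustration only, not the OverSampler code.
    """
    rng = random.Random(random_state)
    minority_rows = [row for row, label in zip(X, y) if label == minority_label]
    n_new = int(ratio * len(minority_rows))
    # Drawing with replacement: the same row may be picked more than once.
    drawn = [rng.choice(minority_rows) for _ in range(n_new)]
    X_res = list(X) + drawn
    y_res = list(y) + [minority_label] * n_new
    return X_res, y_res


X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [9.0, 9.0], [8.0, 9.0]]
y = [0, 0, 0, 0, 1, 1]
# Two minority samples and ratio=1.5 give int(1.5 * 2) = 3 new samples.
X_res, y_res = random_over_sample(X, y, minority_label=1, ratio=1.5, random_state=0)
```

With two minority samples and `ratio=1.5`, three duplicated rows are appended, so the minority class grows from 2 to 5 samples.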

SMOTE - Synthetic Minority Over-sampling Technique

SMOTE is an object that generates synthetic samples by applying the SMOTE algorithm. New minority samples are generated along the line segments connecting each minority sample to its nearest minority neighbours.

Parameters:

  • k : Number of nearest neighbours to use when generating synthetic samples.
  • ratio : Controls the number of synthetic samples to generate. The number of new samples is given by int(ratio * num_minority_samples).
  • random_state : Seed for random number generation.
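
The interpolation idea behind SMOTE can be sketched as follows. This is a minimal, dependency-free sketch and not the library's implementation: the function name `smote_samples` is hypothetical, the base sample is picked at random for each new point (an assumption), and at least two minority samples are required.

```python
import math
import random


def smote_samples(minority, k, ratio, random_state=None):
    """Generate int(ratio * len(minority)) synthetic points (illustrative sketch)."""
    rng = random.Random(random_state)
    n_new = int(ratio * len(minority))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # A random gap in [0, 1) places the new point on the segment
        # between the base sample and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_samples(minority, k=2, ratio=2.0, random_state=0)
```

Because every new point lies on a segment between two existing minority samples, the synthetic points stay inside the region already covered by the minority class.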

Methods:

  • fit : Find the target statistics to determine the minority class, and the number of samples in each class.
  • transform : Returns the resampled version of the original data set (X, y) passed to fit.
  • fit_transform : Automatically performs both fit and transform.
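
The relationship between the three methods can be sketched with a small stand-in class. `TinyOverSampler` is hypothetical and exists only for illustration; its resampling step is plain random duplication rather than SMOTE, and the argument-free `transform()` is an assumption read off the description "passed to fit".

```python
import random
from collections import Counter


class TinyOverSampler:
    """Hypothetical stand-in illustrating fit / transform / fit_transform."""

    def __init__(self, ratio=1.0, random_state=None):
        self.ratio = ratio
        self.rng = random.Random(random_state)

    def fit(self, X, y):
        # Target statistics: samples per class, and which class is the rarest.
        self.X, self.y = list(X), list(y)
        self.class_counts = Counter(self.y)
        self.minority_label = min(self.class_counts, key=self.class_counts.get)
        return self

    def transform(self):
        # Resample the data set that was stored by fit.
        rows = [r for r, l in zip(self.X, self.y) if l == self.minority_label]
        n_new = int(self.ratio * len(rows))
        drawn = [self.rng.choice(rows) for _ in range(n_new)]
        return self.X + drawn, self.y + [self.minority_label] * n_new

    def fit_transform(self, X, y):
        # Convenience wrapper: fit, then transform.
        return self.fit(X, y).transform()


X = [[0.0], [1.0], [2.0], [3.0], [9.0], [8.0]]
y = [0, 0, 0, 0, 1, 1]
sampler = TinyOverSampler(ratio=1.0, random_state=0)
X_res, y_res = sampler.fit_transform(X, y)
```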

bSMOTE1 - Borderline SMOTE - type 1

bSMOTE1 is an object that generates synthetic samples by applying the SMOTE algorithm, but only to samples that are near the border between different classes.

First, the m nearest neighbours of every sample in the minority class are found. Minority samples that are completely surrounded by majority samples, i.e. all m nearest neighbours belong to the majority class, are considered noise and left out of the process. Samples with at most m/2 nearest neighbours (NNs) from the majority class are considered safe and are also left out of the process.

Samples for which the number of NNs from the majority class is greater than m/2 (but less than m) are considered in danger (near the borderline) and are used to generate synthetic samples. New minority samples are generated along the lines connecting these minority samples only to their nearest minority neighbours.
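
The noise / safe / in-danger split described above can be sketched as a small function. The name `label_minority_samples` and the toy data are hypothetical, and the thresholds follow the wording of this page (safe at most m/2, in danger above m/2 but below m).

```python
import math


def label_minority_samples(X, y, minority_label, m):
    """Label each minority sample as 'noise', 'safe' or 'danger' (sketch)."""
    labels = {}
    for i, (point, label) in enumerate(zip(X, y)):
        if label != minority_label:
            continue
        # m nearest neighbours among all other samples, regardless of class.
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(point, X[j]))[:m]
        n_majority = sum(1 for j in nearest if y[j] != minority_label)
        if n_majority == m:
            labels[i] = "noise"    # completely surrounded by the majority
        elif n_majority > m / 2:
            labels[i] = "danger"   # near the borderline
        else:
            labels[i] = "safe"
    return labels


X = [
    [0, 0], [0, 1], [1, 0],          # minority cluster
    [4, 4], [4, 5],                  # minority samples next to the majority
    [6, 7],                          # minority sample inside the majority
    [5, 5], [5, 6], [6, 5], [6, 6],  # majority samples
    [7, 6], [7, 7], [6, 8], [8, 8],
]
y = [1] * 6 + [0] * 8
labels = label_minority_samples(X, y, minority_label=1, m=3)
```

Only the samples labelled "danger" (indices 3 and 4 in this toy data) would then be used as starting points for SMOTE-style interpolation.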

Parameters:

  • m : Number of nearest neighbours to use when deciding if a sample is in danger.
  • k : Number of nearest neighbours to use when generating synthetic samples.
  • ratio : Controls the number of synthetic samples to generate. The number of new samples is given by int(ratio * num_minority_samples).
  • random_state : Seed for random number generation.

Methods:

  • fit : Find the target statistics to determine the minority class, and the number of samples in each class.
  • transform : Returns the resampled version of the original data set (X, y) passed to fit.
  • fit_transform : Automatically performs both fit and transform.

bSMOTE2 - Borderline SMOTE - type 2

bSMOTE2 is an object that generates synthetic samples by applying the SMOTE algorithm, but only to samples that are near the border between different classes.

As with bSMOTE1, the m nearest neighbours of every sample in the minority class are found first. Minority samples that are completely surrounded by majority samples, i.e. all m nearest neighbours belong to the majority class, are considered noise and left out of the process. Samples with at most m/2 nearest neighbours (NNs) from the majority class are considered safe and are also left out of the process.

Samples for which the number of NNs from the majority class is greater than m/2 (but less than m) are considered in danger (near the borderline) and are used to generate synthetic samples. What distinguishes bSMOTE2 from bSMOTE1 is that synthetic samples are created from nearest majority neighbours as well as nearest minority neighbours. However, synthetic samples created from majority neighbours are placed closer to the minority sample than those created from minority neighbours.
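
One way to realise "closer to the minority sample" is to shrink the interpolation gap when the neighbour belongs to the majority class. The sketch below assumes a gap in [0, 0.5) for majority neighbours and [0, 1) for minority neighbours, as in the original Borderline-SMOTE2 description; the function name `borderline2_sample` is hypothetical, and whether bSMOTE2 uses exactly these ranges is not stated on this page.

```python
import random


def borderline2_sample(base, neighbour, neighbour_is_majority, rng):
    """Create one synthetic point from an in-danger minority sample (sketch)."""
    # Assumed ranges: majority neighbours use a gap in [0, 0.5) so the new
    # point stays in the half of the segment nearest the minority sample.
    upper = 0.5 if neighbour_is_majority else 1.0
    gap = upper * rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


rng = random.Random(0)
base = [0.0, 0.0]
from_minority = borderline2_sample(base, [2.0, 0.0], False, rng)
from_majority = borderline2_sample(base, [2.0, 0.0], True, rng)
```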

Parameters:

  • m : Number of nearest neighbours to use when deciding if a sample is in danger.
  • k : Number of nearest neighbours to use when generating synthetic samples.
  • ratio : Controls the number of synthetic samples to generate. The number of new samples is given by int(ratio * num_minority_samples).
  • random_state : Seed for random number generation.

Methods:

  • fit : Find the target statistics to determine the minority class, and the number of samples in each class.
  • transform : Returns the resampled version of the original data set (X, y) passed to fit.
  • fit_transform : Automatically performs both fit and transform.

SVM_SMOTE - Support vector SMOTE

SVM_SMOTE is an object that generates synthetic samples using the SMOTE method, but only for borderline samples. However, unlike the usual borderline SMOTE, this method uses support vectors as the borderline samples.

First, the support vectors of the minority class are found by fitting an SVM classification object. Then, as in bSMOTE1 and bSMOTE2, the m nearest neighbours of every minority support vector are found. Support vectors that are completely surrounded by majority samples, i.e. all m nearest neighbours belong to the majority class, are considered noise and left out of the process.

Samples with at most m/2 NNs from the majority class are considered safe, while samples for which the number of NNs from the majority class is greater than m/2 (but less than m) are considered in danger. Synthetic samples are created via interpolation for samples in danger and via extrapolation for safe samples.
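
The difference between interpolation and extrapolation can be sketched for a single support vector. This is an illustration under stated assumptions, not the SVM_SMOTE code: the function name `svm_smote_sample` is hypothetical, the neighbour is assumed to be one of the k nearest minority neighbours, and extrapolation is assumed to step away from that neighbour by a random fraction of `step_out`. Finding the support vectors themselves (by fitting an SVM) is left out.

```python
import random


def svm_smote_sample(support_vector, neighbour, in_danger, step_out, rng):
    """Create one synthetic point from a minority support vector (sketch)."""
    if in_danger:
        # Interpolation: move from the support vector toward the neighbour.
        step = rng.random()
    else:
        # Extrapolation (assumed form): move away from the neighbour,
        # past the support vector, by at most step_out times the distance.
        step = -step_out * rng.random()
    return [s + step * (n - s) for s, n in zip(support_vector, neighbour)]


rng = random.Random(0)
sv, nn = [1.0, 0.0], [0.0, 0.0]
interpolated = svm_smote_sample(sv, nn, in_danger=True, step_out=0.5, rng=rng)
extrapolated = svm_smote_sample(sv, nn, in_danger=False, step_out=0.5, rng=rng)
```

In this toy case the interpolated point falls between the neighbour and the support vector, while the extrapolated point falls beyond the support vector on the side facing away from the neighbour.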

Parameters:

  • m : Number of nearest neighbours to use when deciding if a sample is in danger.
  • k : Number of nearest neighbours to use when generating synthetic samples.
  • step_out : The step size when extrapolating safe support vectors.
  • ratio : Controls the number of synthetic samples to generate. The number of new samples is given by int(ratio * num_minority_samples).
  • svm_args : Dictionary of arguments to pass to the scikit-learn SVC object.
  • random_state : Seed for random number generation.

Methods:

  • fit : Find the target statistics to determine the minority class, and the number of samples in each class.
  • transform : Returns the resampled version of the original data set (X, y) passed to fit.
  • fit_transform : Automatically performs both fit and transform.