Add missing values and categorical features when generating datasets #28952

Open
lcrmorin opened this issue May 5, 2024 · 5 comments

@lcrmorin

lcrmorin commented May 5, 2024

Describe the workflow you want to enable

I often use randomly generated datasets (typically from make_classification). However, I frequently end up having to add more realistic characteristics to the dataset by hand:

  • missing data: sometimes just to test the pipeline (missing at random would be fine), and sometimes to study more complex phenomena (missingness not at random, possibly depending on the target)
  • categorical features: categorical variables often need to be handled specifically. I usually introduce them by binning a continuous feature, then converting the bins to strings.

It would be nice to have both of these in dataset generation.
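
For illustration, the manual workflow described above might look roughly like the sketch below. The proportions, the choice of columns, and the `bin_` label prefix are assumptions made for the example, not anything scikit-learn provides.

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

# Missing completely at random: blank out roughly 10% of all cells.
X_mcar = X.copy()
mcar_mask = rng.random(X.shape) < 0.1
X_mcar[mcar_mask] = np.nan

# Missingness that depends on the target: feature 0 is dropped more
# often for samples of class 1 (30%) than for class 0 (5%).
X_dep = X.copy()
p_missing = np.where(y == 1, 0.3, 0.05)
dep_mask = rng.random(X.shape[0]) < p_missing
X_dep[dep_mask, 0] = np.nan

# Categorical feature: bin feature 1 into quartiles, then turn the
# integer bin codes (0..3) into string labels.
edges = np.quantile(X[:, 1], [0.25, 0.5, 0.75])
codes = np.digitize(X[:, 1], edges)
cat_feature = np.array([f"bin_{c}" for c in codes])
```

Each step is short, but it has to be rewritten for every experiment, which is the motivation for generator-level parameters.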

Describe your proposed solution

Introduce parameters to allow generating missing data (proportion of missingness; type of missingness: at random or not at random).
Introduce parameters to allow generating categorical features (number of features; distribution across categories: even, uneven, Pareto).
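
To make the categorical part of the proposal concrete, here is a minimal sketch of what such an option could do. `make_categorical_feature`, its parameters, and the `cat_` labels are hypothetical names invented for this example, and the "pareto" option is approximated with simple power-law weights (1, 1/2, 1/3, ...) rather than a true Pareto distribution.

```python
import numpy as np


def make_categorical_feature(n_samples, n_categories=4,
                             distribution="even", seed=None):
    """Hypothetical helper (not part of scikit-learn).

    Draws one string-valued categorical feature whose category
    frequencies follow the requested distribution.
    """
    rng = np.random.default_rng(seed)
    if distribution == "even":
        weights = np.ones(n_categories)
    elif distribution == "pareto":
        # Heavy-tailed stand-in: the k-th category has weight 1 / k.
        weights = 1.0 / np.arange(1, n_categories + 1)
    else:
        raise ValueError(f"unknown distribution: {distribution!r}")
    probs = weights / weights.sum()
    labels = np.array([f"cat_{i}" for i in range(n_categories)])
    return rng.choice(labels, size=n_samples, p=probs)


feature = make_categorical_feature(1000, n_categories=5,
                                   distribution="pareto", seed=0)
```

An actual implementation would still need to decide how such features relate to the target; this sketch draws them independently of it.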

Describe alternatives you've considered, if relevant

I usually handle this by hand.

Additional context

Could be used to illustrate imputation and encoding techniques.

@lcrmorin lcrmorin added Needs Triage Issue requires triage New Feature labels May 5, 2024
@oasidorshin

@lcrmorin This would be great for testing! I would also suggest adding infinities as possible values, because they also break things quite often. Also, if the values are randomly generated, the generator should make sure to always include at least one NaN and one inf value.
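
A rough sketch of that guarantee, assuming a hypothetical helper named `inject_nan_and_inf` (not an existing scikit-learn function): it picks distinct cells for NaN and for infinities, so neither overwrites the other and each appears at least once, however small the requested proportion.

```python
import numpy as np


def inject_nan_and_inf(X, proportion=0.05, seed=None):
    """Hypothetical helper: return a float copy of X with NaN and +/-inf.

    About `proportion` of the cells are corrupted, split evenly between
    NaN and infinities, with at least one cell of each kind.
    """
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # copy, so the input is left untouched
    if X.size < 2:
        raise ValueError("need at least two cells to place a NaN and an inf")
    n_each = max(1, int(round(proportion * X.size / 2)))
    n_each = min(n_each, X.size // 2)
    # Distinct flat positions: first half gets NaN, second half gets inf.
    idx = rng.choice(X.size, size=2 * n_each, replace=False)
    X.flat[idx[:n_each]] = np.nan
    X.flat[idx[n_each:]] = rng.choice([np.inf, -np.inf], size=n_each)
    return X


X_clean = np.zeros((10, 4))
X_dirty = inject_nan_and_inf(X_clean, proportion=0.1, seed=0)
```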

@AK3847

AK3847 commented May 6, 2024

@lcrmorin I suggest adding a noise function or something similar that generates structured randomness, so that the data has some coherent structure rather than being purely pseudo-random. Perhaps something like Perlin noise?

@glemaitre
Member

Regarding the missing values, I recall the following issues/PRs: #6284 / #7084. It seems that the consensus was to have something similar to the ampute R package.

I recall a similar discussion for categorical features, but I could not find it. For sure, it would be handy to have those two parameters, even though we could limit the complexity (e.g. only support a single missingness pattern).

@glemaitre glemaitre removed the Needs Triage Issue requires triage label May 14, 2024
@glemaitre
Member

Regarding the categorical features, we have the following related issue: #12433

@glemaitre glemaitre changed the title Improve random datasets Add missing values and categorical features when generating datasets May 16, 2024
@kaustubhgap

kaustubhgap commented May 19, 2024

I am working on this.


5 participants