datag.py

We have moved to https://codeberg.org/KOLANICH-ML/datag.py, grab new versions there.

Under the disguise of "better security", Micro$oft-owned GitHub has discriminated against users of 1FA passwords while having a commercial interest in the success and wide adoption of the FIDO 1FA specifications and the Windows Hello implementation, which it promotes as a replacement for passwords. It will result in dire consequences and is completely unacceptable, read why.

If you don't want to participate in harming yourself, it is recommended to follow the lead and migrate away from GitHub and Micro$oft. Here is the list of alternatives and rationales for doing so. If they delete the discussion, there are certain well-known places where you can get a copy of it. Read why you should also leave GitHub.

This is a data cleansing, standardization and aggregation framework.

Assume you have a few noisy, bad-quality data tables produced by people who don't care about their quality. These datasets are made just to say "we support open data", but in fact they have multiple issues. And we need to train a model on this piece of shit. In order to do it we need to make candy out of shit.

Issues in scope

Issue	Fix
data contains typos, even identifiers meant to uniquely identify stuff contain typos!	custom function fixing the typo
data is in different units even within the same column	determine the unit for each value in the dataset and validate it
some data is complete junk, for example an atom containing 1000 protons or a mass in coulombs	detect junk by incorporating domain knowledge and discard it
column names are semantically incorrect and different datasets use different columns	rename columns
some columns contain multiple values encoded in some hand-crafted format	expand them into separate columns, delete the original column
some data field is repeated, but with different values	compute an estimate using the present values or discard the value
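
Three of the fixes in the table above can be sketched in plain Python. This is only an illustration and does not use the actual datag API: the field names (`protons`, `formula_mass`), the `name:mass` packing format, and the helper names `is_junk`, `expand_packed` and `estimate` are all assumptions invented for the example.

```python
import statistics

# Domain knowledge: 118 is the highest atomic number of a known element.
MAX_PROTONS = 118


def is_junk(record):
    """Detect junk such as an atom containing 1000 protons."""
    return not (1 <= record["protons"] <= MAX_PROTONS)


def expand_packed(record, column="formula_mass", sep=":"):
    """Split a hypothetical 'name:mass' column into two columns.

    The original packed column is removed from the returned copy.
    """
    result = dict(record)
    name, mass = result.pop(column).split(sep)
    result["formula"] = name
    result["mass"] = float(mass)
    return result


def estimate(values):
    """Collapse repeated values of one field into a single estimate.

    The median is just one possible choice of estimator.
    """
    return statistics.median(values)


raw = [
    {"protons": 8, "formula_mass": "O:15.999"},
    {"protons": 1000, "formula_mass": "Xx:1.0"},  # junk, gets discarded
]
clean = [expand_packed(r) for r in raw if not is_junk(r)]
```

Here `clean` keeps only the first record, with the packed column replaced by `formula` and `mass`.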

Issues out of scope

Imputation
(Re)balancing
Encoding
Any stuff doing machine learning (but you can implement it yourself)

Pipeline

get a formal description of what you want the data to be
- unit
- constraints
for each source:
- get a raw record from a source
- apply a transformation
- apply in-source validation
do intersource
- validation and consistency checks
- merging and estimation
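
The steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the way datag implements it: the spec layout (`unit`, `min`, `max`), the function name `run_pipeline`, the sample field names, and the use of a plain mean as the estimator are all hypothetical.

```python
def run_pipeline(sources, spec, key):
    """Run a toy version of the pipeline.

    sources: list of (raw_records, transform) pairs (assumed layout).
    spec: field name -> {"unit": ..., "min": ..., "max": ...} (assumed layout).
    key: name of the field that identifies a record.
    """

    def valid(record):
        # In-source validation: every field in the spec must satisfy its constraints.
        return all(
            rules["min"] <= record[name] <= rules["max"]
            for name, rules in spec.items()
        )

    per_key = {}
    for raw_records, transform in sources:
        for raw in raw_records:
            record = transform(raw)  # apply a transformation
            if not valid(record):    # apply in-source validation
                continue
            per_key.setdefault(record[key], []).append(record)

    # Intersource merging and estimation: average the values seen for each key.
    merged = {}
    for identifier, records in per_key.items():
        merged[identifier] = {
            name: sum(r[name] for r in records) / len(records) for name in spec
        }
    return merged


spec = {"mass": {"unit": "kg", "min": 0.0, "max": 1000.0}}

source_kg = ([{"id": "x", "mass": 2.0}], lambda r: r)
source_g = (
    [{"id": "x", "mass_g": 4000.0}, {"id": "y", "mass_g": -5.0}],
    # Convert grams to the unit required by the spec.
    lambda r: {"id": r["id"], "mass": r["mass_g"] / 1000.0},
)

result = run_pipeline([source_kg, source_g], spec, key="id")
```

The record `y` violates the `min` constraint and is dropped, so `result` contains only `x`, with the two sources averaged.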

Task decomposition

Spec - a way to encode requirements to our data.
Record - just a dict with some additional properties.
Source - gets the records by their identifiers. Has
- priority
- spec
- entity
Entity - a way to discover Sources providing us with Records of the same kind. Acts as a namespace and as a final validator. Has
- spec
Rule - transforms the data, detects errors and recovers the missing stuff.
Disambiguator - uses a dictionary for standardization of identifiers.
Merger - combines different datasets into a composite one.
Pipeline - a Source of the resulting dataset. Because it is a Source, it can be plugged further.
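
The point that a Pipeline is itself a Source can be illustrated with a small sketch. The classes below only borrow the names from the list above; their constructors, the `get` method, and the rule that a higher `priority` overrides a lower one are assumptions made for the example and are not taken from the datag code.

```python
class Source:
    """Gets records by their identifiers."""

    def __init__(self, records, priority=0):
        self.records = records
        self.priority = priority

    def get(self, identifier):
        return self.records.get(identifier)


class Disambiguator:
    """Uses a dictionary to map misspelled identifiers to canonical ones."""

    def __init__(self, aliases):
        self.aliases = aliases

    def __call__(self, identifier):
        return self.aliases.get(identifier, identifier)


class Pipeline(Source):
    """A Source of the resulting dataset, built from other Sources."""

    def __init__(self, sources, disambiguator, priority=0):
        super().__init__(records={}, priority=priority)
        # Lowest priority first, so later (higher-priority) values override.
        self.sources = sorted(sources, key=lambda s: s.priority)
        self.disambiguator = disambiguator

    def get(self, identifier):
        identifier = self.disambiguator(identifier)
        merged = {}
        for source in self.sources:
            record = source.get(identifier)
            if record:
                merged.update(record)
        return merged or None


low = Source({"H2O": {"name": "water", "mass": 18.0}}, priority=1)
high = Source({"H2O": {"mass": 18.015}}, priority=2)

pipeline = Pipeline([low, high], Disambiguator({"H20": "H2O"}))
# Because a Pipeline is a Source, it can be plugged into another Pipeline.
nested = Pipeline([pipeline], Disambiguator({}))
```

Looking up the misspelled identifier `"H20"` through `pipeline` resolves to the `H2O` record, with `mass` taken from the higher-priority source and `name` from the lower-priority one.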
