Data Science Training Package

To train new data scientists up to a bare-minimum level of competency

Prerequisites

Contents

| S/N | Domain                       | Estimated Duration |
|-----|------------------------------|--------------------|
| 1   | Foundation                   | 2 weeks            |
| 2   | Machine Learning             | 2 months           |
| 3   | Natural Language Processing  | 1 month            |
| 4   | Automatic Speech Recognition | 1 month            |

(Bonus) Interesting Reads

When you want to take a break from lectures 😊

ML workflow

  1. asking the right questions
    • see also the data literacy project scoping guide
    • look into elicitation of requirements or ideas from your users
    • find ways to convert business information needs into data questions
  2. data acquisition
    • maybe 'just get some labelled training data' doesn't seem like it even warrants a mention
    • trust me, it does (unless it's an open source dataset)
  3. ETL
    • (part of 'wrangling' or 'munging')
    • transforming data between formats (csv, json, html/xml, pipe-delimited, ...)
    • fixing broken formats (such as unquoted csv ಠ_ಠ)
    • minimal data edits for now, at least until you understand the data
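As a minimal sketch of the format-transformation step, here is a pipe-delimited-to-JSON conversion using only the standard library (the data is a made-up example). Note that `csv.reader` handles quoted fields correctly, which a naive `str.split('|')` would not:

```python
import csv
import io
import json

# Hypothetical pipe-delimited input; the quoted field contains the delimiter.
raw = 'id|name|note\n1|alice|"hello|world"\n2|bob|hi'

# csv.reader respects quoting, so "hello|world" stays one field.
reader = csv.reader(io.StringIO(raw), delimiter="|")
header = next(reader)
records = [dict(zip(header, row)) for row in reader]

print(json.dumps(records))
```

The same `csv` module can also write properly quoted output, which is one way to fix the unquoted-csv problem mentioned above.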
  4. exploration
    • also known as 'EDA'
    • just look at a subset if your dataset can't fit in RAM, but make sure it's not a biased subset
    • df.describe()
    • it is important to visualize your data
      • parallel coordinate plot / Andrews plot / Kent–Kiviat radar chart
      • correlation matrix
      • example
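A quick EDA sketch, assuming pandas is available (the columns here are invented toy data standing in for your real dataset):

```python
import pandas as pd

# Toy dataset; substitute your own DataFrame (or a sample of it).
df = pd.DataFrame({
    "age": [23, 45, 31, 35, 62, 28],
    "income": [30_000, 80_000, 52_000, 61_000, 95_000, 40_000],
    "clicks": [5, 2, 4, 3, 1, 5],
})

# Per-column summary statistics: count, mean, std, min/max, quartiles.
print(df.describe())

# Pairwise Pearson correlation matrix: a cheap first look at redundancy.
print(df.corr())
```

On real data, plotting these (e.g. a heatmap of `df.corr()`) usually reveals more than the raw numbers.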
  5. data provenance, background research, and literature review
    • this often happens concurrently with EDA
    • understand where the data came from
    • what the labels mean, how accurate they are, and whether there's class imbalance
    • any pre-processing that was done that can't be undone
    • find out what has been tried before that did or didn't work
    • make a list of things you think might work for your dataset
  6. cleaning
    • (part of 'wrangling' or 'munging')
    • outliers / anomalies (e.g. a huge spike in the data)
    • impute missing values
    • remove noise
    • handle Unicode
    • data version control, to track how the data was cleaned
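A small imputation-and-outlier sketch, assuming pandas; the readings are invented, and the 3-MAD threshold is one common heuristic, not the only valid choice:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with a missing value and an obvious spike.
s = pd.Series([1.0, 1.2, np.nan, 1.1, 250.0, 0.9])

# Impute the missing value with the median (robust to the outlier).
s = s.fillna(s.median())

# Flag points more than 3 median absolute deviations from the median,
# then replace them with the median.
med = s.median()
mad = (s - med).abs().median()
mask = (s - med).abs() > 3 * mad
cleaned = s.mask(mask, med)

print(cleaned.tolist())
```

Whether to clip, replace, or drop an outlier depends on the domain; the important part is recording the decision (hence the data-version-control bullet above).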
  7. baseline / POC
    • train/test split (+ optional cross-validation)
    • linear least squares / logistic regression
    • xgboost
    • if you're getting abysmal performance, maybe the data is still borked - garbage in, garbage out
      • or maybe it's impossible, and you should just give up
    • if you're getting suspiciously good performance (especially if you get 100%) something is probably wrong
      • somehow leaking your label, maybe via an extremely correlated variable
        • e.g. you've randomly split a timeseries into train/test sets that overlap in time (as opposed to training and test sets that are respectively before and after some date)
      • or maybe your problem is too easy, and you just need some rules or heuristics
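A baseline sketch using scikit-learn, with synthetic data standing in for your labelled dataset (swap in `xgboost.XGBClassifier` for the model if you want the tree-based baseline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your labelled data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set; fix random_state so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Logistic regression: a cheap, hard-to-misconfigure baseline.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"baseline accuracy: {acc:.3f}")
```

If `acc` comes out at exactly 1.0 on real data, treat it as a leakage alarm, not a victory.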
  8. featurization (NLP usually happens here)
  9. training (ML usually happens here)
    • what is the value in your data? -> ML must be either actionable or informative (or both)
      • predictive models (regressions)
      • descriptive models (classifications)
      • prescriptive models (recommendations)
      • associative models (clustering)
      • (this list is not comprehensive - e.g. there are also generative models)
    • feature selection
      • duplicate features: high correlation / covariance
      • useless features: low / no variance
      • null features: mostly missing values, with only a small percent of real data
      • xgboost feature importance
    • model selection
    • hyperparameter optimization
    • dimension reduction
    • stacking/bagging/boosting
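A feature-selection sketch for the duplicate / useless / null cases above, assuming pandas; the columns and the 0.95 correlation cutoff are illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # duplicate: perfectly correlated with 'a'
    "c": [7.0, 7.0, 7.0, 7.0, 7.0],   # useless: zero variance
    "d": [0.3, 1.9, 0.2, 2.5, 1.1],   # informative: keep
})

# Drop zero / near-zero variance columns.
df = df[df.columns[df.var() > 1e-8]]

# Drop one column from each highly correlated pair (|r| > 0.95).
corr = df.corr().abs()
to_drop = set()
cols = list(df.columns)
for i, x in enumerate(cols):
    for y in cols[i + 1:]:
        if corr.loc[x, y] > 0.95 and x not in to_drop and y not in to_drop:
            to_drop.add(y)
df = df.drop(columns=sorted(to_drop))

print(list(df.columns))
```

For mostly-missing ("null") features, the analogous filter is a threshold on `df.isna().mean()`.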
  10. testing (measuring performance)
    • as close to real data as possible
    • debugging
    • try to find edge cases / figure out where your model / algorithm breaks down
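When measuring performance, accuracy alone can mislead (especially under class imbalance), so it is worth knowing what precision and recall actually compute. A hand-rolled sketch on made-up predictions:

```python
# Hypothetical predictions vs. ground truth for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of everything flagged, how much was right?
recall = tp / (tp + fn)     # of everything real, how much did we catch?
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

In practice you would use `sklearn.metrics.classification_report`, but it reports exactly these quantities.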
  11. visualization (of results)
    • also known as 'storytelling'
    • seaborn or even the most basic matplotlib.pyplot
  12. inference and explanations
    • if you've gotten this far, congrats
    • try something like SHAP
  13. deploying / sharing your model
    • fastapi
    • streamlit
    • or just share the Jupyter notebook with all your above steps (but clean it up and add explanations first!)
  14. developing a clean API
  15. building a UI
    • UX is more of an art than a science, many books have been written, none of them cover everything you need to know
    • but here's a TL;DR for UI: let the user get their thing done as fast as possible, with the fewest ways to get it wrong or misunderstand what happened, with minimal interaction per transaction, ideally without needing to read instructions or even think about the process
    • and don't make it ugly or irritating to use a thousand times (because your users probably will have to)
  16. monitoring: collecting usage stats / telemetry
    • inevitably, management will ask how many people are using your thing
    • you can't really answer "no clue" and still expect to get your bonus
  17. making it faster with better algorithms (do this last, don't prematurely optimize unless it's really too slow)
    • think about time / space complexity
    • an inverted index for search
    • using collections.deque (or a circular buffer) instead of list.pop(0)
    • binary search in a sorted list (e.g. built-in bisect)
    • dynamic programming, memoization (e.g. built-in functools.lru_cache), tail call elimination, loop unrolling
    • approx nearest-neighbor lookup (annoy, faiss, Milvus)
    • parallelism (multiprocessing), async, locks, atomicity
    • A* or Dijkstra (as opposed to BFS / DFS)
    • cython / numba / pypy
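Three of the stdlib tricks above in one sketch (the inputs are arbitrary examples): `deque.popleft()` is O(1) where `list.pop(0)` is O(n), `lru_cache` memoizes repeated subproblems, and `bisect` does O(log n) search in a sorted list:

```python
import bisect
from collections import deque
from functools import lru_cache

# list.pop(0) shifts every remaining element; deque.popleft() does not.
queue = deque([1, 2, 3])
first = queue.popleft()

# Memoization: naive recursive fib is exponential; cached, it is linear.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Binary search in a sorted list instead of a linear scan.
sorted_vals = [2, 5, 9, 14, 20]
idx = bisect.bisect_left(sorted_vals, 14)

print(first, fib(50), idx)
```

Without the cache, `fib(50)` would take on the order of billions of calls; with it, fifty-one.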
  18. MLOps
    • CI/CD (e.g. to push to prod)
    • version control of models and data
    • quality metrics, detecting drift, auto retraining
    • TODO: fill up this bit
