Datalake Documentation

This documentation describes a GitOps-oriented data lake.

Architecture

(architecture diagram)

Components

Physical Storage Layer (Amazon S3)

The data is stored in Amazon S3 and contains the following types of assets:

{
    // Raw dataset
    `s3://data-bucket/asset/${repo}/${branch}/data/${yyyy}-${mm}-${dd}.csv.gz`: '<gzipped csv data>',
    // Metadata file indicating the location of the latest dataset
    `s3://data-bucket/asset/${repo}/${branch}/metadata/latest`: latest.grosRef.serialize(),
    // Downstream Dependency location
    `s3://data-bucket/dependency/${repo}/${branch}/downstreams/${urlencode(`${repo}/${branch}`)}`: dep3.grosRef.serialize(),
    // Audit logs containing information on the downstreams dependent on a dataset
    `s3://data-bucket/dependency/${repo}/${branch}/${yyyy}-${mm}-${dd}/downstreams/${urlencode(`${dep3.repo}/${dep3.branch}`)}`: dep3.grosRef.serialize(),
    // Audit logs containing information on the upstream datasets used to produce this dataset
    `s3://data-bucket/dependency/${repo}/${branch}/${yyyy}-${mm}-${dd}/metadata/${yyyy}-${mm}-${dd}.json`: {
        "gros_ref": this.grosRef.serialize(),
        "upstreams": [
            dep1.grosRef.serialize(),
            dep2.grosRef.serialize(),
        ]
    },
}
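
As a rough illustration, the sketch below builds a few of these keys in Python. The function names and the data-bucket name are assumptions taken from the layout above, not part of any published client:

from datetime import date

BUCKET = "data-bucket"  # bucket name taken from the layout above

def asset_keys(repo: str, branch: str, day: date) -> dict:
    """Build the raw-dataset key and the latest-pointer key for one dataset."""
    prefix = f"s3://{BUCKET}/asset/{repo}/{branch}"
    return {
        "data": f"{prefix}/data/{day:%Y-%m-%d}.csv.gz",
        "latest_pointer": f"{prefix}/metadata/latest",
    }

def upstream_audit_key(repo: str, branch: str, day: date) -> str:
    """Key of the audit log listing the upstreams used to produce the dataset for `day`."""
    return (f"s3://{BUCKET}/dependency/{repo}/{branch}/"
            f"{day:%Y-%m-%d}/metadata/{day:%Y-%m-%d}.json")

print(asset_keys("my-repo", "master", date(2023, 1, 31)))
print(upstream_audit_key("my-repo", "master", date(2023, 1, 31)))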

Data Virtualization

To retrieve datasets from the data lake, you can pass a gros:// URI directly to the pandas.read_csv function.

Example:

import gros  # the gros package provides the gros:// protocol used below
import pandas as pd

# Read the latest dataset produced by the my-branch branch of my-repo
df = pd.read_csv('gros://my-repo/my-branch:latest')

GROS URI spec

The URI is in the following format:

gros://$repo[/$branch][:$tag]

where:

  • repo (required) is the name of the repo (the project slug) that produced the data.
  • branch (optional) is the branch of code that produced the data. Defaults to master.
  • tag (optional) is one of the following (see the examples after this list):
      • empty, indicating that the pipeline is sourcing the latest dataset as a real-time dependency. The pipeline will be run whenever its upstream has new data.
      • latest, indicating that the pipeline uses the latest dataset but will not be triggered on new data.
      • a date string in YYYY-mm-dd format, selecting the dataset for that date.
      • a signed integer, selecting a dataset by its position in the earliest-to-latest ordering.
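
For example, assuming a repo named my-repo and a branch named my-branch (both hypothetical), each tag form looks like this:

import gros  # the gros package provides the gros:// protocol
import pandas as pd

# Empty tag: real-time dependency; this pipeline reruns when my-repo publishes new data
df = pd.read_csv('gros://my-repo')
# latest tag: read the newest dataset without registering a trigger
df = pd.read_csv('gros://my-repo/my-branch:latest')
# Date tag: read the dataset for a specific day
df = pd.read_csv('gros://my-repo/my-branch:2023-01-31')
# Signed integer tag: select a dataset by its position in the earliest-to-latest ordering
df = pd.read_csv('gros://my-repo/my-branch:-1')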

Data Downstream Propagation

The pipelines are run on GitLab CI runners on a Kubernetes cluster. The CI jobs have credentials that let them trigger downstream pipelines using GitLab's multi-project pipeline features. When pipelines run, datasets are retrieved from the data lake through a single API, which logs the datasets accessed during the job run. If a dataset is retrieved using an empty tag, it is recognized as an upstream dependency, and the consuming pipeline will be rerun when that upstream dataset is updated.
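
As a rough sketch of the empty-tag case, a read could register the calling job under the downstreams prefix shown in the storage layout above. The register_downstream helper, the bucket name, and the payload here are assumptions for illustration, not the actual gros client behavior:

import os
import boto3
from urllib.parse import quote

s3 = boto3.client("s3")

def register_downstream(upstream_repo: str, upstream_branch: str, gros_ref: str) -> None:
    """Record that the current CI job's project/branch depends on the upstream dataset."""
    # CI_PROJECT_PATH and CI_COMMIT_REF_NAME are standard GitLab CI variables
    downstream = f"{os.environ['CI_PROJECT_PATH']}/{os.environ['CI_COMMIT_REF_NAME']}"
    key = (f"dependency/{upstream_repo}/{upstream_branch}/downstreams/"
           f"{quote(downstream, safe='')}")
    s3.put_object(Bucket="data-bucket", Key=key, Body=gros_ref.encode())

With records like these in place, the upstream's CI job can list the downstreams prefix and trigger each dependent pipeline through GitLab's multi-project pipeline features.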

Data Ops

This strategy benefits from pipeline logs, source code, and secrets management being consolidated behind a single pane of glass. The infrastructure runs on a GKE Autopilot cluster, which auto-scales the node count based on the CPU and RAM requested. After jobs finish, the platform automatically scales the resources back down for cost savings. The result is a scalable yet low-maintenance data lake that supports complex pipeline logic and easy access to datasets from an interactive notebook environment.
