eode/quilt-compiler

Docs

Visit docs.quiltdata.com, or browse the docs on GitHub.

Manage data like code

Quilt provides versioned, reusable building blocks for analysis in the form of data packages. A data package may contain data of any type or size. In spirit, Quilt does for data what package managers and Docker registries do for code: provide a centralized, collaborative store of record.

Getting started tutorial

Benefits

  • Reproducibility - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies.
  • Collaboration and transparency - Data likes to be shared. Quilt offers a centralized data warehouse for finding and sharing data.
  • Auditing - The registry tracks all reads and writes so that admins know when data are accessed or changed.
  • Less data prep - The registry abstracts away network, storage, and file format so that users can focus on what they wish to do with the data.
  • Deduplication - Data fragments are hashed with SHA256. A given fragment is written to disk only once per user, across all packages. As a result, large, repeated data fragments consume less disk space and network bandwidth.
  • Faster analysis - Serialized data loads 5 to 20 times faster than files. Moreover, specialized storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster.
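
The deduplication benefit above can be illustrated with a small content-addressed store. This is a minimal sketch, not Quilt's implementation: the `FragmentStore` class and its methods are hypothetical names introduced here, and the only detail taken from the text is that fragments are keyed by their SHA256 hash so that identical bytes are written once.

```python
import hashlib


class FragmentStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self._blobs = {}   # hex digest -> fragment bytes
        self.writes = 0    # how many fragments were actually written

    def put(self, data: bytes) -> str:
        # The SHA256 digest of the bytes is the fragment's identity.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            # Only previously unseen content is written.
            self._blobs[digest] = data
            self.writes += 1
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]


store = FragmentStore()
first = store.put(b"x,y\n1,2\n")
second = store.put(b"x,y\n1,2\n")   # same bytes, so no second write
third = store.put(b"x,y\n3,4\n")    # different bytes, so a new fragment
```

Because the key is derived from the content alone, two packages that contain the same fragment end up pointing at the same stored bytes.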

Commands

Here are the basic Quilt commands:

Service

Quilt is offered as a managed service at quiltdata.com.

Architecture

Quilt consists of three source-level components:

  1. A data catalog

    • Displays package metadata in HTML
    • Implemented in JavaScript with Redux and redux-saga
  2. A data registry

    • Controls permissions
    • Stores package fragments in blob storage
    • Stores package metadata
    • De-duplicates repeated data fragments
    • Implemented in Python with Flask and PostgreSQL
  3. A data compiler

    • Serializes tabular data to Apache Parquet
    • Transforms and parses files
    • Builds packages locally
    • Pushes packages to the registry
    • Pulls packages from the registry
    • Implemented in Python with pandas and PyArrow
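
The division of labor between the compiler and the registry can be sketched as a toy in-memory model. Everything named here (`build_package`, `Registry`, `push`, `pull`, and the idea of deriving a version identifier by hashing the manifest) is an assumption made for illustration and is not Quilt's actual API or storage layout. The sketch only mirrors the responsibilities listed above: the compiler builds a package locally, and the registry stores fragments and metadata and de-duplicates repeated fragments.

```python
import hashlib
import json


def build_package(files):
    """Compiler side (hypothetical): turn {logical name: bytes} into a
    manifest of fragment hashes plus the fragments themselves."""
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    fragments = {hashlib.sha256(data).hexdigest(): data for data in files.values()}
    return manifest, fragments


class Registry:
    """Registry side (hypothetical): stores fragments and package metadata."""

    def __init__(self):
        self.blobs = {}       # fragment hash -> bytes
        self.manifests = {}   # (package name, version) -> manifest

    def push(self, name, manifest, fragments):
        for digest, data in fragments.items():
            # Repeated fragments are stored once.
            self.blobs.setdefault(digest, data)
        # Assumed scheme: the version is a hash of the manifest contents.
        encoded = json.dumps(manifest, sort_keys=True).encode("utf-8")
        version = hashlib.sha256(encoded).hexdigest()
        self.manifests[(name, version)] = manifest
        return version

    def pull(self, name, version):
        manifest = self.manifests[(name, version)]
        return {logical: self.blobs[digest] for logical, digest in manifest.items()}


registry = Registry()
manifest, fragments = build_package({"train": b"x,y\n1,2\n", "test": b"x,y\n1,2\n"})
version = registry.push("example/demo", manifest, fragments)
restored = registry.pull("example/demo", version)
```

In this model, pulling a package by name and version always returns the same bytes, which is the property the reproducibility benefit relies on.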
