
flaco

Install:

pip install flaco


The easiest and perhaps most memory-efficient way to get PostgreSQL data (more flavors to come?) into a pyarrow.Table, a pandas.DataFrame, or Arrow (IPC/Feather) and Parquet files.

Arrow supports efficient, even larger-than-memory, processing with tools like dask, duckdb, and others. Just getting the data onto disk is sometimes the hardest part; this aims to make that easier.

API:

- flaco.read_sql_to_file: read a SQL query into a Feather or Parquet file.
- flaco.read_sql_to_pyarrow: read a SQL query into a pyarrow.Table.
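
A minimal usage sketch (the connection URI, query, and output path below are placeholders, not part of the library):

```python
import flaco

# Placeholder PostgreSQL connection string and query
DB_URI = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from test_table"

# Write the query result straight to an Arrow IPC (Feather) file on disk
flaco.read_sql_to_file(DB_URI, stmt, "result.feather", flaco.FileFormat.Feather)

# ...or load it into an in-memory pyarrow.Table
table = flaco.read_sql_to_pyarrow(DB_URI, stmt)
df = table.to_pandas()
```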

NOTE: This is still a WIP. I intend to generalize it further so it's useful to a wider audience. Issues and pull requests welcome!


Example

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   122    147.9 MiB    147.9 MiB           1   @profile
   123                                         def memory_profile():
   124    147.9 MiB      0.0 MiB           1       stmt = "select * from test_table"
   125
   126                                             # Read SQL to file
   127    150.3 MiB      2.4 MiB           1       flaco.read_sql_to_file(DB_URI, stmt, 'result.feather', flaco.FileFormat.Feather)
   128    150.3 MiB      0.0 MiB           1       with pa.memory_map('result.feather', 'rb') as source:
   129    150.3 MiB      0.0 MiB           1           table1 = pa.ipc.open_file(source).read_all()
   130    408.1 MiB    257.8 MiB           1           table1_df1 = table1.to_pandas()
   131
   132                                             # Read SQL to pyarrow.Table
   133    504.3 MiB     96.2 MiB           1       table2 = flaco.read_sql_to_pyarrow(DB_URI, stmt)
   134    644.1 MiB    139.8 MiB           1       table2_df = table2.to_pandas()
   135
   136                                             # Pandas
   137    648.8 MiB      4.7 MiB           1       engine = create_engine(DB_URI)
   138   1335.4 MiB    686.6 MiB           1       _pandas_df = pd.read_sql(stmt, engine)
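
Once the data is on disk, larger-than-memory tools can pick it up lazily. A sketch using dask (assuming flaco.FileFormat.Parquet is available alongside Feather, per the API above; the URI and query are placeholders):

```python
import flaco
import dask.dataframe as dd

DB_URI = "postgresql://postgres:postgres@localhost:5432/postgres"  # placeholder
stmt = "select * from test_table"  # placeholder

# Write the query result to a Parquet file without building a DataFrame first
flaco.read_sql_to_file(DB_URI, stmt, "result.parquet", flaco.FileFormat.Parquet)

# dask reads the Parquet file lazily, so the data never has to fit in memory at once
ddf = dd.read_parquet("result.parquet")
print(ddf.describe().compute())
```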

License

Why did you choose such lax licensing? Could you change to a copyleft license, please?

...just kidding, no one would ask that. This is dual-licensed under the Unlicense or MIT, at your discretion.