Skip to content

jychen7/BigQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BigQL

BigQL provides a SQL Query Layer for Google Cloud Bigtable.

Use Cases

Cloud Bigtable is Google's fully managed NoSQL Big Data database service. Each table contains rows and columns. Each row/column intersection can contain multiple cells. Each cell contains a unique timestamped version of the data for that row and column. Thus Bigtable is often used to store time series data.

BigQL provides a SQL query layer to run aggregation query on Bigtable.

Quick Start

pip install bigql

Using the weather balloon example data shown in Single-timestamp unserialized schema design

Row key                         pressure    temperature humidity    altitude
us-west2#3698#2021-03-05-1200   94558       9.6         61          612
us-west2#3698#2021-03-05-1201   94122       9.7         62          611
us-west2#3698#2021-03-05-1202   95992       9.5         58          602
us-west2#3698#2021-03-05-1203   96025       9.5         66          598
us-west2#3698#2021-03-05-1204   96021       9.6         63          624

After initialize the client

from bigql.client import Client
# config follows offical python bigtable client
client = Client(config)

client.register_table(
    "weather_balloons",
    instance_id="INSTANCE_ID",
    column_families={
        "measurements": {
            "only_read_latest": True,
            "columns": {
                "pressure": int,
                "temperature": str,
                "humidity": int,
                "altitude": int
            }
        }
    }
)

we are able to calculate average pressure of the period by

client.query("measurements", """
SELECT avg(pressure) FROM weather_balloons
WHERE
  "_row_key" BETWEEN 'us-west2#3698#2021-03-05-1200' AND 'us-west2#3698#2021-03-05-1204'
""")

Or with row key decomposition

client.register_table(
    xxx,
    row_key_identifiers=["location", "balloon_id", "event_minute"],
    row_key_separator="#"
)

client.query("measurements", """
SELECT balloon_id, avg(pressure) FROM weather_balloons
WHERE
  location = 'us-west2'
  AND balloon_id IN ('3698', '3700')
  AND event_minute BETWEEN '2021-03-05-1200' AND '2021-03-05-1204'
GROUP BY 1
""")

The output of query is list of pyarrow.RecordBatch. It can be easily convert to python dictionary (to_pydict) and pandas dataframe (to_pandas).

Group by Time

Each cell in Bigtable have a timestamp. SELECT "_timestamp" will return a float number, represent seconds since Unix epoch. Following is an example to select 5 minutes interval

SELECT to_timestamp_seconds(cast(floor("_timestamp" / 600) * 600 as bigint)) as interval

Alternative

  1. Google BigQuery external data source

However, as of 2022-01, it

  • only supports "us-central1" and "europe-west1" region
  • only supports query with "rowkey"
  • by default can run up to 4 concurrent queries against Bigtable external data source

Roadmap

SQL

  • ✅ Insert Into
  • ✅ Select *
  • ✅ Select column(s)
  • ✅ Filter (WHERE): "=", "IN", "BETWEEN", ">", ">=", "<", "<="
  • ✅ GROUP BY
  • ✅ ORDER BY
  • ✅ HAVING
  • ✅ LIMIT
  • ✅ Aggregate (e.g. avg, sum, count)
  • ✅ AND
  • ✅ Alias
  • ✅ Cast
  • ✅ Common Math Functions
  • Common Date/Time Functions
  • OR ???
  • Join ???

General

  • ✅ Partition Pruning
  • ✅ Projection pushdown
  • ❌ Predicate push down (Value range and Value regex)
    • not work well, because its filter works on all cells, not only predicate column
  • Limit Pushdown ???

Limitation

  • for row key encoding, only string is supported
  • for single/composite row key, identifiers supports "=" and "IN". Additionally, last identifier also supports "BETWEEN".
  • for qualifiers, only string and integer (64bit BigEndian encoding) value are supported
  • subqueries and common table expressions are not supported

Technical Details

BigQL depends on