Skip to content

Commit

Permalink
Create unique naming spec (apache#31)
Browse files Browse the repository at this point in the history
* crate unique naming spec
  • Loading branch information
julienledem committed May 7, 2021
1 parent b879e75 commit a9d34a3
Showing 1 changed file with 120 additions and 0 deletions.
120 changes: 120 additions & 0 deletions spec/Naming.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# Naming

We define the unique name strategy per datasource to ensure it is followed uniformly independently from the type of job consuming and producing data.
The namespace for a dataset is the unique name for its datasource.
For jobs it is the scheduler.

## Datasets:
The namespace and name of a datasource can be combined to form a URI (scheme:[//authority]path)
* Namespace = scheme:[//authority] (the datasource)
* Name = path (the datasets)

### Data warehouses/data bases.
Datasets are called tables. Tables are organised in databases and schemas.
#### Postgres:
Datasource hierarchy:
* Host
* Port

Naming hierarchy:
* Database
* Schema
* Table

Identifier:
* Namespace: postgres://{host}:{port} of the service instance.
* Scheme = postgres
* Authority = {host}:{port}
* Unique name: {database}.{schema}.{table}
* URI = postgres://{host}:{port}/{database}.{schema}.{table}

#### Snowflake
See: [Object Identifiers — Snowflake Documentation](https://docs.snowflake.com/en/sql-reference/identifiers.html)

Datasource hierarchy:
* account name

Naming hierarchy:
* Database: {database name} => unique across the account
* Schema: {schema name} => unique within the database
* Table: {table name} => unique within the schema

Identifier:
* Namespace: snowflake://{account name}
* Scheme = snowflake
* Authority = {account name}
* Name: {database}.{schema}.{table}
* URI = snowflake://{account name}/{database}.{schema}.{table}

#### BigQuery
See:
[Creating and managing projects | Resource Manager Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
[Introduction to datasets | BigQuery](https://cloud.google.com/bigquery/docs/datasets-intro)
[Introduction to tables | BigQuery](https://cloud.google.com/bigquery/docs/tables-intro)

Datasource hierarchy:
* bigquery

Naming hierarchy:
* Project Name: {project name} => is not unique
* Project number: {project number} => numeric: is unique across google cloud
* Project ID: {project id} => readable: is unique across google cloud
* dataset: {dataset name} => is unique within a project
* table: {table name} => is unique within a dataset

Identifier :
* Namespace: bigquery
* Scheme = bigquery
* Authority =
* Unique name: {project id}.{dataset name}.{table name}
* URI = bigquery:{project id}.{schema}.{table}

### Distributed file systems/blob stores
#### GCS
Datasource hierarchy: none, naming is global

Naming hierarchy:
* bucket name => globally unique
* Path

Identifier :
* Namespace: gs://{bucket name}
* Scheme = gs
* Authority = {bucket name}
* Unique name: {path}
* URI = gs://{bucket name}{path}

#### S3
Naming hierarchy:
* bucket name => globally unique
* Path

Identifier :
* Namespace: s3://{bucket name}
* Scheme = s3
* Authority = {bucket name}
* Unique name: {path}
* URI = s3://{bucket name}{path}

#### HDFS
Naming hierarchy:
* Namenode: host + port
* Path

Identifier :
* Namespace: hdfs://{namenode host}:{namenode port}
* Scheme = hdfs
* Authority = {namenode host}:{namenode port}
* Unique name: {path}
* URI = hdfs://{namenode host}:{namenode port}{path}

## Schedulers
### Airflow

Naming hierarchy:
* job => workspace/DAG
* Task => unique within the job

Identifier :
* Namespace: {scheduler namespace}
* Unique name: {job}.{task}

0 comments on commit a9d34a3

Please sign in to comment.