forked from apache/airflow
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Create unique naming spec (apache#31)
* crate unique naming spec
- Loading branch information
1 parent
b879e75
commit a9d34a3
Showing
1 changed file
with
120 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,120 @@ | ||
# Naming | ||
|
||
We define the unique name strategy per datasource to ensure it is followed uniformly independently from the type of job consuming and producing data. | ||
The namespace for a dataset is the unique name for its datasource. | ||
For jobs it is the scheduler. | ||
|
||
## Datasets: | ||
The namespace and name of a datasource can be combined to form a URI (scheme:[//authority]path) | ||
* Namespace = scheme:[//authority] (the datasource) | ||
* Name = path (the datasets) | ||
|
||
### Data warehouses/data bases. | ||
Datasets are called tables. Tables are organised in databases and schemas. | ||
#### Postgres: | ||
Datasource hierarchy: | ||
* Host | ||
* Port | ||
|
||
Naming hierarchy: | ||
* Database | ||
* Schema | ||
* Table | ||
|
||
Identifier: | ||
* Namespace: postgres://{host}:{port} of the service instance. | ||
* Scheme = postgres | ||
* Authority = {host}:{port} | ||
* Unique name: {database}.{schema}.{table} | ||
* URI = postgres://{host}:{port}/{database}.{schema}.{table} | ||
|
||
#### Snowflake | ||
See: [Object Identifiers — Snowflake Documentation](https://docs.snowflake.com/en/sql-reference/identifiers.html) | ||
|
||
Datasource hierarchy: | ||
* account name | ||
|
||
Naming hierarchy: | ||
* Database: {database name} => unique across the account | ||
* Schema: {schema name} => unique within the database | ||
* Table: {table name} => unique within the schema | ||
|
||
Identifier: | ||
* Namespace: snowflake://{account name} | ||
* Scheme = snowflake | ||
* Authority = {account name} | ||
* Name: {database}.{schema}.{table} | ||
* URI = snowflake://{account name}/{database}.{schema}.{table} | ||
|
||
#### BigQuery | ||
See: | ||
[Creating and managing projects | Resource Manager Documentation](https://cloud.google.com/resource-manager/docs/creating-managing-projects) | ||
[Introduction to datasets | BigQuery](https://cloud.google.com/bigquery/docs/datasets-intro) | ||
[Introduction to tables | BigQuery](https://cloud.google.com/bigquery/docs/tables-intro) | ||
|
||
Datasource hierarchy: | ||
* bigquery | ||
|
||
Naming hierarchy: | ||
* Project Name: {project name} => is not unique | ||
* Project number: {project number} => numeric: is unique across google cloud | ||
* Project ID: {project id} => readable: is unique across google cloud | ||
* dataset: {dataset name} => is unique within a project | ||
* table: {table name} => is unique within a dataset | ||
|
||
Identifier : | ||
* Namespace: bigquery | ||
* Scheme = bigquery | ||
* Authority = | ||
* Unique name: {project id}.{dataset name}.{table name} | ||
* URI = bigquery:{project id}.{schema}.{table} | ||
|
||
### Distributed file systems/blob stores | ||
#### GCS | ||
Datasource hierarchy: none, naming is global | ||
|
||
Naming hierarchy: | ||
* bucket name => globally unique | ||
* Path | ||
|
||
Identifier : | ||
* Namespace: gs://{bucket name} | ||
* Scheme = gs | ||
* Authority = {bucket name} | ||
* Unique name: {path} | ||
* URI = gs://{bucket name}{path} | ||
|
||
#### S3 | ||
Naming hierarchy: | ||
* bucket name => globally unique | ||
* Path | ||
|
||
Identifier : | ||
* Namespace: s3://{bucket name} | ||
* Scheme = s3 | ||
* Authority = {bucket name} | ||
* Unique name: {path} | ||
* URI = s3://{bucket name}{path} | ||
|
||
#### HDFS | ||
Naming hierarchy: | ||
* Namenode: host + port | ||
* Path | ||
|
||
Identifier : | ||
* Namespace: hdfs://{namenode host}:{namenode port} | ||
* Scheme = hdfs | ||
* Authority = {namenode host}:{namenode port} | ||
* Unique name: {path} | ||
* URI = hdfs://{namenode host}:{namenode port}{path} | ||
|
||
## Schedulers | ||
### Airflow | ||
|
||
Naming hierarchy: | ||
* job => workspace/DAG | ||
* Task => unique within the job | ||
|
||
Identifier : | ||
* Namespace: {scheduler namespace} | ||
* Unique name: {job}.{task} |