
Releases: MobileTeleSystems/onetl

0.11.1 (2024-05-29)


Features

  • Change MSSQL.port default from 1433 to None, allowing use of instanceName to detect port number. (#287)
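    A minimal sketch of relying on instanceName instead of an explicit port; the host, credentials and instance name below are placeholders, and spark is an existing SparkSession:

    from onetl.connection import MSSQL

    mssql = MSSQL(
        host="mssql.domain.com",
        # port is left unset (None), so it is resolved from the instance name
        user="onetl",
        password="...",
        database="mydb",
        extra={"instanceName": "SQLEXPRESS"},
        spark=spark,
    )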

Bug Fixes

  • Remove fetchsize from JDBC.WriteOptions. (#288)

0.11.0 (2024-05-27)


Breaking Changes

There may be some changes in connection behavior related to version upgrades, so we mark these changes as breaking, although most users will not notice any difference.

  • Update Clickhouse JDBC driver to latest version (#249):

    • Package was renamed: ru.yandex.clickhouse:clickhouse-jdbc → com.clickhouse:clickhouse-jdbc.
    • Package version changed: 0.3.2 → 0.6.0-patch5.
    • Driver name changed: ru.yandex.clickhouse.ClickHouseDriver → com.clickhouse.jdbc.ClickHouseDriver.

    This brings several fixes for Spark <-> Clickhouse type compatibility, as well as support for Clickhouse clusters.

  • Update other JDBC drivers to latest versions:

    • MSSQL 12.2.0 → 12.6.2 (#254).
    • MySQL 8.0.33 → 8.4.0 (#253, #285).
    • Oracle 23.2.0.0 → 23.4.0.24.05 (#252, #284).
    • Postgres 42.6.0 → 42.7.3 (#251).
  • Update MongoDB connector to latest version: 10.1.1 → 10.3.0 (#255, #283).

    This brings Spark 3.5 support.

  • Update XML package to latest version: 0.17.0 → 0.18.0 (#259).

    This brings a few bugfixes for datetime format handling.

  • For JDBC connections, add a new SQLOptions class for the DB.sql(query, options=...) method (#272).

    Firstly, this keeps naming more consistent.

    Secondly, some options are not supported by the DB.sql(...) method but are supported by DBReader. For example, SQLOptions does not support partitioning_mode and requires explicit lower_bound and upper_bound values when num_partitions is greater than 1. ReadOptions does support partitioning_mode and allows skipping lower_bound and upper_bound values.

    This requires some code changes. Before:

    from onetl.connection import Postgres
    
    postgres = Postgres(...)
    df = postgres.sql(
        """
        SELECT *
        FROM some.mytable
        WHERE key = 'something'
        """,
        options=Postgres.ReadOptions(
            partitioning_mode="range",
            partition_column="id",
            num_partitions=10,
        ),
    )

    After:

    from onetl.connection import Postgres
    
    postgres = Postgres(...)
    df = postgres.sql(
        """
        SELECT *
        FROM some.mytable
        WHERE key = 'something'
        """,
        options=Postgres.SQLOptions(
            # partitioning_mode is not supported!
            partition_column="id",
            num_partitions=10,
            lower_bound=0,  # <-- set explicitly
            upper_bound=1000,  # <-- set explicitly
        ),
    )

    For now, DB.sql(query, options=...) can accept ReadOptions to keep backward compatibility, but emits a deprecation warning. This support will be removed in v1.0.0.

  • Split up JDBCOptions class into FetchOptions and ExecuteOptions (#274).

    New classes are used by DB.fetch(query, options=...) and DB.execute(query, options=...) methods respectively. This is mostly to keep naming more consistent.

    This requires some code changes. Before:

    from onetl.connection import Postgres
    
    postgres = Postgres(...)
    df = postgres.fetch(
        "SELECT * FROM some.mytable WHERE key = 'something'",
        options=Postgres.JDBCOptions(
            fetchsize=1000,
            query_timeout=30,
        ),
    )
    
    postgres.execute(
        "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
        options=Postgres.JDBCOptions(query_timeout=30),
    )

    After:

    from onetl.connection import Postgres
    
    # Using FetchOptions for fetching data
    postgres = Postgres(...)
    df = postgres.fetch(
        "SELECT * FROM some.mytable WHERE key = 'something'",
        options=Postgres.FetchOptions(  # <-- change class name
            fetchsize=1000,
            query_timeout=30,
        ),
    )
    
    # Using ExecuteOptions for executing statements
    postgres.execute(
        "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
        options=Postgres.ExecuteOptions(query_timeout=30),  # <-- change class name
    )

    For now, DB.fetch(query, options=...) and DB.execute(query, options=...) can accept JDBCOptions, to keep backward compatibility, but emit a deprecation warning. The old class will be removed in v1.0.0.

  • Serialize ColumnDatetimeHWM to Clickhouse's DateTime64(6) (precision up to microseconds) instead of DateTime (precision up to seconds) (#267).

    In previous onETL versions, the ColumnDatetimeHWM value was rounded down to the whole second, which caused rows already read in previous runs to be read again, producing duplicates.

    For Clickhouse versions below 21.1, comparing a column of type DateTime with a value of type DateTime64 is not supported and returns an empty dataframe. To avoid this, replace:

    DBReader(
        ...,
        hwm=DBReader.AutoDetectHWM(
            name="my_hwm",
            expression="hwm_column",  # <--
        ),
    )

    with:

    DBReader(
        ...,
        hwm=DBReader.AutoDetectHWM(
            name="my_hwm",
            expression="CAST(hwm_column AS DateTime64)",  # <-- add explicit CAST
        ),
    )
  • Pass JDBC connection extra params as a properties dict instead of a URL query part (#268).

    This allows passing custom connection parameters like Clickhouse(extra={"custom_http_options": "option1=value1,option2=value2"}) without the need to urlencode the parameter value, like option1%3Dvalue1%2Coption2%3Dvalue2.
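    A minimal sketch of passing such parameters; the connection details below are placeholders, and spark is an existing SparkSession:

    from onetl.connection import Clickhouse

    clickhouse = Clickhouse(
        host="clickhouse.domain.com",
        user="onetl",
        password="...",
        # values are passed as-is, no urlencode required
        extra={"custom_http_options": "option1=value1,option2=value2"},
        spark=spark,
    )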

Features

Improve user experience with Kafka messages and database tables containing serialized columns, like JSON/XML.

  • Allow passing a custom package version as an argument to the DB.get_packages(...) method of several DB connectors:

    • Clickhouse.get_packages(package_version=..., apache_http_client_version=...) (#249).
    • MongoDB.get_packages(scala_version=..., spark_version=..., package_version=...) (#255).
    • MySQL.get_packages(package_version=...) (#253).
    • MSSQL.get_packages(java_version=..., package_version=...) (#254).
    • Oracle.get_packages(java_version=..., package_version=...) (#252).
    • Postgres.get_packages(package_version=...) (#251).
    • Teradata.get_packages(package_version=...) (#256).

    Now users can downgrade or upgrade a connector without waiting for the next onETL release. Previously, only Kafka and Greenplum supported this feature.
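    For example, a sketch of pinning a specific driver version while building the Spark session (the version below is only an illustration):

    from pyspark.sql import SparkSession

    from onetl.connection import Clickhouse

    maven_packages = Clickhouse.get_packages(package_version="0.6.0-patch5")
    spark = (
        SparkSession.builder.appName("onetl")
        .config("spark.jars.packages", ",".join(maven_packages))
        .getOrCreate()
    )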

  • Add FileFormat.parse_column(...) method to several classes:

    • Avro.parse_column(col) (#265).
    • JSON.parse_column(col, schema=...) (#257).
    • CSV.parse_column(col, schema=...) (#258).
    • XML.parse_column(col, schema=...) (#269).

    This allows parsing data in the value field of a Kafka message, or a string/binary column of some table, as a nested Spark structure (see the combined sketch after the next item).

  • Add FileFormat.serialize_column(...) method to several classes:

    • Avro.serialize_column(col) (#265).
    • JSON.serialize_column(col) (#257).
    • CSV.serialize_column(col) (#258).

    This allows saving Spark nested structures or arrays to the value field of a Kafka message or a string/binary column of some table.
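    A minimal sketch of a Kafka value round-trip using both parse_column and serialize_column; the kafka_df dataframe, schema and column names are illustrative:

    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    from onetl.file.format import JSON

    schema = StructType(
        [
            StructField("id", IntegerType()),
            StructField("name", StringType()),
        ]
    )

    json = JSON()
    # parse the string/binary "value" column into a nested struct
    parsed_df = kafka_df.select(json.parse_column("value", schema=schema).alias("value"))
    # serialize the struct back to a string column before writing to Kafka
    serialized_df = parsed_df.select(json.serialize_column("value").alias("value"))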

Improvements

A few documentation improvements.

  • Replace all asserts in documentation with doctest syntax. This should make the documentation more readable (#273).

  • Add generic Troubleshooting guide (#275).

  • Improve Kafka documentation:

    • Add "Prerequisites" page describing different aspects of connecting to Kafka.
    • Improve "Reading from" and "Writing to" page of Kafka documentation, add more examples and usage notes.
    • Add "Troubleshooting" page (#276).
  • Improve Hive documentation:

    • Add "Prerequisites" page describing different aspects of connecting to Hive.
    • Improve "Reading from" and "Writing to" page of Hive documentation, add more examples and recommendations.
    • Improve "Executing statements in Hive" page of Hive documentation. (#278).
  • Add "Prerequisites" page describing different aspects of using SparkHDFS and SparkS3 connectors. (#279).

  • Add note about connecting to Clickhouse cluster. (#280).

  • Add notes about the versions in which a specific class/method/attribute/argument was added, renamed, or changed its behavior (#282).

Bug Fixes

  • Fix pysmb package missing after installing onetl with pip install onetl[files].

0.10.2 (2024-03-21)


Features

  • Add support of Pydantic v2. (#230)

Improvements

  • Improve database connections documentation:

    • Add "Types" section describing mapping between Clickhouse and Spark types
    • Add "Prerequisites" section describing different aspects of connecting to Clickhouse
    • Separate documentation of DBReader and .sql() / .pipeline(...)
    • Add examples for .fetch() and .execute() (#211, #228, #229, #233, #234, #235, #236, #240)
  • Add notes to Greenplum documentation about issues with IP resolution and building gpfdist URL (#228)

  • Allow calling MongoDB.pipeline(...) with just a collection name, without an explicit aggregation pipeline (see the sketch after this list). (#237)

  • Update default Postgres(extra={...}) to include the {"stringtype": "unspecified"} option. This allows writing text data to a non-text column (or vice versa), relying on Postgres cast capabilities.

    For example, it is now possible to read a column of type money as Spark's StringType() and write it back to the same column, without using intermediate columns or tables. (#229)
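A minimal sketch of the simplified MongoDB.pipeline(...) call mentioned above; the connection details are placeholders, and spark is an existing SparkSession:

    from onetl.connection import MongoDB

    mongo = MongoDB(
        host="mongodb.domain.com",
        user="onetl",
        password="...",
        database="mydb",
        spark=spark,
    )

    # no explicit aggregation pipeline, just the collection name
    df = mongo.pipeline(collection="my_collection")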

Bug Fixes

  • Bring back handling of DBReader(columns="string"). This was valid syntax up to the v0.10 release, but it was removed because most users never used it. It looks like we were wrong, so this behavior is returned, but with a deprecation warning. (#238)

  • Downgrade Greenplum package version from 2.3.0 to 2.2.0. (#239)

    This is because version 2.3.0 introduced issues with writing data to Greenplum 6.x. The connector can open a transaction with a SELECT * FROM table LIMIT 0 query, but does not close it, which leads to deadlocks.

    To use this connector with Greenplum 7.x, please pass the package version explicitly:

    maven_packages = Greenplum.get_packages(package_version="2.3.0", ...)

0.10.1 (2024-02-05)


Features

  • Add support for Incremental Strategies for the Kafka connection:

    reader = DBReader(
        connection=Kafka(...),
        source="topic_name",
        hwm=AutoDetectHWM(name="some_hwm_name", expression="offset"),
    )
    
    with IncrementalStrategy():
        df = reader.run()

    This lets you resume reading data from a Kafka topic starting at the last committed offset from your previous run. (#202)

  • Add has_data and raise_if_no_data methods to the DBReader class (see the sketch after this list). (#203)

  • Update VMware Greenplum connector from 2.1.4 to 2.3.0. This implies:

  • Greenplum.get_packages() method now accepts an optional argument package_version, which allows overriding the version of the Greenplum connector package. (#208)
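A minimal sketch of the new DBReader methods mentioned above, assuming an already configured postgres connection:

    from onetl.db import DBReader

    reader = DBReader(connection=postgres, source="schema.table")

    # has_data() checks whether the source has rows to read
    if reader.has_data():
        df = reader.run()

    # raise_if_no_data() raises an exception when it does not
    reader.raise_if_no_data()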

0.10.0 (2023-12-18)


Breaking Changes

  • Upgrade etl-entities from v1 to v2 (#172).

    This implies that HWM classes now have a different internal structure than they used to.

    etl-entities < 2:

    from etl_entities.old_hwm import IntHWM as OldIntHWM
    from etl_entities.source import Column, Table
    from etl_entities.process import Process

    hwm = OldIntHWM(
        process=Process(name="myprocess", task="abc", dag="cde", host="myhost"),
        source=Table(name="schema.table", instance="postgres://host:5432/db"),
        column=Column(name="col1"),
        value=123,
    )

    etl-entities >= 2:

    from etl_entities.hwm import ColumnIntHWM

    hwm = ColumnIntHWM(
        name="some_unique_name",
        description="any value you want",
        source="schema.table",
        expression="col1",
        value=123,
    )

    Breaking change: if you used HWM classes from the etl_entities module, you should rewrite your code to make it compatible with the new version.

    More details

    • HWM classes used by previous onETL versions were moved from etl_entities to the etl_entities.old_hwm submodule. They remain for compatibility reasons, but are planned to be removed in the etl-entities v3 release.
    • New HWM classes have a flat structure instead of a nested one.
    • New HWM classes have a mandatory name attribute (previously known as qualified_name).
    • Type aliases used while serializing and deserializing HWM objects to a dict representation were changed too, e.g. int -> column_int.

    To make migration simpler, you can use the new method:

    old_hwm = OldIntHWM(...)
    new_hwm = old_hwm.as_new_hwm()

    This automatically converts all fields from the old structure to the new one, including qualified_name -> name.

  • Breaking changes:

    • Methods BaseHWMStore.get() and BaseHWMStore.save() were renamed to get_hwm() and set_hwm().
    • They can now be used only with new HWM classes from etl_entities.hwm; old HWM classes are not supported.

    If you used them in your code, please update it accordingly.
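    For example (a sketch, assuming hwm_store is any BaseHWMStore implementation and hwm is an HWM object from etl_entities.hwm):

    # onETL 0.9.x and older:
    # hwm_store.save(hwm)
    # hwm = hwm_store.get("some_unique_name")

    # onETL 0.10.x and newer:
    hwm_store.set_hwm(hwm)
    hwm = hwm_store.get_hwm("some_unique_name")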

  • YAMLHWMStore CANNOT read files created by older onETL versions (0.9.x or older).

    Upgrade procedure
    # pip install onetl==0.9.5
    
    # Get qualified_name for HWM
    
    
    # Option 1. HWM is built manually
    from etl_entities import IntHWM, FileListHWM
    from etl_entities.source import Column, Table, RemoteFolder
    from etl_entities.process import Process
    
    # for column HWM
    old_column_hwm = IntHWM(
        process=Process(name="myprocess", task="abc", dag="cde", host="myhost"),
        source=Table(name="schema.table", instance="postgres://host:5432/db"),
        column=Column(name="col1"),
    )
    qualified_name = old_column_hwm.qualified_name
    # "col1#schema.table@postgres://host:5432/db#cde.abc.myprocess@myhost"
    
    # for file HWM
    old_file_hwm = FileListHWM(
        process=Process(name="myprocess", task="abc", dag="cde", host="myhost"),
        source=RemoteFolder(name="/absolute/path", instance="ftp://ftp.server:21"),
    )
    qualified_name = old_file_hwm.qualified_name
    # "file_list#/absolute/path@ftp://ftp.server:21#cde.abc.myprocess@myhost"
    
    
    # Option 2. HWM is generated automatically (by DBReader/FileDownloader)
    # See onETL logs and search for string like qualified_name = '...'
    
    qualified_name = "col1#schema.table@postgres://host:5432/db#cde.abc.myprocess@myhost"
    
    
    # Get .yml file path by qualified_name
    
    import os
    from pathlib import PurePosixPath
    from onetl.hwm.store import YAMLHWMStore
    
    # here you should pass the same arguments as used on production, if any
    yaml_hwm_store = YAMLHWMStore()
    hwm_path = yaml_hwm_store.get_file_path(qualified_name)
    print(hwm_path)
    
    # for column HWM
    # LocalPosixPath('/home/maxim/.local/share/onETL/yml_hwm_store/col1__schema.table__postgres_host_5432_db__cde.abc.myprocess__myhost.yml')
    
    # for file HWM
    # LocalPosixPath('/home/maxim/.local/share/onETL/yml_hwm_store/file_list__absolute_path__ftp_ftp.server_21__cde.abc.myprocess__myhost.yml')
    
    
    # Read raw .yml file content
    
    from yaml import safe_load, dump
    
    raw_old_hwm_items = safe_load(hwm_path.read_text())
    print(raw_old_hwm_items)
    
    # for column HWM
    # [
    #   {
    #     "column": { "name": "col1", "partition": {} },
    #     "modified_time": "2023-12-18T10: 39: 47.377378",
    #     "process": { "dag": "cde", "host": "myhost", "name": "myprocess", "task": "abc" },
    #     "source": { "instance": "postgres: //host:5432/db", "name": "schema.table" },
    #     "type": "int",
    #     "value": "123",
    #   },
    # ]
    
    # for file HWM
    # [
    #   {
    #     "modified_time": "2023-12-18T11:15:36.478462",
    #     "process": { "dag": "cde", "host": "myhost", "name": "myprocess", "task": "abc" },
    #     "source": { "instance": "ftp://ftp.server:21", "name": "/absolute/path" },
    #     "type": "file_list",
    #     "value": ["file1.txt", "file2.txt"],
    #   },
    # ]
    
    
    # Convert file content to new structure, compatible with onETL 0.10.x
    raw_new_hwm_items = []
    for old_hwm in raw_old_hwm_items:
        new_hwm = {"name": qualified_name, "modified_time": old_hwm["modified_time"]}
    
        if "column" in old_hwm:
            new_hwm["expression"] = old_hwm["column"]["name"]
        new_hwm["entity"] = old_hwm["source"]["name"]
        old_hwm.pop("process", None)
    
        if old_hwm["type"] == "int":
            new_hwm["type"] = "column_int"
            new_hwm["value"] = old_hwm["value"]
    
        elif old_hwm["type"] == "date":
            new_hwm["type"] = "column_date"
            new_hwm["value"] = old_hwm["value"]
    
        elif old_hwm["type"] == "datetime":
            new_hwm["type"] = "column_datetime"
            new_hwm["value"] = old_hwm["value"]
    
        elif old_hwm["type"] == "file_list":
            new_hwm["type"] = "file_list"
            new_hwm["value"] = [
                os.fspath(PurePosixPath(old_hwm["source"]["name"]).joinpath(path))
                for path in old_hwm["value"]
            ]
    
        else:
            raise ValueError("WAT?")
    
        raw_new_hwm_items.append(new_hwm)
    
    
    print(raw_new_hwm_items)
    # for column HWM
    # [
    #   {
    #     "name": "col1#schema.table@postgres://host:5432/db#cde.abc.myprocess@myhost",
    #     "modified_time": "2023-12-18T10:39:47.377378",
    #     "expression": "col1",
    #     "source": "schema.table",
    #     "type": "column_int",
    #     "value": 123,
    #   },
    # ]
    
    # for file HWM
    # [
    #   {
    #     "name": "file_list#/absolute/path@ftp://ftp.server:21#cde.abc.myprocess@myhost",
    #     "modified_time": "2023-12-18T11:15:36.478462",
    #     "entity": "/absolute/path",
    #     "type": "file_list",
    #     "value": ["/absolute/path/file1.txt", "/absolute/path/file2.txt"],
    #   },
    # ]
    
    
    # Save file with new content
    with open(hwm_path, "w") as file:
        dump(raw_new_hwm_items, file)
    
    
    # Stop Python interpreter and update onETL
    # pip install onetl==0.10.0
    # Check that new .yml file can be read
    
    from onetl.hwm.store import YAMLHWMStore
    
    qualified_name = ...
    
    # here you should pass the same arguments as used on production, if any
    yaml_hwm_store = YAMLHWMStore()
    yaml_hwm_store.get_hwm(qualified_name)
    
    # for column HWM
    # ColumnIntHWM(
    #     name='col1#schema.table@postgres://host:5432/db#cde.abc.myprocess@myhost',
    #     description='',
    #     entity='schema.table',
    #     value=123,
    #     expression='col1',
    #     modified_time=datetime.datetime(2023, 12, 18, 10, 39, 47, 377378),
    # )
    
    # for file HWM
    # FileListHWM(
    #     name='file_list#/absolute/path@ftp://ftp.server:21#cde.abc.myprocess@myhost',
    #     description='',
    #     entity=AbsolutePath('/absolute/path'),
    #     value=frozenset({AbsolutePath('/absolute/path/file1.txt'), AbsolutePath('/absolute/path/file2.txt')}),
    #     expression=None,
    #     modified_time=datetime.datetime(2023, 12, 18, 11, 15, 36, 478462)
    # )
    
    
    # That's all!

    However, most users use other HWM store implementations, which do not have this issue.

  • Several classes and functions were moved from onetl to etl_entities:

    onETL 0.9.x and older:

    from onetl.hwm.store import (
        detect_hwm_store,
        BaseHWMStore,
        HWMStoreClassRegistry,
        register_hwm_store_class,
        HWMStoreManager,
        MemoryHWMStore,
    )

    onETL 0.10.x and newer:

    from etl_entities.hwm_store import (
        detect_hwm_store,
        BaseHWMStore,
        HWMStoreClassRegistry,
        register_hwm_store_class,
        HWMStoreManager,
        MemoryHWMStore,
    )

    They s...

Read more

0.9.5 (2023-10-10)


Features

  • Add XML file format support (see the sketch after this list). (#163)
  • Tested compatibility with Spark 3.5.0. MongoDB and Excel are not supported yet, but other packages are. (#159)
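A minimal sketch of reading XML files with the new format class; the connection, path and row tag below are placeholders:

    from onetl.file import FileDFReader
    from onetl.file.format import XML

    reader = FileDFReader(
        connection=spark_hdfs,  # an existing SparkHDFS / SparkS3 / SparkLocalFS connection
        source_path="/data/xml",
        format=XML(row_tag="item"),
    )
    df = reader.run()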

Improvements

  • Add a check to all DB and FileDF connections that the Spark session is alive. (#164)

Bug Fixes

  • Fix Hive.check() behavior when Hive Metastore is not available. (#164)

0.9.4 (2023-09-26)


Features

  • Add Excel file format support. (#148)
  • Add Samba file connection. It is now possible to download and upload files to Samba shared folders using FileDownloader/FileUploader (see the sketch after this list). (#150)
  • Add if_exists="ignore" and error to Hive.WriteOptions (#143)
  • Add if_exists="ignore" and error to JDBC.WriteOptions (#144)
  • Add if_exists="ignore" and error to MongoDB.WriteOptions (#145)

Improvements

  • Add documentation about different ways of passing packages to Spark session. (#151)

  • Drastically improve Greenplum documentation:

    • Added information about network ports, grants, pg_hba.conf and so on.
    • Added interaction schemas for reading, writing and executing statements in Greenplum.
    • Added recommendations about reading data from views and JOIN results from Greenplum. (#154)
  • Make .fetch and .execute methods of DB connections thread-safe. Each thread works with its own connection. (#156)

  • Call .close() on FileConnection when it is removed by the garbage collector. (#156)

Bug Fixes

  • Fix issue where stopping the Python interpreter calls JDBCMixin.close() and prints exceptions to the log. (#156)

0.9.3 (2023-09-06)


Bug Fixes

  • Fix documentation build

0.9.2 (2023-09-06)


Features

  • Add if_exists="ignore" and error to Greenplum.WriteOptions (#142)

Improvements

  • Improve validation messages while writing dataframe to Kafka. (#131)
  • Improve documentation:
    • Add notes about reading and writing to database connections documentation
    • Add notes about executing statements in JDBC and Greenplum connections

Bug Fixes

  • Fixed validation of the headers column being written to Kafka with the default Kafka.WriteOptions() - the default value was False, but instead of raising an exception, the column value was just silently ignored. (#131)
  • Fix reading data from Oracle with partitioningMode="range" without explicitly setting lowerBound / upperBound. (#133)
  • Update Kafka documentation with SSLProtocol usage. (#136)
  • Raise an exception if someone tries to read data from a Kafka topic which does not exist. (#138)
  • Allow passing Kafka topics with names like some.topic.name to DBReader. Same for MongoDB collections. (#139)

0.9.1 (2023-08-17)


Bug Fixes

  • Fixed a bug where the number of threads created by FileDownloader / FileUploader / FileMover was not min(workers, len(files)) but max(workers, len(files)), leading to creating too many workers for large file lists.