Optimize Snowflake load_file using native COPY INTO #544

tatiana · 2022-07-15T13:59:09Z

Fix: #430
closes: #481

Reduced the time to load 5 GB datasets into Snowflake by about 78% (from 24.46 min to 5.49 min). Further details are in the results file in this PR.
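As a quick sanity check, the relative reduction and the speedup factor implied by the two timings quoted above can be computed directly. This is only arithmetic on the numbers in the description; the helper name `relative_reduction` is made up for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Timings (in minutes) quoted in the PR description for the 5 GB dataset.
before_min = 24.46
after_min = 5.49

reduction = relative_reduction(before_min, after_min)
speedup = before_min / after_min

print(f"reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

The new load time is a little under a quarter of the old one, which corresponds to a reduction of roughly three quarters.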

Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>

codecov · 2022-07-18T12:11:18Z

Codecov Report

Merging #544 (17c69ac) into main (a0d1912) will increase coverage by 0.08%.
The diff coverage is 96.29%.

❗ Current head 17c69ac differs from pull request most recent head 4737231. Consider uploading reports for the commit 4737231 to get more accurate results

@@            Coverage Diff             @@
##             main     #544      +/-   ##
==========================================
+ Coverage   92.59%   92.68%   +0.08%     
==========================================
  Files          40       40              
  Lines        1594     1613      +19     
  Branches      205      206       +1     
==========================================
+ Hits         1476     1495      +19     
  Misses         93       93              
  Partials       25       25

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/astro/databases/snowflake.py | `95.69% <96.29%> (+0.48%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a0d1912...4737231.

sunank200

Looks good to me with a minor comment. I think code coverage might need to be improved.

sunank200 · 2022-07-19T06:01:29Z

src/astro/databases/snowflake.py

+        :param source_file: File from which we need to transfer data
+        :param target_table: Table to which the content of the file will be loaded to
+        :param if_exists: Strategy used to load (currently supported: "append" or "replace")
+        :param native_support_kwargs: kwargs to be used by method involved in native support flow


Would it be useful for users to know which native_support_kwargs are supported? Or maybe add a reference to the COPY INTO documentation?

@sunank200, please, have a look now!
I had already added the reference to the official docs at the bottom of the docstring. Do you think I should add another one, or move it somewhere else?
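To make the question about supported options more concrete, here is a minimal sketch of how a dictionary of keyword arguments could be rendered into a Snowflake `COPY INTO` statement. It is not the implementation in this PR: the function `render_copy_into`, its parameters, and the rendering rules are hypothetical, and the accepted values of each copy option should be checked against the official COPY INTO reference.

```python
import re
from typing import Mapping, Optional

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$.]*$")


def _render_value(value: object) -> str:
    """Render a Python value as a SQL literal (illustrative rules only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("'", "''")
    return f"'{escaped}'"


def render_copy_into(
    table: str,
    stage: str,
    file_type: str = "CSV",
    copy_options: Optional[Mapping[str, object]] = None,
) -> str:
    """Build a COPY INTO statement (hypothetical helper, not the astro-sdk API)."""
    for name in (table, stage, file_type, *(copy_options or {})):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    parts = [
        f"COPY INTO {table}",
        f"FROM @{stage}",
        f"FILE_FORMAT = (TYPE = {file_type})",
    ]
    for key, value in (copy_options or {}).items():
        parts.append(f"{key.upper()} = {_render_value(value)}")
    return " ".join(parts)


sql = render_copy_into(
    "my_table", "my_stage", copy_options={"purge": True, "size_limit": 1000}
)
print(sql)
```

Listing the accepted keys next to a helper like this, or linking to the COPY INTO reference from the docstring as discussed above, tells users which options they can pass.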

sunank200

I assume the test is calling the wrong method. It should have called load_file_to_table_natively(). That's the reason coverage shows this test as not being run.

tests/databases/test_snowflake.py

Fix: #430 Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>

Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>

tatiana · 2022-07-26T01:10:31Z

We added the benchmark results to the PR and fixed the broken tests.

Two checks are probably failing because of pypa/pip#11298, which affects all branches. I just re-ran the tests for an unchanged branch that was passing 4h ago, and the same tests failed.

I've contacted TP to get his advice, and in the meantime, we'll merge this PR.

tatiana · 2022-07-26T09:05:59Z

example_dags/example_amazon_s3_snowflake_transform.py

    """


-@aql.dataframe(identifiers_as_lower=False)
+@aql.dataframe()


We should be able to handle capitalization more consistently across dataframes and file loading once #564 is implemented.

tatiana · 2022-07-26T09:07:27Z

example_dags/example_amazon_s3_snowflake_transform.py

@@ -67,11 +67,11 @@ def example_amazon_s3_snowflake_transform():
    )

    temp_table_1 = aql.load_file(
-        input_file=File(path=f"{s3_bucket}/ADOPTION_CENTER_1.csv"),
+        input_file=File(path=f"{s3_bucket}/ADOPTION_CENTER_1_unquoted.csv"),


Previously, the CSV was in an unusual format:

"header1","header2" "value1","value2"

The COPY INTO command did not accept this format, so we decided that the file could be cleaned beforehand. Users may face similar issues in the future.
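For users who hit a similar problem, a cleanup step of this kind can be sketched with the standard library. This assumes the issue is that every field is wrapped in double quotes; the exact defect in the original file is not spelled out here, and the function name `rewrite_csv_unquoted` is hypothetical.

```python
import csv
import io
from typing import TextIO


def rewrite_csv_unquoted(source: TextIO, target: TextIO) -> int:
    """Copy CSV rows from source to target, quoting fields only when needed.

    Returns the number of rows written. Empty lines are skipped.
    """
    reader = csv.reader(source)
    # QUOTE_MINIMAL quotes a field only if it contains a delimiter,
    # a quote character, or a line break.
    writer = csv.writer(target, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    rows = 0
    for row in reader:
        if not row:
            continue
        writer.writerow(row)
        rows += 1
    return rows


# In-memory demonstration; real files would be opened with newline="".
quoted = io.StringIO('"header1","header2"\n"value1","value2"\n')
cleaned = io.StringIO()
count = rewrite_csv_unquoted(quoted, cleaned)
print(cleaned.getvalue())
```

Fields that actually need quoting, such as values containing commas, stay quoted, so the rewrite does not change the data.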

tatiana · 2022-07-26T09:08:27Z

tests/benchmark/results.md

+| snowflake  | hundred_kb | 9.19s        | 45.4MB       | 2.55s           | 120.0ms           |
+| snowflake  | ten_mb     | 10.9s        | 47.51MB      | 2.58s           | 160.0ms           |
+| snowflake  | one_gb     | 1.07min      | 47.94MB      | 8.7s            | 5.67s             |
+| snowflake  | five_gb    | 5.49min      | 53.69MB      | 18.76s          | 1.6s              |


I'll log a ticket for us to further investigate why we weren't able to run the benchmark with 10 GB for this implementation.

tatiana · 2022-07-26T09:09:24Z

src/astro/databases/snowflake.py

+        # Snowflake doesn't handle well mixed capitalisation of column name chars
+        # we are handling this more gracefully in a separate PR
+        if dataframe is not None:
+            dataframe.columns.str.upper()


We should be able to handle capitalization more consistently across dataframes and file loading once #564 is implemented.
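One detail worth noting about the quoted snippet: in pandas, `Index.str.upper()` returns a new Index and does not modify the dataframe in place, so a call whose result is discarded leaves the column names unchanged. A minimal sketch with a made-up two-column dataframe:

```python
import pandas as pd

# Hypothetical dataframe with mixed-case column names.
dataframe = pd.DataFrame({"Name": ["rex"], "age": [3]})

# The return value is discarded, so the columns are unchanged.
dataframe.columns.str.upper()
unchanged = list(dataframe.columns)

# Assigning the result back is what actually renames the columns.
dataframe.columns = dataframe.columns.str.upper()
renamed = list(dataframe.columns)

print(unchanged, renamed)
```

Whether uppercasing is the desired behavior at all is the question deferred to #564; the sketch only shows the difference between calling the method and assigning its result.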

tatiana force-pushed the issue-492-last branch 2 times, most recently from 507ba2a to 9aacba0 Compare July 18, 2022 10:55

tatiana changed the title ~~WIP: Add Snowflake load_file optimization~~ Optimize Snowflake load_file using native COPY INTO Jul 18, 2022

tatiana marked this pull request as ready for review July 18, 2022 12:10

sunank200 approved these changes Jul 19, 2022

sunank200 requested changes Jul 19, 2022

tests/databases/test_snowflake.py (resolved)

tatiana force-pushed the issue-492-last branch 2 times, most recently from 0d603ac to d5cd2cd Compare July 19, 2022 13:59

sunank200 approved these changes Jul 19, 2022

tatiana force-pushed the issue-492-last branch 4 times, most recently from db927c1 to a794944 Compare July 19, 2022 14:14

kaxil force-pushed the issue-492-last branch from ef945b9 to c5665f4 Compare July 20, 2022 04:54

tatiana force-pushed the issue-492-last branch 2 times, most recently from 3525121 to 2e68f2f Compare July 20, 2022 12:37

sunank200 force-pushed the issue-492-last branch from 95edc5c to 4f01214 Compare July 22, 2022 19:38

kaxil approved these changes Jul 23, 2022

utkarsharma2 force-pushed the issue-492-last branch from cc3742e to 4f01214 Compare July 25, 2022 08:28

tatiana and others added 10 commits July 25, 2022 14:20

Add Snowflake load_file optimization

da597a2

Fix: #430 Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>

Fix doc issue

ff28b86

Fix pre-commit

6a6c081

Add test for Snowflake load_file_to_table_natively

2d672ba

Fix benchmark

e523211

Fix native snowflake load_file

4e51fc0

Improve docs for native_support_kwargs

ef407c7

Fix snowflake optimisation retrieval of envvar

f57c7c7

Try to fix error loading adoption dataset

c74ff57

Add Snowflake load_file optimization

ecedb2b

Fix: #430 Co-authored-by: Ankit Chaurasia <ankit.chaurasia@astronomer.io>

tatiana and others added 6 commits July 25, 2022 14:25

Fix doc issue

c16abb0

Fix benchmark

4ff74dc

Fix native snowflake load_file

4a7f159

Fix snowflake optimisation retrieval of envvar

1c804b1

Fix autodetect table check schema before writing pandas df

178a9d3

Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>

Remove README in tests/data

bc58830

tatiana force-pushed the issue-492-last branch from 17c69ac to bc58830 Compare July 25, 2022 13:35

tatiana mentioned this pull request Jul 25, 2022

Support customization of column titles capitalization when using dataframe & load_file #564

Closed

4 tasks

This was referenced Jul 25, 2022

Column capitalization #565

Closed

Add 'columns_names_capitalization' param to control the cols case #567

Merged

Add results of Snowflake optimization

4737231

tatiana merged commit 461e63e into main Jul 26, 2022

tatiana deleted the issue-492-last branch July 26, 2022 01:11

tatiana commented Jul 26, 2022

tatiana mentioned this pull request Jul 28, 2022

Fix Snowflake native load for 10GB benchmark dataset #582

Closed

3 tasks
