Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-846438: Getting 0 rows from a TIMESTAMP_TZ column gives timezone-less pandas dtype #1612

Open
mvashishtha opened this issue Jun 22, 2023 · 5 comments
Assignees
Labels
bug status-in_progress Issue is worked on by the driver team triaged

Comments

@mvashishtha
Copy link
Contributor

Python version

Python 3.8.16 (default, Mar 1 2023, 21:19:10) [Clang 14.0.6 ]

Operating system and processor architecture

macOS-10.16-x86_64-i386-64bit

Installed packages

aiobotocore==2.5.0
aiohttp==3.8.4
aioitertools==0.11.0
aiosignal==1.3.1
appnope==0.1.3
asn1crypto==1.5.1
asttokens==2.2.1
async-timeout==4.0.2
attrs==23.1.0
backcall==0.2.0
bleach==6.0.0
boto3==1.26.148
botocore==1.29.153
cachetools==5.3.1
certifi==2023.5.7
cffi==1.15.1
charset-normalizer==2.1.1
cloudpickle==2.0.0
contourpy==1.0.7
cryptography==40.0.2
cycler==0.11.0
db-dtypes==1.1.1
decorator==5.1.1
docutils==0.20.1
duckdb==0.8.0
exceptiongroup==1.1.1
execnet==1.9.0
executing==1.2.0
filelock==3.12.0
fonttools==4.39.4
frozenlist==1.3.3
fsspec==2023.6.0
google-api-core==2.11.0
google-auth==2.19.1
google-cloud-bigquery==3.11.0
google-cloud-core==2.3.2
google-crc32c==1.5.0
google-resumable-media==2.5.0
googleapis-common-protos==1.59.0
grpcio==1.54.2
grpcio-status==1.54.2
idna==3.4
importlib-metadata==6.6.0
importlib-resources==5.12.0
iniconfig==2.0.0
ipdb==0.13.13
ipython==8.12.2
jaraco.classes==3.2.3
jedi==0.18.2
jmespath==1.0.1
joblib==1.2.0
keyring==23.13.1
kiwisolver==1.4.4
markdown-it-py==2.2.0
matplotlib==3.7.1
matplotlib-inline==0.1.6
mdurl==0.1.2
-e git+ssh://git@github.com/mvashishtha/modin.git@eeb410c87aa657d556d3cf2403195bdc76cc7193#egg=modin
more-itertools==9.1.0
multidict==6.0.4
Nuitka==1.6.1
numpy==1.24.3
ordered-set==4.1.0
oscrypto==1.3.0
packaging==23.1
pandas==1.5.3
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.5.0
pkginfo==1.9.6
pluggy==1.0.0
prompt-toolkit==3.0.38
proto-plus==1.22.2
protobuf==4.23.2
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==10.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pycryptodomex==3.18.0
Pygments==2.15.1
PyJWT==2.7.0
pyOpenSSL==23.2.0
pyparsing==3.0.9
pytest==7.3.1
pytest-mock==3.10.0
pytest-xdist==3.3.1
python-dateutil==2.8.2
pytz==2023.3
readme-renderer==37.3
requests==2.31.0
requests-toolbelt==1.0.0
rfc3986==2.0.0
rich==13.4.1
rsa==4.9
s3fs==0.4.2
s3transfer==0.6.1
schedule==1.2.0
scikit-learn==1.2.2
scipy==1.10.1
six==1.16.0
snowflake-connector-python==3.0.3
stack-data==0.6.2
threadpoolctl==3.1.0
tomli==2.0.1
traitlets==5.9.0
twine==4.0.2
typing_extensions==4.6.3
urllib3==1.26.16
wcwidth==0.2.6
webencodings==0.5.1
wrapt==1.15.0
yarl==1.9.2
zipp==3.15.0
zstandard==0.21.0

What did you do?

# con is a snowflake.connector.connection.SnowflakeConnection
cursor = con.cursor()
# first select all the rows
cursor.execute("""SELECT '2021-01-01 00:00:00 +0000'::timestamp_tz AS TIME""")
# i get dtypes with timezone: TIME    datetime64[ns, America/Los_Angeles]
print(cursor.fetch_pandas_all().dtypes)
# now filter out the only record.
cursor.execute("""SELECT * FROM (SELECT '2021-01-01 00:00:00 +0000'::timestamp_tz AS TIME) WHERE TIME IS NULL""")
print(cursor.fetch_pandas_all().dtypes)
# Now I lose the timezone. TIME    datetime64[ns]

What did you expect to see?

This experience doesn't match the pandas experience, where filtering out all the records from a column should preserve the column's dtype. I want to get datetime64[ns, America/Los_Angeles] (my session time zone) even when my query returns non records.

Can you set logging to DEBUG and collect the logs?

No response

@github-actions github-actions bot changed the title Getting 0 rows from a TIMESTAMP_TZ column gives timezone-less pandas dtype SNOW-846438: Getting 0 rows from a TIMESTAMP_TZ column gives timezone-less pandas dtype Jun 22, 2023
@sfc-gh-spanaite sfc-gh-spanaite self-assigned this Jun 26, 2023
@sfc-gh-aling
Copy link
Collaborator

thanks for reaching out, I can reproduce the issue.
I think this is a bug in the connector, we will look into why the timezone info got lost when there's no record

@sfc-gh-mkeller
Copy link
Collaborator

I cannot even reproduce the correct behavior...
For the following test:

def test_timestamp(conn_cnx):
    with conn_cnx() as conn, conn.cursor() as cur:
        df = cur.execute(
            "SELECT '2021-01-01 00:00:00'::timestamp_tz AS TIME",
            _statement_params={
                "TIMEZONE": "America/Los_Angeles",
            },
        ).fetch_pandas_all()
        ...

I get df.dtypes[0] = datetime64[ns, UTC]

@sfc-gh-mkeller
Copy link
Collaborator

sfc-gh-mkeller commented Jan 17, 2024

Okay, so the following test seems to work:

def test_timestamp(conn_cnx):
    with conn_cnx(timezone="America/Los_Angeles") as conn, conn.cursor() as cur:
        df = cur.execute(
            "SELECT '2021-01-01 00:00:00'::timestamp_tz AS TIME",
        ).fetch_pandas_all()
        assert str(df.dtypes["TIME"].tz) == "America/Los_Angeles"

@sfc-gh-mkeller
Copy link
Collaborator

This will require a back-end change it seems unfortunately, so it will take us a while to fix.
This the response the client gets:

{
    'parameters': [{...}, ...],
    'rowtype': [{'name': 'TIME', 'database': '', 'schema': '', 'table': 'SYS$_VIEW_15', 'byteLength': None, 'length': None, 'type': 'timestamp_tz', 'scale': 9, 'precision': 0, 'collation': None, 'nullable': False}],
    'rowsetBase64': '',
    'total': 0,
    'returned': 0,
    'queryId': '...',
    'databaseProvider': None,
    'finalDatabaseName': '...',
    'finalSchemaName': '...',
    'finalWarehouseName': '...',
    'finalRoleName': '...',
    'numberOfBinds': 0,
    'arrayBindSupported': False,
    'statementTypeId': 4096,
    ...
}

For anyone trying to repro this, I set a breakpoint at this line. Note that the rowtype does not carry timezone info and and rowsetBase64 is empty (this is the binary Arrow representation of the data, so that doesn't have any timezone info in it either.

@sfc-gh-mkeller
Copy link
Collaborator

That being said I'm not too sure how this is going to work. As far as I understand in pandas the whole column has the same timezone. This doesn't hold with TIMESTAMP_TZ columns in Snowflake, they can all have different timezone and they just all need to have a timezone attached to them.

@sfc-gh-dszmolka sfc-gh-dszmolka added the status-in_progress Issue is worked on by the driver team label Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug status-in_progress Issue is worked on by the driver team triaged
Projects
None yet
Development

No branches or pull requests

5 participants