
SNOW-1044042: Memory leak after interrupting a SELECT query mid-way while it's fetching data #1875

Open
the21st opened this issue Feb 6, 2024 · 2 comments

Comments

@the21st

the21st commented Feb 6, 2024

Python version

3.10.12

Operating system and processor architecture

Linux-6.1.58+-x86_64-with-glibc2.35

Installed packages

absl-py==1.3.0
agate==1.6.3
alembic==1.11.1
altair==4.2.2
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asn1crypto==1.5.1
asttokens==2.0.8
astunparse==1.6.3
attrs==22.1.0
Automat==20.2.0
Babel==2.10.3
backcall==0.2.0
bcrypt==4.0.1
beautifulsoup4==4.11.1
bleach==5.0.1
blis==0.7.9
boto3==1.24.95
botocore==1.27.95
branca==0.5.0
cachetools==5.2.0
catalogue==2.0.8
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
click-plugins==1.1.1
clickhouse-driver==0.2.4
clickhouse-sqlalchemy==0.2.2
cligj==0.7.2
confection==0.0.3
constantly==15.1.0
contourpy==1.0.5
cryptography==36.0.2
cssselect==1.1.0
cycler==0.11.0
cymem==2.0.7
datascience==0.17.5
db-dtypes==1.2.0
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.5.1
dnspython==2.2.1
duckdb==0.9.2
duckdb-engine==0.7.0
entrypoints==0.4
executing==1.1.1
fastjsonschema==2.16.2
filelock==3.8.0
Fiona==1.8.22
flatbuffers==22.9.24
folium==0.13.0
fonttools==4.37.4
future==0.18.2
gast==0.4.0
geopandas==0.11.1
google-api-core==2.11.1
google-auth==2.21.0
google-auth-oauthlib==0.4.6
google-cloud-bigquery==3.3.5
google-cloud-bigquery-storage==2.16.2
google-cloud-core==2.3.2
google-cloud-spanner==3.36.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.4.0
googleapis-common-protos==1.56.4
greenlet==1.1.3.post0
grpc-google-iam-v1==0.12.6
grpcio==1.50.0
grpcio-status==1.48.2
h5py==3.7.0
hyperlink==21.0.0
idna==3.4
importlib-metadata==5.0.0
incremental==22.10.0
ipykernel==6.16.1
ipython==8.5.0
ipython-genutils==0.2.0
isodate==0.6.1
itemadapter==0.7.0
itemloaders==1.0.6
jedi==0.17.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
jsonify==0.5
jsonschema==3.2.0
jupyter-server==1.21.0
jupyter_client==7.4.3
jupyter_core==4.11.2
jupyterlab-pygments==0.2.2
keras==2.10.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
langcodes==3.3.0
leather==0.3.4
libclang==14.0.6
lxml==4.9.1
Mako==1.2.4
Markdown==3.4.1
MarkupSafe==2.0.0
matplotlib==3.6.0
matplotlib-inline==0.1.6
memory-profiler==0.61.0
mistune==0.8.4
mpmath==1.2.1
munch==2.5.0
murmurhash==1.0.9
nbclassic==1.0.0
nbclient==0.5.13
nbconvert==6.4.5
nbformat==5.7.0
nest-asyncio==1.5.6
nltk==3.7
notebook==6.5.5
notebook_shim==0.2.3
numpy==1.23.4
oauthlib==3.2.2
opt-einsum==3.3.0
oscrypto==1.3.0
packaging==21.3
pandas==1.2.5
pandocfilters==1.5.0
paramiko==3.0.0
parsedatetime==2.4
parsel==1.6.0
parso==0.7.1
pathy==0.6.2
pexpect==4.8.0
pg8000==1.29.6
pickleshare==0.7.5
Pillow==9.2.0
plotly==5.10.0
pluggy==1.0.0
preshed==3.0.8
prometheus-client==0.15.0
prompt-toolkit==3.0.31
Protego==0.2.1
proto-plus==1.22.1
protobuf==3.19.6
psutil==5.9.3
psycopg2-binary==2.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyathena==2.14.0
pycparser==2.21
pycryptodomex==3.15.0
pydantic==1.10.2
PyDispatcher==2.0.6
Pygments==2.13.0
PyJWT==2.6.0
pymongo==4.3.2
pymssql==2.2.5
PyMySQL==1.0.2
PyNaCl==1.5.0
pyOpenSSL==22.0.0
pyparsing==3.0.9
pyproj==3.4.0
pyrsistent==0.18.1
python-dateutil==2.8.2
python-jsonrpc-server==0.4.0
python-language-server @ git+https://github.com/deepnote/python-language-server.git@e6015dbebee8db15510d3c7412dbcfe04c34dd16
python-slugify==6.1.2
pytimeparse==1.1.8
pytz==2022.5
pytz-deprecation-shim==0.1.0.post0
pyzmq==24.0.1
queuelib==1.6.2
regex==2022.9.13
requests==2.28.1
requests-file==1.5.1
requests-oauthlib==1.3.1
rsa==4.9
s3transfer==0.6.0
scikit-learn==1.1.2
scipy==1.9.3
scramp==1.4.4
Scrapy==2.7.0
seaborn==0.12.1
Send2Trash==1.8.0
sentry-sdk==1.24.0
service-identity==21.1.0
Shapely==1.8.5.post1
six==1.16.0
smart-open==5.2.1
sniffio==1.3.0
snowflake-connector-python==2.8.0
snowflake-sqlalchemy==1.4.3
soupsieve==2.3.2.post1
spacy==3.4.2
spacy-legacy==3.0.10
spacy-loggers==1.0.3
sql-metadata==2.6.0
SQLAlchemy==1.4.42
sqlalchemy-bigquery==1.6.1
sqlalchemy-dremio==3.0.3
sqlalchemy-redshift==0.8.11
sqlalchemy-spanner==1.6.2
sqlparse==0.4.3
squarify==0.4.3
srsly==2.4.5
sshtunnel==0.4.0
stack-data==0.5.1
sympy==1.11.1
tabulate==0.9.0
tenacity==8.1.0
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.0
tensorflow-estimator==2.10.0
tensorflow-io-gcs-filesystem==0.27.0
termcolor==2.0.1
terminado==0.16.0
testpath==0.6.0
text-unidecode==1.3
textblob==0.17.1
thinc==8.1.5
threadpoolctl==3.1.0
tldextract==3.4.0
toolz==0.12.0
torch==1.12.1
torchvision==0.13.1
tornado==6.2
tqdm==4.64.1
traitlets==5.5.0
trino==0.322.0
Twisted==22.8.0
typer==0.4.2
typing_extensions==4.4.0
tzdata==2022.5
tzlocal==4.2
ujson==5.5.0
urllib3==1.26.12
w3lib==2.0.1
wasabi==0.10.1
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.1
Werkzeug==2.2.2
wrapt==1.14.1
zipp==3.9.0
zope.interface==5.5.0

What did you do?

I put this code into a Jupyter code cell:


import snowflake.connector
import os

conn = snowflake.connector.connect(
    user=os.environ['SNOWFLAKE_SANDBOX_USERNAME'],
    password=os.environ['SNOWFLAKE_SANDBOX_PASSWORD'],
    account=os.environ['SNOWFLAKE_SANDBOX_ACCOUNTNAME'],
)

try:
    result = conn.cursor().execute("""
        SELECT * FROM SNOWFLAKE_SAMPLE_DATA."TPCDS_SF100TCL".CATALOG_RETURNS LIMIT 10000000
    """).fetchall()
    print(result)
finally:
    conn.close()


Then I ran it. Midway through, while RAM usage was already climbing rapidly, I interrupted the cell.

I then checked RAM usage in the system monitor, and verified with `ps -aux` that the Python process running this code was the culprit: it was still consuming a significant amount of memory.
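A minimal sketch of how the retained memory can be confirmed from inside the notebook (using psutil, which is already in the environment above; the placeholder comment marks where the query would be run and interrupted):

import gc
import psutil

proc = psutil.Process()

def rss_mib() -> float:
    # resident set size of the current Python process, in MiB
    return proc.memory_info().rss / (1024 * 1024)

print(f"RSS before query: {rss_mib():.1f} MiB")
# ... run the query here and interrupt it mid-fetch ...
gc.collect()
print(f"RSS after interrupt and gc.collect(): {rss_mib():.1f} MiB")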

What did you expect to see?

I expected the RAM usage to go back down to the level where it was before running the SQL query.

Can you set logging to DEBUG and collect the logs?

Full logs here: https://gist.github.com/the21st/2ce87fb574ef1f4115a69b204dc3fa65
@github-actions github-actions bot changed the title Memory leak after interrupting a SELECT query mid-way while it's fetching data SNOW-1044042: Memory leak after interrupting a SELECT query mid-way while it's fetching data Feb 6, 2024
@sfc-gh-aling
Collaborator

hi @the21st, thanks for reaching out.

I believe the huge memory consumption comes from the program trying to fetch all of the data at once via fetchall().

When this method is called, the connector downloads all of the result data, appends it to an internal list, and only returns once everything has been downloaded, so the memory usage accumulates internally.

To retrieve such a large amount of data, I would recommend fetching it chunk by chunk:

result = conn.cursor().execute("""
    SELECT * FROM SNOWFLAKE_SAMPLE_DATA."TPCDS_SF100TCL".CATALOG_RETURNS LIMIT 10000000
""")
chunk_size = 1000
ret = result.fetchmany(chunk_size)
while ret:
    # process the current chunk of rows here before fetching the next one
    ret = result.fetchmany(chunk_size)
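
a related sketch, in case it fits your use case better: as far as I know the cursor object is also iterable, so you can stream over the rows directly without ever materializing the full result list:

cur = conn.cursor()
cur.execute("""
    SELECT * FROM SNOWFLAKE_SAMPLE_DATA."TPCDS_SF100TCL".CATALOG_RETURNS LIMIT 10000000
""")
for row in cur:
    # each row is a tuple; process it and let it go out of scope
    pass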

@the21st
Author

the21st commented Feb 13, 2024

@sfc-gh-aling Thanks for the reply!

Unfortunately, I can't switch the code to batched fetching, because I'm using pandas' read_sql_query with the Snowflake connector, and that is most likely implemented via fetchall().

Is there any way to make sure the internal list you mention becomes garbage-collectable after an interrupt? Thank you
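
(For context, if changing the call site were an option after all, my understanding is that the connector's fetch_pandas_batches() yields DataFrames chunk by chunk instead of one big list, assuming the pandas/pyarrow extras are installed. The replacement for the read_sql_query call would look roughly like this sketch:)

cur = conn.cursor()
cur.execute("""
    SELECT * FROM SNOWFLAKE_SAMPLE_DATA."TPCDS_SF100TCL".CATALOG_RETURNS LIMIT 10000000
""")

total_rows = 0
for batch_df in cur.fetch_pandas_batches():
    # batch_df is a pandas DataFrame holding one chunk of the result;
    # process or reduce it here so only one chunk stays resident at a time
    total_rows += len(batch_df)

print("rows seen:", total_rows)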
