
SNOW-1362666: Multiple calls to write_pandas with the same global random seed can result in stage name collisions #1937

Open
scottfleming opened this issue May 3, 2024 · 2 comments
Assignees
Labels
status-information_needed Additional information is required from the reporter status-triage Issue is under initial triage

Comments


scottfleming commented May 3, 2024

Python version

Python 3.11.6 (main, Oct 8 2023, 05:06:43) [GCC 13.2.0]

Operating system and processor architecture

Linux-6.6.16-linuxkit-aarch64-with-glibc2.38

Installed packages

snowflake-connector-python==3.5.0

What did you do?

I call write_pandas in a loop, but there's another function from a different package in that same loop that resets the global seed to a specific state (I recognize this is poor form and am working on a fix for that separately).

The problem is that snowflake-connector-python's _create_temp_stage function calls random_string to generate temporary stage names, and the user has no control over this.

random_string in turn relies on the random module:

random_part = "".join([random.choice(choices) for _ in range(length)])

So if the user deterministically resets the seed outside the scope of calling write_pandas, the generated stage names repeat, and the collision surfaces as an opaque error such as the following:

snowflake.connector.errors.ProgrammingError: 002002 (42710): 01b41633-0305-1f43-0030-3596065ff54f: SQL compilation error:
Object 'USERDB.USERSCHEMA."ziwewnsovq"' already exists.
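The mechanism can be reproduced in isolation (a minimal sketch; `random_string` below mimics the connector's helper rather than importing it, and the 10-character lowercase alphabet is an assumption for illustration):

```python
import random
import string

def random_string(length: int = 10, choices: str = string.ascii_lowercase) -> str:
    # Mimics the connector's helper: draws from the *global* random module,
    # so it is sensitive to any random.seed() call made elsewhere.
    return "".join(random.choice(choices) for _ in range(length))

# Resetting the global seed before each call yields identical "random" names.
random.seed(42)
first = random_string()
random.seed(42)
second = random_string()
print(first == second)  # the two stage-name suffixes collide
```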

which triggers on line 81 in _do_create_temp_stage:

create_stage_sql = f"CREATE TEMP STAGE /* Python:snowflake.connector.pandas_tools.write_pandas() */ {stage_location} FILE_FORMAT=(TYPE=PARQUET COMPRESSION={compression}{' BINARY_AS_TEXT=FALSE' if auto_create_table or overwrite else ''})"

What did you expect to see?

Instead, I would have expected one of the following:

  1. An approach for random generation of stage names that does not rely on the global random seed, such as uuid
  2. A more informative error message
  3. Automatic handling of collisions, e.g. by deleting or overwriting an existing stage with the same name, with optional user control over which action to take (the user may actually want to error out in these cases for whatever reason)
  4. A way for the user to directly override the randomly generated stage_name
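Until one of these lands, callers can shield write_pandas from the external seed reset by saving and restoring the global random state around each call (a sketch under that assumption; the write_pandas usage shown in the comment is hypothetical):

```python
import contextlib
import os
import random

@contextlib.contextmanager
def fresh_random_state():
    # Save the caller's (possibly deterministically seeded) global state,
    # reseed from OS entropy so stage-name suffixes stay unique,
    # then restore the saved state on exit.
    saved = random.getstate()
    random.seed(os.urandom(16))
    try:
        yield
    finally:
        random.setstate(saved)

# Hypothetical usage around each upload:
# with fresh_random_state():
#     write_pandas(conn, df, "MY_TABLE")

random.seed(42)
with fresh_random_state():
    inside = random.random()  # drawn from fresh entropy, not seed 42
after = random.random()       # first draw of the restored seed-42 state
random.seed(42)
assert after == random.random()
```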

Can you set logging to DEBUG and collect the logs?

No response

@github-actions github-actions bot changed the title Multiple calls to write_pandas with the same global random seed can result in stage name collisions SNOW-1362666: Multiple calls to write_pandas with the same global random seed can result in stage name collisions May 3, 2024
@sfc-gh-sghosh sfc-gh-sghosh self-assigned this May 6, 2024
@sfc-gh-sghosh sfc-gh-sghosh added status-triage Issue is under initial triage and removed bug needs triage labels May 6, 2024
@sfc-gh-sghosh

Hello @scottfleming ,

Thanks for raising the issue, we are looking into it, will update.

Regards,
Sujan

@sfc-gh-sghosh

sfc-gh-sghosh commented May 13, 2024

Hello @scottfleming ,

I tried to reproduce the issue with Python connector 3.6.0 and Python 3.11, but it did not throw any collision with stage names.

Could you share a code snippet that reproduces the issue?

```python
import random
import snowflake.connector
import logging

for logger_name in ['snowflake.connector', 'botocore', 'boto3']:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.FileHandler('/Users/sghosh/IdeaProjects/Python_prj/tmp/seed.log')
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)

def reset_seed():
    random.seed(42)

def another_function():
    reset_seed()  # Reset seed before calling write_pandas

def simulate_data_processing(connection):
    cursor = connection.cursor()
    try:
        cursor.execute("CREATE OR REPLACE TABLE my_table (column1 INT, column2 STRING)")
        data = [(1, 'a'), (2, 'b'), (3, 'c')]

        for row in data:
            cursor.execute("INSERT INTO my_table VALUES (%s, %s)", row)

        connection.commit()  # Commit the transaction

        # Fetch values from the table
        cursor.execute("SELECT * FROM my_table")
        rows = cursor.fetchall()
        print("Values from the table:")
        for row in rows:
            print(row)

    finally:
        cursor.close()

# Main function
def main():
    random.seed(123)
    # Connect to Snowflake
    conn = snowflake.connector.connect(
        user='xxx',
        password='xxxx',
        account='xxxx',
        warehouse='SUJAN_WH',
        database='SAMPLEDATABASE',
        schema='TEST'
    )

    for i in range(5):  # Repeat the process multiple times
        another_function()              # Call another function that resets the seed
        simulate_data_processing(conn)  # Simulate data processing

    conn.close()  # Close the Snowflake connection

if __name__ == "__main__":
    main()
```

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added the status-information_needed Additional information is required from the reporter label May 13, 2024