
SNOW-1362666: Multiple calls to write_pandas with the same global random seed can result in stage name collisions #1937

Open
scottfleming opened this issue May 3, 2024 · 2 comments
Assignees
Labels
status-information_needed Additional information is required from the reporter status-triage Issue is under initial triage

Comments


scottfleming commented May 3, 2024

Python version

Python 3.11.6 (main, Oct 8 2023, 05:06:43) [GCC 13.2.0]

Operating system and processor architecture

Linux-6.6.16-linuxkit-aarch64-with-glibc2.38

Installed packages

snowflake-connector-python==3.5.0

What did you do?

I call write_pandas in a loop, but there's another function from a different package in that same loop that resets the global seed to a specific state (I recognize this is poor form and am working on a fix for that separately).

The problem is that snowflake-connector-python's _create_temp_stage function calls random_string to generate temporary stage names, and the user has no control over this.

random_string in turn relies on the random module:

random_part = "".join([random.choice(choices) for _ in range(length)])

So if the user deterministically resets the seed outside the scope of calling write_pandas, the generated stage names repeat, and the collision surfaces as an opaque error such as the following:

snowflake.connector.errors.ProgrammingError: 002002 (42710): 01b41633-0305-1f43-0030-3596065ff54f: SQL compilation error:
Object 'USERDB.USERSCHEMA."ziwewnsovq"' already exists.
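The mechanism can be reproduced in isolation (a minimal sketch; `random_string` below mimics the connector's helper rather than importing it, and the 10-character lowercase alphabet is an assumption for illustration):

```python
import random
import string

def random_string(length: int = 10, choices: str = string.ascii_lowercase) -> str:
    # Mimics the connector's helper: draws from the *global* random module,
    # so it is sensitive to any random.seed() call made elsewhere.
    return "".join(random.choice(choices) for _ in range(length))

# Resetting the global seed before each call yields identical "random" names.
random.seed(42)
first = random_string()
random.seed(42)
second = random_string()
print(first == second)  # the two stage-name suffixes collide
```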

which triggers on line 81 in _do_create_temp_stage:

create_stage_sql = f"CREATE TEMP STAGE /* Python:snowflake.connector.pandas_tools.write_pandas() */ {stage_location} FILE_FORMAT=(TYPE=PARQUET COMPRESSION={compression}{' BINARY_AS_TEXT=FALSE' if auto_create_table or overwrite else ''})"

What did you expect to see?

Instead, I would have expected one of the following:

  1. An approach for random generation of stage names that does not rely on the global random seed, such as uuid
  2. A more informative error message
  3. Automatic handling of collisions, e.g. by deleting or overwriting an existing stage with the same name, with optional user control over which action to take (the user may actually want to error out in these cases for whatever reason)
  4. A way for the user to directly override the randomly generated stage_name
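Until one of these lands, callers can shield write_pandas from the external seed reset by saving and restoring the global random state around each call (a sketch under that assumption; the write_pandas usage shown in the comment is hypothetical):

```python
import contextlib
import os
import random

@contextlib.contextmanager
def fresh_random_state():
    # Save the caller's (possibly deterministically seeded) global state,
    # reseed from OS entropy so stage-name suffixes stay unique,
    # then restore the saved state on exit.
    saved = random.getstate()
    random.seed(os.urandom(16))
    try:
        yield
    finally:
        random.setstate(saved)

# Hypothetical usage around each upload:
# with fresh_random_state():
#     write_pandas(conn, df, "MY_TABLE")

random.seed(42)
with fresh_random_state():
    inside = random.random()  # drawn from fresh entropy, not seed 42
after = random.random()       # first draw of the restored seed-42 state
random.seed(42)
assert after == random.random()
```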

Can you set logging to DEBUG and collect the logs?

No response

@github-actions github-actions bot changed the title Multiple calls to write_pandas with the same global random seed can result in stage name collisions SNOW-1362666: Multiple calls to write_pandas with the same global random seed can result in stage name collisions May 3, 2024
@sfc-gh-sghosh sfc-gh-sghosh self-assigned this May 6, 2024
@sfc-gh-sghosh sfc-gh-sghosh added status-triage Issue is under initial triage and removed bug needs triage labels May 6, 2024
@sfc-gh-sghosh

Hello @scottfleming ,

Thanks for raising the issue, we are looking into it, will update.

Regards,
Sujan

@sfc-gh-sghosh

sfc-gh-sghosh commented May 13, 2024

Hello @scottfleming ,

I tried to reproduce the issue with Python connector 3.6.0 and Python 3.11, but it did not throw any collision with stage names.

Could you share a code snippet that reproduces the issue?

```python
import random
import snowflake.connector
import logging

for logger_name in ['snowflake.connector', 'botocore', 'boto3']:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.FileHandler('/Users/sghosh/IdeaProjects/Python_prj/tmp/seed.log')
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)

def reset_seed():
    random.seed(42)

def another_function():
    reset_seed()  # Reset seed before calling write_pandas

def simulate_data_processing(connection):
    cursor = connection.cursor()
    try:
        cursor.execute("CREATE OR REPLACE TABLE my_table (column1 INT, column2 STRING)")
        data = [(1, 'a'), (2, 'b'), (3, 'c')]

        for row in data:
            cursor.execute("INSERT INTO my_table VALUES (%s, %s)", row)

        connection.commit()  # Commit the transaction

        # Fetch values from the table
        cursor.execute("SELECT * FROM my_table")
        rows = cursor.fetchall()
        print("Values from the table:")
        for row in rows:
            print(row)

    finally:
        cursor.close()

# Main function
def main():
    random.seed(123)
    # Connect to Snowflake
    conn = snowflake.connector.connect(
        user='xxx',
        password='xxxx',
        account='xxxx',
        warehouse='SUJAN_WH',
        database='SAMPLEDATABASE',
        schema='TEST'
    )

    for i in range(5):  # Repeat the process multiple times
        another_function()              # Call another function that resets the seed
        simulate_data_processing(conn)  # Simulate data processing

    conn.close()  # Close the Snowflake connection

if __name__ == "__main__":
    main()
```

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added the status-information_needed Additional information is required from the reporter label May 13, 2024