Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for AWS Glue DynamicFrame #781

Open
catworlddomination opened this issue Jan 18, 2024 · 2 comments
Open

Support for AWS Glue DynamicFrame #781

catworlddomination opened this issue Jan 18, 2024 · 2 comments
Labels

Comments

@catworlddomination
Copy link

catworlddomination commented Jan 18, 2024

When adding the Spline agent bundle to an AWS Glue Python script (Spark 3.3, Python 3), lineage is produced when using the standard patterns like df = spark.read.csv(file_path, header=True, inferSchema=True) and df.write... as expected.

However, AWS Glue does have a concept of Dynamic Frames usage of which which looks something like

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Amazon S3
f0 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": -1,
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={
	## Need to replace with S3 path with file name
        "paths": ["s3://...."],
        "recurse": True,
    },
    transformation_ctx="f0",
)

# Script generated for node Amazon S3
f1 = glueContext.write_dynamic_frame.from_options(
    frame=f0,
    connection_type="s3",
    format="csv",
    connection_options={
	## Need to replace with S3 path
        "path": "s3://....",
        "partitionKeys": [],
    },
    transformation_ctx="f1",
)

job.commit()

Can Spline support this dynamic frame pattern in AWS Glue? I used the spark-3.3-spline-agent-bundle_2.12-2.0.0.jar bundle - Spline agent initialized successfully, but could not produce lineage.

@wajda
Copy link
Contributor

wajda commented Jan 20, 2024

I didn't try it specifically, but from the AWS doc on the DynamicFrame there is a chance that it would not work. The crucial thing for Spline agent is the existence of the internal Spark write event that the agent can intercept and grab the execution plan from it. That only exists in the Spark SQL API, meaning the DataFrame. For instance RDD lineage isn't supported because of that very reason - Spark doesn't provide any usable (for lineage purposes) logical plan on RDD operations. I don't know how exactly the DynamicFrame is implemented (it's closed source), so it's unclear if DynamicFrame operations eventually translate to DataFrame ones or not. If they don't, Spline don't have ability to track them.

@wajda wajda added the feature label Jan 20, 2024
@wajda wajda changed the title Can Spline support lineage for AWS glue Spark dynamic frames? Support for AWS Glue DynamicFrame Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: New
Development

No branches or pull requests

2 participants