
Spline Support of Statement Create table as select #784

Open · zacayd opened this issue Jan 28, 2024 · 11 comments
Labels: bug (Something isn't working), dependency: Spark 3.0+

zacayd commented Jan 28, 2024

I have a statement in a Databricks notebook:

CREATE TABLE lineage_data.lineagedemo.dinner_1 AS
SELECT
  recipe_id,
  concat(app, " + ", main, " + ", dessert) AS full_menu
FROM
  lineage_data.lineagedemo.menu

The lineage is not shown; it shows only on INSERT INTO. Is this by design?
I use version spark_3_3_spline_agent_bundle_2_12_2_1_0_SNAPSHOT.jar.
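
For comparison, a sketch of the INSERT INTO form that does show lineage (hypothetical, assuming the target table already exists):

# Hypothetical comparison: the same statement written as INSERT INTO, which the
# reporter says does produce lineage (table and column names from the issue)
spark.sql("""
    INSERT INTO lineage_data.lineagedemo.dinner_1
    SELECT recipe_id, concat(app, " + ", main, " + ", dessert) AS full_menu
    FROM lineage_data.lineagedemo.menu
""")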

wajda (Contributor) commented Jan 30, 2024

It should be supported.
Is there any error in the logs?
Try USING DELTA (not sure if it will make any difference, but that's how we use it).

zacayd (Author) commented Jan 31, 2024

Hi, I have run this code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assuming SparkSession is already available as 'spark' in Databricks
# spark = SparkSession.builder.appName("example").getOrCreate()

# Define the schema for the table
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Create a DataFrame with the schema
data = [(1, "Alice", 30), (2, "Bob", 35), (3, "Charlie", 40)]
df = spark.createDataFrame(data, schema)

# Define the table name and path (replace with your desired table name and path)
silver_table_name = "silver_table_name1"
path = "/mnt/delta/" + silver_table_name

# Write the DataFrame to the table
df.write.format("delta").mode("overwrite").save(path)

# Register the table in the metastore
spark.sql(f"CREATE TABLE {silver_table_name} USING DELTA LOCATION '{path}'")

# Create a view over the silver layer table (a plain CREATE VIEW, despite the "materialized" naming)
materialized_view_name = "your_materialized_view"
create_mv_sql = f"""
CREATE VIEW {materialized_view_name}
AS SELECT * FROM {silver_table_name}
"""
spark.sql(create_mv_sql)

# Optionally, verify the contents of the table and materialized view
spark.sql(f"SELECT * FROM {silver_table_name}").show()
spark.sql(f"SELECT * FROM {materialized_view_name}").show()

According to the execution plan I got in Spline, it shows only the target /mnt/delta/silver_table_name1 with its columns.
It seems the CREATE VIEW statement is not registered in the lineage.

zacayd (Author) commented Jan 31, 2024

It should be supported. Is there any error in the logs? Try USING DELTA (not sure if it will make any difference, but that's how we use it).

@wajda, also, when I added USING DELTA it didn't solve it:

-- Drop the table 'Accounts1' if it exists
DROP TABLE IF EXISTS lineage_data.lineagedemo.Accounts1;
-- Drop the table 'V_Accounts1' if it exists
DROP TABLE IF EXISTS lineage_data.lineagedemo.V_Accounts1;

-- Creating the table 'Accounts1' using Delta format
CREATE TABLE lineage_data.lineagedemo.Accounts1(
    Id INT,
    Name STRING
) USING DELTA;

-- Creating the table 'V_Accounts1' from the 'Accounts1' table via CTAS, using Delta format
CREATE TABLE lineage_data.lineagedemo.V_Accounts1
USING DELTA AS
SELECT *
FROM lineage_data.lineagedemo.Accounts1;

zacayd (Author) commented Jan 31, 2024

@wajda, if I add USING DELTA it also works for CREATE TABLE AS SELECT:

-- Drop the table 'Accounts1' if it exists
DROP TABLE IF EXISTS Accounts1;
-- Drop the table 'V_Accounts1' if it exists
DROP TABLE IF EXISTS V_Accounts1;

-- Creating the table 'Accounts1' using Delta format
CREATE TABLE Accounts1(
    Id INT,
    Name STRING
) USING DELTA;

-- Creating the table 'V_Accounts1' from the 'Accounts1' table via CTAS, using Delta format
CREATE TABLE V_Accounts1
USING DELTA AS
SELECT *
FROM Accounts1;

wajda (Contributor) commented Jan 31, 2024

It might be that CREATE VIEW is either not supported by Spline, or is not trackable at all in Spark. A more thorough investigation is needed. Try enabling DEBUG logging in Spark and see if there is a message like

XXX was not recognized as a write-command

or Write extraction failed, followed by an object structure dump.

Another small thing to try is to enable capturing of non-persistent actions like this:

spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true

If you still don't see any mention of your VIEW anywhere (in logs, errors, or lineage), then it's very likely that Spark simply does not provide information about that action in the first place. Please try it and let us know what you find.
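
For reference, Spline agent options like this are commonly passed as Spark configuration entries with a spark. prefix (an assumption about the setup; adjust to however you configure the agent), set before lineage tracking is enabled:

# Assumption: the Spline agent reads its options from the Spark conf under a
# "spark." prefix; set the toggle before calling enableLineageTracking
spark.conf.set(
    "spark.spline.plugins.za.co.absa.spline.harvester.plugin.embedded."
    "NonPersistentActionsCapturePlugin.enabled",
    "true",
)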

zacayd (Author) commented Feb 1, 2024

I will try
spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true
But how can I enable DEBUG logging on Databricks?

zacayd (Author) commented Feb 1, 2024

I have added spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true to the Advanced Settings.
It caused additional execution plans to be captured, but none of them showed the view and its columns.
@wajda, do you know how I can enable DEBUG logging in Databricks?

zacayd (Author) commented Feb 1, 2024

By the way, Databricks Unity Catalog supports CREATE VIEW AS SELECT * FROM table.

wajda (Contributor) commented Feb 1, 2024

Do you know how I can enable DEBUG logging in Databricks?

spark.sparkContext.setLogLevel("DEBUG")
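
Since full DEBUG output on a Databricks driver is very noisy, a narrower alternative is raising the level only for the Spline agent's loggers (a sketch, assuming Spark's log4j 1.x compatibility API is reachable from PySpark):

# Sketch: enable DEBUG only for the Spline agent's namespace instead of the
# whole driver; assumes Spark ships the log4j 1.x bridge API on the classpath
log4j = spark._jvm.org.apache.log4j
log4j.LogManager.getLogger("za.co.absa.spline").setLevel(log4j.Level.DEBUG)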

zacayd (Author) commented Feb 1, 2024

I put this at the top of my Databricks notebook:

%python
spark.sparkContext.setLogLevel("DEBUG")
%python
sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)
%sql

-- Drop the table 'Accounts1' if it exists
DROP TABLE IF EXISTS Accounts1;

-- Creating the table 'Accounts1' using Delta format
CREATE TABLE Accounts1(
    Id INT,
    Name STRING
) USING DELTA;


-- Creating or replacing the view 'V_Accounts1' based on the 'Accounts1' table
CREATE OR REPLACE view V_Accounts1
 AS
SELECT *
FROM Accounts1;

But I don't see any error in the driver logs.

uday1409 (Contributor) commented

@wajda I faced a similar issue and was able to see the plan. Although I don't have the details right now of what Spline generates for CTAS, I identified that the Spark change below caused this issue. Although this change applies starting with Spark 3.3, Databricks usually ports changes back to older versions as well; this was also true for the ListQuery expression I fixed earlier in Spline.

The logical plan class needs to be extracted correctly for Spline to work:

apache/spark@aa1b16b#diff-594addaa7ed3dd43fd0dad5fa81ade5ec2e570adf7c982c0570e3358a08a8e0a
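
To check which logical plan class a given runtime actually builds for a CTAS (the thing the agent has to recognize), a quick probe along these lines may help. This is a debugging sketch only, with a hypothetical table name; the printed class names depend on the Spark/Databricks version:

# Debugging sketch: print the logical plan node class that Spark builds for a
# CTAS, to compare against what the Spline agent expects
qe = spark.sql(
    "CREATE TABLE ctas_probe USING DELTA AS SELECT 1 AS id"
)._jdf.queryExecution()
print(qe.logical().getClass().getName())  # exact class varies by Spark/Databricks version
print(qe.logical().treeString())          # full tree, including the SELECT source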

wajda added the bug (Something isn't working) and dependency: Spark 3.0+ labels on Feb 26, 2024