
Spark 3.4 support #705

Open
wajda opened this issue Jun 16, 2023 · 13 comments · May be fixed by #739
Comments

@wajda
Contributor

wajda commented Jun 16, 2023

There seem to be binary-incompatible changes in the APIs of Delta and Spark SQL that the Spline core compiled against Spark 2.4 cannot work with, e.g. RDDPlugin.

Todo:

  1. Check how many places in the code are actually incompatible and whether there is a quick, simple way to fix them (a rough sketch of one such direction follows this list).
  2. Otherwise, we'll probably need to move toward Spark-version-specific builds (Change the POM and CD configs to make agent-core Spark specific #604).
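
As a rough illustration of the first option (a minimal sketch with hypothetical names, not something the agent actually ships): where the code still links against both Spark versions, behaviour can be branched on the runtime version instead of in separate builds. Truly binary-incompatible call sites still need separate compilation, i.e. option 2.

import org.apache.spark.sql.SparkSession

// Hypothetical helper: parse the running Spark version so that a plugin
// (e.g. something like RDDPlugin) can pick a compatible code path.
object SparkVersionCompat {
  def majorMinor(spark: SparkSession): (Int, Int) = {
    val Array(major, minor, _*) = spark.version.split("\\.")
    (major.toInt, minor.toInt)
  }

  def isAtLeast(spark: SparkSession, major: Int, minor: Int): Boolean = {
    val (maj, min) = majorMinor(spark)
    maj > major || (maj == major && min >= minor)
  }
}

// usage: if (SparkVersionCompat.isAtLeast(spark, 3, 4)) newPath() else legacyPath()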
@wajda wajda added this to the 1.2.0 milestone Jun 16, 2023
@wajda wajda removed this from the 1.2.0 milestone Jun 16, 2023
wajda added a commit that referenced this issue Jun 19, 2023
@wajda wajda added this to the 2.1.0 milestone Aug 5, 2023
@cerveada
Contributor

Regarding Delta: As I understand it, they dropped support for using Delta without DSV2, which means Spark users will need to migrate to DSV2 Delta, but for us, it's no problem. All DeltaDSV2Spec tests are still passing.
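
For readers less familiar with the distinction, here is a minimal sketch of the two write styles (assuming a Delta-enabled local session; the paths, table name, and exact DSV2 routing depend on the Delta and Spark versions in use):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val df = spark.range(5).toDF("id")

// older path-based write:
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

// DSV2-style write through the catalog (roughly what a DSV2-based spec exercises):
df.writeTo("events_v2").using("delta").createOrReplace()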

cerveada added a commit that referenced this issue Aug 30, 2023
@cerveada
Contributor

cerveada commented Aug 30, 2023

Integration tests not working:

  • DeltaSpec - not an issue, just the DSV2 Delta path works now
  • BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs" - seems like some optimization causes the lineage to take the data directly from the source instead of the intermediate table.
  • RddSpec - fixed by changing the plugin
  • KafkaSinkSpec - not fixed yet

Unit tests not working:

  • some are failing as well, this needs further investigation

@cerveada cerveada linked a pull request Aug 31, 2023 that will close this issue
@Neelotpaul

Neelotpaul commented Oct 10, 2023

Hi @wajda @cerveada, is there any timeline for when the agent will be available?

@wajda
Contributor Author

wajda commented Oct 11, 2023

Sorry, no :( We are completely buried in other work. Upgrading to Spark 3.4 is not a priority for our organisation at the moment. If somebody from the community is up for implementing it, we would be happy to accept pull requests and release an upgrade.

@wajda wajda added the help wanted Extra attention is needed label Oct 11, 2023
@cerveada
Contributor

Some work is already done in the linked draft PR.

@Neelotpaul

Thanks @wajda @cerveada for the update. It would be great if you could notify us when the development is completed. I will also check with our development team on the feasibility of building on the pull request.

@wajda
Contributor Author

wajda commented Oct 31, 2023

Unfortunately we can't give any ETA on this due to reprioritisation and team capacity. As Adam said, some work has started in the draft PR #739. It needs to be tested; it might require adding a separate bundle module for 3.4 and potentially other fixes. If you could help with it, that would be amazing. If you have any questions, please ask.

@cerveada
Contributor

There were some questions about what needs to be done to support the new Spark version. It comes down to two things:

  • Run all tests on the new Spark version, find out where Spark changed, and fix the agent to accommodate those changes without breaking anything in the older versions.
  • Create a POM for the new Spark uber-jar (we create one for each Spark version). The uber-jar should include everything the agent needs to run while excluding anything that is already available in Spark.

You can use #459 as inspiration for this.
The current PR #739 already solves part of the issues.

@wajda
Contributor Author

wajda commented Dec 22, 2023

Create a POM for the new Spark uber-jar

Just a tip: we use https://github.com/wajda/jars-pommefizer to generate a pom.xml from a downloaded Apache Spark distribution. Then make some minor manual corrections in the POM by comparing it with another similar POM in the project.

@rycowhi

rycowhi commented Mar 7, 2024

Hey there @wajda / @cerveada - after going through the above comments, I'm not looking to make any promises until I understand the full amount of work that might need to be done.

If we can get a full build running successfully based on #739 via the command below, what else is left to do?

mvn clean test -Pspark3.4 

Edit: To add some clarity here - if all the tests pass in this profile, does that address @cerveada's concern?

run all tests on the new Spark version, find out where spark changed and fix the agent to accommodate those changes without breaking anything in the older versions.

I see

  • Create 3.4 bundle POM per the above
  • Do we need to change something in the CI as well? I see PRs mention runs on each Spark version, but it looks like a 3.4 run would be needed too.

In addition to the above, would it be possible to point us in the right direction for the "BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs"" test failure? I seem to have resolved the Kafka one already and am trying to get a start on what seems to be the larger issue.

Thanks in advance!

@cerveada
Contributor

cerveada commented Mar 7, 2024

By all tests, I meant all unit tests and also all tests in the integration-tests Maven module. If you run mvn clean test -Pspark3.4 in the root folder of the repo, it should run all of them.

We use TeamCity for CI; we can modify it ourselves when this is ready.

@cerveada
Contributor

cerveada commented Mar 7, 2024

In addition to the above, would it be possible to point us in the right direction for "BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs"" test failures? I seem to have resolved the Kafka one already and trying to get a start on what seems to be the larger issue.

The test must validate that when you write to a data source and then read from it, the URI will be identical. To simulate this, I do something like this:

val df = spark.read.load(pathA)    // read the source A
df.write.saveAsTable("B")          // I compare this URI, but now it's A, not B

val df2 = spark.read.table("B")    // and this URI

I think the issue is that Spark will now give you the URI of the original data (A), not the artificial table (B) created from it. So the test must somehow be improved or modified to test the same thing as before.

Hope it makes sense; I don't remember the actual issue, but from what I wrote here before, I think this is it.

@rycowhi

rycowhi commented Mar 8, 2024

@cerveada Thanks this is helpful.

WRT the BasicIntegrationTests issue I found something interesting.

I'm still learning my way around the codebase but I gathered I could find differences by running the test in 3.3 then in 3.4, while printing out the logical plan in LineageHarvester with a quick print statement.

Interestingly enough, they look pretty close to the same, with a few new fields added in 3.4

Here comes the fun part - my print in LineageHarvester is done twice on this test in 3.3, but four times in 3.4!

In 3.3 this makes sense - the test runs and creates two lineage events since there were two writes. There are two CreateDataSourceTableAsSelectCommand commands in the output.

In 3.4 it gets weird - the same two events from above are there, but now two additional logical plans are created as well! Each write has an additional InsertIntoHadoopFsRelationCommand that corresponds to the data being inserted. Some of the semantics around this look a little different from the regular op mappings.

The test is then failing because the lineage captor for the second write is actually getting the second event for the first write. If I ignore the second event (by calling another captor) it actually passes! I don't know if this is the right thing to do given that Spline will be firing extra events.

I ran into this same issue while fixing another test - it appears Spark is doing this for CTAS into both regular Spark tables and Hive tables.
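
For anyone who wants to reproduce this outside of Spline, a minimal sketch (the listener and table name are just for illustration; run it once on a 3.3 session and once on a 3.4 one) that logs which command classes fire per write:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Print the analyzed-plan class of every successful execution. On 3.3 a CTAS should
// show up once (CreateDataSourceTableAsSelectCommand); on 3.4 an additional
// InsertIntoHadoopFsRelationCommand execution may show up as well.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName -> ${qe.analyzed.getClass.getSimpleName}")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
})

spark.range(10).write.saveAsTable("ctas_probe")   // CTAS into a regular Spark table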

Some great news:

  • I got everything passing (verified in both spark-3.3 and spark-3.4 profiles)!

Some ok news:

  • I am a bit worried about the double events we're getting from CTAS, but it looks like 3.4 is not affecting other commands since the remaining tests are passing.

Going to look into the POM piece now - there is a PR #793 addressing what I've done thus far. Would appreciate a look to see if we are fine with this approach.
