
Spark 3.4 support #705

Open
wajda opened this issue Jun 16, 2023 · 13 comments · May be fixed by #739
Comments

@wajda
Contributor

wajda commented Jun 16, 2023

There seem to be binary-incompatible changes in the APIs of Delta and Spark SQL that the Spline core compiled against Spark 2.4 cannot work with, e.g. RDDPlugin.

Todo:

  1. Check how many places in the code are actually incompatible and whether there is a quick, simple way to fix them (a rough sketch of one such direction follows this list).
  2. Otherwise, we'll probably need to move toward Spark-version-specific builds (Change the POM and CD configs to make agent-core Spark specific #604).
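
As a rough illustration of the first option (a minimal sketch with hypothetical names, not something the agent actually ships): where the code still links against both Spark versions, behaviour can be branched on the runtime version instead of in separate builds. Truly binary-incompatible call sites still need separate compilation, i.e. option 2.

import org.apache.spark.sql.SparkSession

// Hypothetical helper: parse the running Spark version so that a plugin
// (e.g. something like RDDPlugin) can pick a compatible code path.
object SparkVersionCompat {
  def majorMinor(spark: SparkSession): (Int, Int) = {
    val Array(major, minor, _*) = spark.version.split("\\.")
    (major.toInt, minor.toInt)
  }

  def isAtLeast(spark: SparkSession, major: Int, minor: Int): Boolean = {
    val (maj, min) = majorMinor(spark)
    maj > major || (maj == major && min >= minor)
  }
}

// usage: if (SparkVersionCompat.isAtLeast(spark, 3, 4)) newPath() else legacyPath()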
@wajda wajda added this to the 1.2.0 milestone Jun 16, 2023
@wajda wajda removed this from the 1.2.0 milestone Jun 16, 2023
wajda added a commit that referenced this issue Jun 19, 2023
@wajda wajda added this to the 2.1.0 milestone Aug 5, 2023
@cerveada
Contributor

Regarding Delta: As I understand it, they dropped support for using Delta without DSV2, which means Spark users will need to migrate to DSV2 Delta, but for us, it's no problem. All DeltaDSV2Spec tests are still passing.
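
For readers less familiar with the distinction, here is a minimal sketch of the two write styles (assuming a Delta-enabled local session; the paths, table name, and exact DSV2 routing depend on the Delta and Spark versions in use):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

val df = spark.range(5).toDF("id")

// older path-based write:
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

// DSV2-style write through the catalog (roughly what a DSV2-based spec exercises):
df.writeTo("events_v2").using("delta").createOrReplace()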

cerveada added a commit that referenced this issue Aug 30, 2023
@cerveada
Contributor

cerveada commented Aug 30, 2023

Integration tests not working:

  • DeltaSpec - not an issue, just the DSV2 Delta path works now
  • BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs" - seems like some optimization causes the lineage to take the data directly from the source instead of the intermediate table.
  • RddSpec - fixed by changing the plugin
  • KafkaSinkSpec - not fixed yet

Unit tests not working:

  • some are failing as well, this needs further investigation

@cerveada cerveada linked a pull request Aug 31, 2023 that will close this issue
@Neelotpaul

Neelotpaul commented Oct 10, 2023

Hi @wajda @cerveada, is there any timeline for when the agent will be available?

@wajda
Contributor Author

wajda commented Oct 11, 2023

Sorry, no :( We are completely buried in other work. Upgrading to Spark 3.4 is not a priority for our organisation at the moment. If somebody from the community is up for implementing it, we would be happy to accept pull requests and release an upgrade.

@wajda wajda added the help wanted Extra attention is needed label Oct 11, 2023
@cerveada
Contributor

Some work is already done in the linked draft PR.

@Neelotpaul

Thanks @wajda @cerveada for the update. It would be great if you could notify us when the development is completed. I will also check with our development team on the feasibility of building on the pull request.

@wajda
Contributor Author

wajda commented Oct 31, 2023

Unfortunately we can't give any ETA on this due to reprioritisation and team capacity. As Adam said, some work has started in the draft PR #739. It needs to be tested; it might require adding a separate bundle module for 3.4 and potentially other fixes. If you could help with it, that would be amazing. If you have any questions, please ask.

@cerveada
Contributor

There were some questions about what needs to be done to support the new Spark version. It comes down to two things:

  • Run all tests on the new Spark version, find out where Spark changed, and fix the agent to accommodate those changes without breaking anything in the older versions.
  • Create a POM for the new Spark uber-jar (we create one for each Spark version). The uber-jar should include everything the agent needs to run while excluding anything that is already available in Spark.

You can use #459 as inspiration for this.
The current PR #739 already solves part of the issues.

@wajda
Contributor Author

wajda commented Dec 22, 2023

Create a POM for the new Spark uber-jar

Just a tip: we use https://github.com/wajda/jars-pommefizer to generate a pom.xml from a downloaded Apache Spark distribution. Then make some minor manual corrections in the POM by comparing it with another similar POM in the project.

@rycowhi

rycowhi commented Mar 7, 2024

Hey there @wajda / @cerveada - after going through the above comments, I'm not looking to make any promises until I understand the full amount of work that might need to be done.

If we can get a full build running successfully based on #739 via the command below, what else is left to do?

mvn clean test -Pspark3.4 

Edit: To add some clarity here - if all the tests pass in this profile, does that address @cerveada's concern?

run all tests on the new Spark version, find out where spark changed and fix the agent to accommodate those changes without breaking anything in the older versions.

I see

  • Create 3.4 bundle POM per the above
  • Do we need to change something in the CI as well? I see PRs mention runs on each Spark version, but it looks like a 3.4 run would be needed too.

In addition to the above, would it be possible to point us in the right direction for the "BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs"" test failure? I seem to have resolved the Kafka one already and am trying to get a start on what seems to be the larger issue.

Thanks in advance!

@cerveada
Contributor

cerveada commented Mar 7, 2024

By all tests, I meant all unit tests and also all tests in the integration-tests Maven module. If you run mvn clean test -Pspark3.4 in the root folder of the repo, it should run all of them.

We use TeamCity for CI; we can modify it ourselves when this is ready.

@cerveada
Contributor

cerveada commented Mar 7, 2024

In addition to the above, would it be possible to point us in the right direction for "BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs"" test failures? I seem to have resolved the Kafka one already and trying to get a start on what seems to be the larger issue.

The test must validate that when you write to a data source and then read from it, the URI will be identical. To simulate this, I do something like this:

val df = spark.read.load(pathA)    // read the source A
df.write.saveAsTable("B")          // I compare this URI, but now it's A, not B

val df2 = spark.read.table("B")    // and this URI

I think the issue is that Spark will now give you the URI of the original data (A), not the artificial table (B) created from it. So the test must somehow be improved or modified to test the same thing as before.

Hope it makes sense; I don't remember the actual issue, but from what I wrote here before, I think this is it.

@rycowhi

rycowhi commented Mar 8, 2024

@cerveada Thanks this is helpful.

WRT the BasicIntegrationTests issue I found something interesting.

I'm still learning my way around the codebase but I gathered I could find differences by running the test in 3.3 then in 3.4, while printing out the logical plan in LineageHarvester with a quick print statement.

Interestingly enough, they look pretty close to the same, with a few new fields added in 3.4

Here comes the fun part - my print in LineageHarvester is done twice on this test in 3.3, but four times in 3.4!

In 3.3 this makes sense - the test runs and creates two lineage events since there were two writes. There are two CreateDataSourceTableAsSelectCommand commands in the output.

In 3.4 it gets weird - the same two events from above are there, but now two additional logical plans are created as well! Each write has an additional InsertIntoHadoopFsRelationCommand that corresponds to the data being inserted. Some of the semantics around this look a little different from the regular op mappings.

The test is then failing because the lineage captor for the second write is actually getting the second event for the first write. If I ignore the second event (by calling another captor) it actually passes! I don't know if this is the right thing to do given that Spline will be firing extra events.

I ran into this same issue while fixing another test - it appears Spark is doing this for CTAS into both regular Spark tables and Hive tables.
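
For anyone who wants to reproduce this outside of Spline, a minimal sketch (the listener and table name are just for illustration; run it once on a 3.3 session and once on a 3.4 one) that logs which command classes fire per write:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Print the analyzed-plan class of every successful execution. On 3.3 a CTAS should
// show up once (CreateDataSourceTableAsSelectCommand); on 3.4 an additional
// InsertIntoHadoopFsRelationCommand execution may show up as well.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName -> ${qe.analyzed.getClass.getSimpleName}")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
})

spark.range(10).write.saveAsTable("ctas_probe")   // CTAS into a regular Spark table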

Some great news:

  • I got everything passing (verified in both spark-3.3 and spark-3.4 profiles)!

Some ok news:

  • I am a bit worried about the double events we're getting from CTAS, but it looks like 3.4 is not affecting other commands since the remaining tests are passing.

Going to look into the POM piece now - there is a PR #793 addressing what I've done thus far. Would appreciate a look to see if we are fine with this approach.
