
Spline agent affecting databricks driver performance #747

Open
ganeshnikumbh opened this issue Sep 28, 2023 · 7 comments

Comments

@ganeshnikumbh

ganeshnikumbh commented Sep 28, 2023

Hi @wajda, @cerveada

We are using the Spline agent with Databricks and sending lineage via HTTP requests using the HTTP dispatcher. We use an Azure Function to collect the lineage. What we saw was that during high loads (and therefore high response times) on the function, if the agent is not able to establish a connection to the gateway, it keeps retrying every 2 minutes. During this time all operations on the cluster were hung. I am attaching the logs here for your reference. We had to remove the Spline installation and restart the cluster to make it normal. We are working on improving the Azure Function response time by sizing it correctly, but we would like to know whether we can change anything in the Spline settings as well to stop retries once the gateway connection has failed. We plan to install Spline on 100 clusters and do not want to lose the business team's trust. Please help!

log4j-2023-09-12-08 (1).log
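
For reference, the relevant part of the cluster Spark config in a setup like this looks roughly as follows (a minimal sketch; the producer URL and function name are placeholders, not actual values):

```properties
# Codeless init: register the Spline listener via the standard Spark listener property
spark.sql.queryExecutionListeners za.co.absa.spline.harvester.listener.SplineQueryExecutionListener

# Send lineage over HTTP to the lineage gateway (an Azure Function in this case)
spark.spline.lineageDispatcher http
spark.spline.lineageDispatcher.http.producer.url https://<function-app>.azurewebsites.net/api/producer
```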

@cerveada
Contributor

I don't know who is doing the retry, but the agent does not. It just initializes, tries to connect to the endpoint, and then fails; that's it. Something else must then be re-running the whole job, I guess?

You can disable the connection check at the HTTP dispatcher initialization, but if the endpoint is not available when the lineage is supposed to be sent, it will still fail at that point.

What versions of Databricks and Spark does this run on?

@wajda
Contributor

wajda commented Sep 29, 2023

We plan to install spline on 100 clusters and do not want to lose business team's trust. Please Help!

In production, you definitely want to decouple your main Spark jobs from any secondary dependencies. We recommend using a resilient messaging system for this purpose. The Spline agent comes with an embedded KafkaDispatcher, for example. Alternatively, you can set up a highly available HTTP gateway (maybe an Azure Function) that accepts connections from Spline and forwards the data to a messaging system, decoupling it from potentially expensive or unstable further processing of the lineage metadata. Such a technique would allow your Spark jobs to send the lineage info and carry on with their main work, making the whole system more robust.
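
For illustration, switching the agent to the bundled Kafka dispatcher is roughly a config change like this (a sketch; the topic name and bootstrap servers are placeholders):

```properties
spark.spline.lineageDispatcher kafka
spark.spline.lineageDispatcher.kafka.topic spline-lineage
spark.spline.lineageDispatcher.kafka.producer.bootstrap.servers broker-1:9092,broker-2:9092
```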

We had to remove the Spline installation and restart the cluster to make it normal

To temporarily disable Spline Agent you can simply set the property spline.mode=DISABLED. No need to actually uninstall it.
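
For example, in the cluster's Spark config (Spline properties can be passed through Spark conf with the spark. prefix):

```properties
spark.spline.mode DISABLED
```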

@ganeshnikumbh
Author

ganeshnikumbh commented Oct 5, 2023

I don't know who is doing the retry, but the agent does not. It just initializes, tries to connect to the endpoint, and then fails; that's it. Something else must then be re-running the whole job, I guess?

You can disable the connection check at the HTTP dispatcher initialization, but if the endpoint is not available when the lineage is supposed to be sent, it will still fail at that point.

What versions of Databricks and Spark does this run on?

Hi @cerveada, @wajda, we are using different DBR versions, e.g. 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12) and 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12), and we use the Spline agent version corresponding to the Spark version.

We also see that, even in the normal scenario (no load on the Azure function), when the cluster starts, Spline initialization happens twice. Please see the attached logs from the time of the cluster start. You will see the "Spark Lineage tracking is ENABLED" message two times, once at "23/10/05 07:48:21" and again at "23/10/05 07:48:37". Any idea why it is trying to enable itself twice?
log4j-clusterStart.txt

@cerveada
Contributor

cerveada commented Oct 5, 2023

Could you try to use programmatic initialization instead of codeless?
https://github.com/AbsaOSS/spline-spark-agent#initialization

According to this guide, there were issues with codeless init:
https://github.com/AbsaOSS/spline-getting-started/tree/main/spline-on-databricks
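
A minimal Scala sketch of the programmatic init (following the README; the rest of the Spline configuration stays in the usual spline.* properties):

```scala
import org.apache.spark.sql.SparkSession
import za.co.absa.spline.harvester.SparkLineageInitializer._

val spark = SparkSession.builder().getOrCreate()

// Registers the Spline listener explicitly on this session,
// instead of relying on spark.sql.queryExecutionListeners (codeless init).
spark.enableLineageTracking()
```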

@wajda
Contributor

wajda commented Oct 6, 2023

The init type is codeless; that is visible from the logs. Also, from what I can see, there must have been two independent Spark sessions or even contexts created. I don't know why that is happening, but it has nothing to do with Spline. The Spline agent is just a Spark listener registered via the Spark public API, that's it. The Spline listener doesn't hold any shared state, so if for some reason the Spark driver decides to create two instances of the same listener there should be no impact (though we didn't test this scenario, as normally it doesn't happen and listeners are shared between sessions). In other words, I don't know why the agent is initialized twice in your setup, but it hardly creates further issues by itself; you should get lineage normally.

Try switching the dispatcher from http to console or logging to remove the dependency on your Azure Function and see if it makes any difference. If it works and you see the lineage JSON in the logs, then the issue is definitely in your Azure Function.
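
For example, something like this in the cluster Spark config should make the agent write the lineage JSON to the console / driver logs instead of posting it over HTTP (a sketch):

```properties
spark.spline.lineageDispatcher console
# or, to route it through the logger instead of stdout:
# spark.spline.lineageDispatcher logging
```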

@ganeshnikumbh
Author

Sorry to bother you with this again, but receiving lineage is not an issue; even with the Azure function we are receiving lineage fine. Our only concerns are that Spline initializes twice at cluster start, and that when we had the function response issue, the agent kept looping trying to connect even after the first connection attempt failed. I'd appreciate it if you could check this when you get some time.

@wajda
Contributor

wajda commented Oct 16, 2023

As I tried to explain above, the only reason I see for multiple Spline inits is that there are multiple Spark inits. The Spark session might be repeatedly timing out and something re-runs your Spark job. Otherwise I cannot explain it. Try to enable DEBUG or even TRACE log level and see what's happening.
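
For example, with the log4j 1.x configuration used by older DBR runtimes, raising verbosity for the Spline packages would look roughly like this (a sketch; newer DBR runtimes ship log4j 2, where the equivalent is a logger entry in log4j2.properties):

```properties
log4j.logger.za.co.absa.spline=TRACE
```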
