Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock issue when generating CC and retrieving cohort characterization analysis list #700

Closed
chrisknoll opened this issue Nov 17, 2018 · 12 comments
Assignees
Projects
Milestone

Comments

@chrisknoll
Copy link
Collaborator

There appears to be some kind of deadlock issue when submitting/processing cohort characterization that is leading to a deadlock when retrieving cohort characterization list. Below is screenshot of the request for the cohort characterization, note the 9.5 minute execution time (which did finally return results:

image

It is unclear that the actual execution of the CC analysis is leading to this behavior, but it has been reproduced at least 2 times in our internal systems.

This is on an internal deployment of v2.6.0 (not from latest master).

@chrisknoll
Copy link
Collaborator Author

This is also happenign when pulling up incidence rate analysis lists:
image

It hung for 4 minutes, but would have been longer if @anthonysena didn't kill some locks on the WebAPI db. @anthonysena, can you provide details on which tables you saw locks on?

@anthonysena
Copy link
Collaborator

Didn't grab the details this time around but will next time this happens.

@anthonysena anthonysena added this to the V2.7.0. "Bug free" milestone Feb 11, 2019
@anthonysena
Copy link
Collaborator

May be related to #797

@pavgra
Copy link
Contributor

pavgra commented Feb 13, 2019

@anthonysena / @chrisknoll , cannot reproduce this on a local instance of the latest WebAPI + MS SQL 2017. Are you able to test the latest code in your env (epi.jnj.com) once more?

@anthonysena anthonysena mentioned this issue Feb 19, 2019
8 tasks
@chrisknoll
Copy link
Collaborator Author

Still seeing this behavior in our environment, here's a capture of various calls that take >4 minutes to complete:
image

I'm very worried about the stability of the codebase for the 2.7 release. I am unable to determine if it is our own internal enviornment that is the issue, or if it has been the commits to master that have made the problem more pronounced, but my completely unscientific perception (since I did not write down performance metrics between merges to master) is that the situation is getting worse.

@chrisknoll
Copy link
Collaborator Author

Bumping this: we're seeing this now in our beta environment after a lot of generation activity was put into the job logs. The way this is impacting a client is: the browser only has a fixed number of active connections (in Chrome I think it is 6). As we have various background activities these HTTP connections start go get consumed, and soon we get to a point where all the active connections are busy, and all the network activity looks like 'pending'. This causes stalls in the app. BTW: This connection limit is across all tabs so having 6 browser sessions open will eat up all your browser connections.

We seem to see this focused around the /notifications endpoint. I'm looking at the service call:

            final List<JobExecution> page = jobExecutionDao.getJobExecutions(start, PAGE_SIZE);

Looking at the object that's returned, it has a lot of useless stuff for the /notifications endpoint. And I wonder if there's a way to just query only the information we need out of the batch tables directly.

@pavgra , @wivern : any thoughts?

@chrisknoll
Copy link
Collaborator Author

chrisknoll commented Mar 28, 2019

Update:

The below query may be misleading, as people report not to use sysprocesses but instead use the other views. I will post information to this thread once we see an active blocked connection.

Additional info:

Looking at the sleeping connections in the database, a few of them actually have open transactions:
image

We have many more connections open to the server (from the DB connection pool) but only these 3 connections show an open transaction, whith the following query:

(@P0 bigint)SELECT JOB_EXECUTION_ID, KEY_NAME, TYPE_CD, STRING_VAL, DATE_VAL, LONG_VAL, DOUBLE_VAL, IDENTIFYING from dbo.BATCH_JOB_EXECUTION_PARAMS where JOB_EXECUTION_ID = @P0        

Note, the dbid = 7 from the image is related to our beta database, 13 is our production. So, there are 2 connections with open transactions on our beta environment with the above query.

@pavgra
Copy link
Contributor

pavgra commented Jun 4, 2019

Received deadlock on Postgres, w/o running CC. The deadlock seems to be related to Spring Batch:
image

@anthonysena anthonysena added this to TO DO in Atlas v2.8 Dec 2, 2019
@olga-ganina olga-ganina assigned olga-ganina and unassigned pavgra Dec 17, 2019
@olga-ganina olga-ganina moved this from TO DO to In Progress in Atlas v2.8 Feb 11, 2020
@YaroslavTir
Copy link

@chrisknoll @anthonysena Does the issue still happen in your environment? and if so, is there any way to reproduce it, or any clue how to reproduce it.

@chrisknoll
Copy link
Collaborator Author

We're now on PostgreSQL as our WebAPI host, so I think we're avoiding the issue.

The issue is localized to MSSQL server and how spring batch uses a sequence table to create batchIDs (but requires a full table scan). It is also hard to reproduce, because apparantly table geometry matters (ie: if there's a clustered index on the table or not). In addition, we'll eventually only be supporting PostgreSQL as the WebAPI database so, we could just leave this as a known issue.

But the real solution to this (if we want to solve this) is to try to implement a custom sequencer for spring batch, described in this PR: #834, but direct link is here: spring-projects/spring-framework#21425

@YaroslavTir
Copy link

@chrisknoll thank you for such a detailed answer. You save me a lot of time. I see that the issue has a long history. But what should we do for now ? just close the story?

@chrisknoll
Copy link
Collaborator Author

I think we can close it here as a known issue, and hope that eventually spring batch will update their support to SqlServer 2012 or 2016. As noted in the spring-framework thread, the SqlServer 2008 is going End Of Life this year, so they could drop support for that platform in a future release.

@YaroslavTir YaroslavTir moved this from In Progress to TESTED in Atlas v2.8 Feb 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Atlas v2.8
  
TESTED
Development

No branches or pull requests

5 participants