Deadlock issue when generating CC and retrieving cohort characterization analysis list #700

chrisknoll · 2018-11-17T07:54:17Z

There appears to be a deadlock when submitting or processing a cohort characterization (CC) that causes the request for the cohort characterization list to hang. Below is a screenshot of the request for the cohort characterization; note the 9.5 minute execution time (it did eventually return results):

It is unclear whether the actual execution of the CC analysis is causing this behavior, but it has been reproduced at least twice in our internal systems.

This is on an internal deployment of v2.6.0 (not from latest master).

chrisknoll · 2019-01-22T15:36:49Z

This is also happening when pulling up incidence rate analysis lists:

It hung for 4 minutes, but would have been longer if @anthonysena didn't kill some locks on the WebAPI db. @anthonysena, can you provide details on which tables you saw locks on?

anthonysena · 2019-01-22T15:40:12Z

Didn't grab the details this time around but will next time this happens.

anthonysena · 2019-02-11T15:23:48Z

May be related to #797

pavgra · 2019-02-13T18:27:51Z

@anthonysena / @chrisknoll , cannot reproduce this on a local instance of the latest WebAPI + MS SQL 2017. Are you able to test the latest code in your env (epi.jnj.com) once more?

chrisknoll · 2019-02-27T18:51:20Z

Still seeing this behavior in our environment, here's a capture of various calls that take >4 minutes to complete:

I'm very worried about the stability of the codebase for the 2.7 release. I am unable to determine whether the issue is our own internal environment or whether commits to master have made the problem more pronounced, but my completely unscientific perception (since I did not write down performance metrics between merges to master) is that the situation is getting worse.

chrisknoll · 2019-03-28T22:23:23Z

Bumping this: we're seeing this now in our beta environment after a lot of generation activity was put into the job logs. The way this is impacting a client is: the browser only has a fixed number of active connections (in Chrome I think it is 6). As we have various background activities, these HTTP connections start to get consumed, and soon we get to a point where all the active connections are busy and all the network activity shows as 'pending'. This causes stalls in the app. BTW: this connection limit is across all tabs, so having 6 browser sessions open will eat up all your browser connections.

We seem to see this focused around the /notifications endpoint. I'm looking at the service call:

            final List<JobExecution> page = jobExecutionDao.getJobExecutions(start, PAGE_SIZE);

Looking at the object that's returned, it has a lot of useless stuff for the /notifications endpoint. And I wonder if there's a way to just query only the information we need out of the batch tables directly.
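As a rough illustration of the "query only what we need" idea, here is a minimal plain-JDBC sketch. It assumes the default Spring Batch metadata tables (`BATCH_JOB_EXECUTION`, `BATCH_JOB_INSTANCE`) and guesses that /notifications only needs the job name, status, and start/end times; that field list is an assumption, not something confirmed in this thread. `NotificationJobQuery` and `JobSummary` are hypothetical names, and the `OFFSET ... FETCH` paging syntax assumes SQL Server 2012+ or PostgreSQL.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads a page of job summaries without loading
// job parameters or execution contexts.
final class NotificationJobQuery {

    // Hypothetical slim row holding only what a notification list might show.
    static final class JobSummary {
        final long executionId;
        final String jobName;
        final String status;
        final Timestamp startTime;
        final Timestamp endTime;

        JobSummary(long executionId, String jobName, String status,
                   Timestamp startTime, Timestamp endTime) {
            this.executionId = executionId;
            this.jobName = jobName;
            this.status = status;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    // One statement per page; BATCH_JOB_EXECUTION_PARAMS is never touched.
    static final String PAGE_SQL =
        "SELECT e.JOB_EXECUTION_ID, i.JOB_NAME, e.STATUS, e.START_TIME, e.END_TIME "
        + "FROM BATCH_JOB_EXECUTION e "
        + "JOIN BATCH_JOB_INSTANCE i ON i.JOB_INSTANCE_ID = e.JOB_INSTANCE_ID "
        + "ORDER BY e.JOB_EXECUTION_ID DESC "
        + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";

    static List<JobSummary> fetchPage(Connection conn, int start, int pageSize)
            throws SQLException {
        List<JobSummary> page = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(PAGE_SQL)) {
            ps.setInt(1, start);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.add(new JobSummary(
                        rs.getLong(1), rs.getString(2), rs.getString(3),
                        rs.getTimestamp(4), rs.getTimestamp(5)));
                }
            }
        }
        return page;
    }
}
```

The point of the sketch is that each page costs a single short read, rather than one parameter lookup per job execution. Whether that would actually relieve the locking seen here is untested.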

@pavgra , @wivern : any thoughts?

chrisknoll · 2019-03-28T23:45:27Z

Update:

The query below may be misleading, since people recommend against using sysprocesses in favor of the other views. I will post information to this thread once we see an active blocked connection.

Additional info:

Looking at the sleeping connections in the database, a few of them actually have open transactions:

We have many more connections open to the server (from the DB connection pool), but only these 3 connections show an open transaction, with the following query:

(@P0 bigint)SELECT JOB_EXECUTION_ID, KEY_NAME, TYPE_CD, STRING_VAL, DATE_VAL, LONG_VAL, DOUBLE_VAL, IDENTIFYING from dbo.BATCH_JOB_EXECUTION_PARAMS where JOB_EXECUTION_ID = @P0

Note: dbid = 7 in the image is our beta database, and 13 is our production database. So there are 2 connections with open transactions on our beta environment with the above query.

pavgra · 2019-06-04T16:01:52Z

Received a deadlock on Postgres without running CC. The deadlock seems to be related to Spring Batch:

YaroslavTir · 2020-02-11T16:06:15Z

@chrisknoll @anthonysena Does the issue still happen in your environment? If so, is there any way to reproduce it, or any clue as to how to reproduce it?

chrisknoll · 2020-02-11T18:18:30Z

We're now on PostgreSQL as our WebAPI host, so I think we're avoiding the issue.

The issue is localized to MS SQL Server and how Spring Batch uses a sequence table to create batch IDs (which requires a full table scan). It is also hard to reproduce, because apparently table geometry matters (i.e., whether or not there's a clustered index on the table). In addition, we'll eventually only be supporting PostgreSQL as the WebAPI database, so we could just leave this as a known issue.

But the real solution to this (if we want to solve it) is to implement a custom sequencer for Spring Batch, as described in PR #834; the direct link is spring-projects/spring-framework#21425
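To make the custom sequencer idea concrete, here is a minimal sketch that draws IDs from a native SQL Server sequence (`NEXT VALUE FOR`, available from SQL Server 2012) instead of a sequence table. It is not the code from the linked PR. `NativeSequenceIncrementer` and the sequence name used in the comment are hypothetical, the native sequence is assumed to have been created beforehand, and wiring this into Spring Batch's incrementer factory is not shown.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.regex.Pattern;
import javax.sql.DataSource;

// Hypothetical sequencer backed by a native database sequence,
// e.g. one created as dbo.BATCH_JOB_EXECUTION_NATIVE_SEQ (assumed name).
final class NativeSequenceIncrementer {

    // Accept only plain or schema-qualified identifiers, because the
    // sequence name is concatenated into the SQL text.
    private static final Pattern SAFE_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)?");

    private final DataSource dataSource;
    private final String sql;

    NativeSequenceIncrementer(DataSource dataSource, String sequenceName) {
        this.dataSource = dataSource;
        this.sql = nextValueSql(sequenceName);
    }

    static String nextValueSql(String sequenceName) {
        if (sequenceName == null || !SAFE_NAME.matcher(sequenceName).matches()) {
            throw new IllegalArgumentException("Unsafe sequence name: " + sequenceName);
        }
        return "SELECT NEXT VALUE FOR " + sequenceName;
    }

    // Fetches the next ID in one statement; no sequence-table row is
    // inserted or deleted.
    long nextLongValue() throws SQLException {
        try (Connection conn = dataSource.getConnection();
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            if (!rs.next()) {
                throw new SQLException("Sequence returned no row: " + sql);
            }
            return rs.getLong(1);
        }
    }
}
```

Whether this avoids the locking described above would need to be confirmed on an affected SQL Server instance.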

YaroslavTir · 2020-02-12T09:47:27Z

@chrisknoll thank you for such a detailed answer. You saved me a lot of time. I see that the issue has a long history. But what should we do for now? Just close the story?

chrisknoll · 2020-02-12T14:35:28Z

I think we can close it here as a known issue, and hope that eventually Spring Batch will update their support to SqlServer 2012 or 2016. As noted in the spring-framework thread, SqlServer 2008 reaches end of life this year, so they could drop support for that platform in a future release.

chrisknoll mentioned this issue Nov 20, 2018

Cohort Characterization generation causes other cohort characterizations screens to hang OHDSI/Atlas#1136

Closed

anthonysena added the bug label Nov 21, 2018

anthonysena added this to the V2.7.0. "Bug free" milestone Feb 11, 2019

anthonysena mentioned this issue Feb 19, 2019

Release V2.7.0 #843

Closed

8 tasks

anthonysena modified the milestones: V2.7.0. "Bug free", V2.8.0. "Bug free + optimized performance" Apr 8, 2019

anthonysena modified the milestones: V2.8.0. "Bug free + optimized performance", V2.8.1 May 14, 2019

anthonysena modified the milestones: V2.8.1, V2.8.0. "Bug free + optimized performance" Jun 7, 2019

anthonysena added the system label Jul 2, 2019

pbr6cornell assigned pavgra Jul 14, 2019

anthonysena added this to TO DO in Atlas v2.8 Dec 2, 2019

olga-ganina assigned olga-ganina and unassigned pavgra Dec 17, 2019

olga-ganina assigned YaroslavTir and unassigned olga-ganina Feb 11, 2020

olga-ganina moved this from TO DO to In Progress in Atlas v2.8 Feb 11, 2020

YaroslavTir closed this as completed Feb 13, 2020

YaroslavTir moved this from In Progress to TESTED in Atlas v2.8 Feb 25, 2020
