
Sync To User perf improvements (part 2) #1926

Open · kitallis opened this issue Dec 29, 2020 · 0 comments

kitallis commented Dec 29, 2020

We recently put up some sync performance fixes to ensure that block-level syncs are fast under a reasonably large load, to allow for a safe incremental release. After pushing these changes to production, we discovered that the improvements actually had a negative impact on regular FacilityGroup syncing.

The impact wasn't large, but it was enough that we decided to revert the changes.

### what actually causes the slowness

It lies in a couple of distinct places. The first is the query that fetches the records for "other facilities" (i.e. not the current facility); the second is how many times that query is called during a single request-response cycle.

The first one is more nuanced than just writing a faster query. In this diff from the revert, notice the reverted changes (in red):

[screenshot: diff from the revert]

This piece of code was one of the bottlenecks:

```ruby
current_sync_region.syncable_patients.pluck(:id) - current_facility.prioritized_patients.pluck(:id)
```

This creates a kind of "inversion" point: up to a certain number of syncable_patients, this approach is faster for most syncs. For FacilityGroups, however, that number is typically very large, which makes this slower than the original query itself.

After running some benchmarks, we discovered that the inversion point is somewhere around 4-5k syncable_patients.

It stands to reason that if we simply branched the code on a count of syncable_patients, we'd achieve good results in both scenarios (see the sketch below) – this is mostly true, but there's more...
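As a rough illustration, such a branch might look like the following; `INVERSION_THRESHOLD` and the method name are hypothetical, with the cutoff taken from the benchmarks above:

```ruby
# Hypothetical branching around the inversion point (~4-5k syncable patients).
# INVERSION_THRESHOLD and this method are illustrative assumptions, not the
# project's actual code.
INVERSION_THRESHOLD = 5_000

def other_facility_patient_ids
  if current_sync_region.syncable_patients.count < INVERSION_THRESHOLD
    # Small regions: plucking both id sets and subtracting in Ruby wins.
    current_sync_region.syncable_patients.pluck(:id) -
      current_facility.prioritized_patients.pluck(:id)
  else
    # Large regions (e.g. FacilityGroups): let the database do the exclusion.
    current_sync_region.syncable_patients
                       .where.not(id: current_facility.prioritized_patients.select(:id))
                       .pluck(:id)
  end
end
```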

To understand precisely why the original query is slow, it's worth looking at its bottlenecks in different scenarios. From further benchmarking, we find that this query,

```ruby
region_records
  .where.not(registration_facility: current_facility)
  .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
```

is actually blazing fast for a large number of region_records (or a large number of syncable patients), but extremely slow when that number is very small. This is quite unintuitive behavior, but looking at the EXPLAIN and the query, it's clear that the primary limiting factor is the LIMIT.

Effectively, the probability of finding the first 1000 BPs among 30k patients is higher than finding the first 1000 BPs among 2k patients. In other words, finding the first 1000 rows matching a condition X becomes slower as X becomes more restrictive, because the scan has to walk over many more non-matching rows before the LIMIT is satisfied.
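A hedged sketch of how to observe this in an EXPLAIN; `wide_patient_ids` and `narrow_patient_ids` are hypothetical stand-ins, and the explicit order/limit only approximates what `updated_on_server_since` does:

```ruby
# Illustrative only: same query shape, different selectivity.
[wide_patient_ids, narrow_patient_ids].each do |patient_ids|
  scope = BloodPressure
            .where(patient_id: patient_ids)
            .order(:updated_at)
            .limit(1_000)
  # With ~30k ids, the updated_at index scan fills LIMIT 1000 quickly;
  # with ~2k ids, most scanned rows fail the patient filter, so the scan
  # walks much further before the limit is satisfied.
  puts scope.explain
end
```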

### any

If we look at the slower EXPLAIN, we notice that most of the time is spent in the loops. We're looping over each blood pressure and checking whether there's a matching patient for it. As explained earlier, this is generally very inefficient.

Changing this query to an ANY(array()) reaps benefits: EXPLAIN. Notice how the loops have dramatically reduced. What we've done is effectively an in-built ANY check (on an array) rather than iterating over each BP.
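A hedged sketch of the rewrite, assuming Postgres; the `to_sql` interpolation is purely for illustration and the names follow the snippets above, not the project's exact code:

```ruby
# Force `patient_id = ANY (ARRAY(subquery))` instead of the
# `patient_id IN (subquery)` that where(patient: ...) generates by default.
# ARRAY(...) makes Postgres materialize the patient ids once, so each blood
# pressure row is checked against an in-memory array instead of being joined
# row-by-row in a nested loop.
patient_ids_sql = region.syncable_patients.select(:id).to_sql

BloodPressure
  .with_discarded
  .where("blood_pressures.patient_id = ANY (ARRAY(#{patient_ids_sql}))")
  .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
```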

This isn't as fast as the fastest regular query, but it gives better performance overall, normalized across the different cases[1].

However, if we take a step back, we aren't really optimizing for mass-resyncing a FacilityGroup; all we really need is that:

- Normal FG syncs finish in reasonable time
- Bulk block re-syncs happen fast

The query itself is unnecessarily complicated, since it combines two different where and where.not clauses:

```ruby
BloodPressure
  .with_discarded
  .where(patient: Patient.syncable_to_region(region))
  .where.not(patient: prioritized_patients)
  .updated_on_server_since(current_facility_processed_since, limit)
```

can be rewritten as,

```ruby
BloodPressure
  .with_discarded
  .where(patient: region.syncable_patients)
  .where.not(patient: prioritized_patients)
  .updated_on_server_since(current_facility_processed_since, limit)
```

can be rewritten as,

```ruby
BloodPressure
  .with_discarded
  .where(patient: region.syncable_patients.where.not(facility: current_facility))
  .updated_on_server_since(current_facility_processed_since, limit)
```

which both involves fewer indirections and is faster, since we avoid an entire separate where.not clause.

### memoization

We end up calling records_to_sync / current_facility_records / other_facility_records multiple times in a single request. Rails helps with query caching, but a fair few of these queries carry added modifiers like count, which don't really hit the per-request Rails query cache. Memoizing these queries gives us a ton of gains. It also has a nice side-effect: benchmarking stays on an even keel between the different queries, since all the reprocessing noise is eliminated.
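A minimal sketch of the memoization, assuming instance-variable caching inside the sync controller/concern; the method name follows the ones above, and the body is a placeholder:

```ruby
# Hedged sketch of per-request memoization. `.load` materializes the
# relation once, so later enumerations (and .size) hit memory instead
# of re-running the query.
def other_facility_records
  @other_facility_records ||=
    region_records
      .where(patient: current_sync_region.syncable_patients.where.not(facility: current_facility))
      .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
      .load
end
```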

### what indexes are we using now?

With the final set of queries, we tend to use only the following (sketched as a migration below):

- the updated_at index on all the models except Patient
- a compound index of id and updated_at on Patient
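A hedged sketch of those index shapes as a Rails migration; the table names and migration class are assumptions, and the real schema may differ:

```ruby
# Illustrative only: the two index shapes described above.
class SketchSyncIndexes < ActiveRecord::Migration[5.2]
  def change
    # updated_at index on every synced model except Patient, e.g.:
    add_index :blood_pressures, :updated_at

    # Compound (id, updated_at) index on Patient:
    add_index :patients, [:id, :updated_at]
  end
end
```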

My recommendation would be to keep the existing index infrastructure, monitor index usage periodically, and remove the non-essential ones.

### does this work?

Before and after graphs of perf for both Block and FG:

master code with 5% of users resyncing FacilityGroup
[screenshot]

sync-perf-fixes code with 5% of users resyncing Block
[screenshot]

master code: DB CPU utilization with 5% of users resyncing FacilityGroup
[screenshot]

sync-perf-fixes code: DB CPU utilization with 5% of users resyncing Block
[screenshot]

master code with 1 instance running (8 vCPUs)
[screenshot]

sync-perf-fixes code with 1 instance running (8 vCPUs)
[screenshot]

I'm confident that these changes should work for a 5% rollout of block-sync alongside the existing regular FG syncs as they happen.

### references

[1] The cases we've been testing are: fetch 1000 BPs and Patients from the beginning; fetch 1000 BPs and Patients from different time segments between 2018 and 2020.
