
Sync To User perf improvements (part 2) #1926

Open · kitallis opened this issue Dec 29, 2020 · 0 comments

kitallis commented Dec 29, 2020

We recently put up some sync performance fixes to ensure that block-level syncs are fast under a reasonably large load, to allow for a safe incremental release. After pushing these changes to production, we discovered that the improvements actually had a negative impact on regular FacilityGroup syncing.

The impact wasn't large, but it was enough that we decided to revert the changes.

### what actually causes the slowness

It lies in a couple of distinct places. The first is the query that fetches the records for "other facilities" (i.e. not the current facility); the second is how many times that query is called during a single request-response cycle.

The first one is more nuanced than just writing a faster query. In this diff from the revert, notice the reverted changes (in red):

[screenshot: diff from the revert]

This piece of code was one of the bottlenecks:

```ruby
current_sync_region.syncable_patients.pluck(:id) - current_facility.prioritized_patients.pluck(:id)
```

This creates a kind of "inversion" point: up to a certain number of syncable_patients, this approach is faster for most syncs. For FacilityGroups, however, that number is typically very large, which makes this slower than the original query itself.

After running some benchmarks, we discovered that the inversion point is somewhere around 4-5k syncable_patients.

It stands to reason that if we simply branched the code on a count of syncable_patients, we'd achieve good results in both scenarios (see the sketch below) – this is mostly true, but there's more...
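As a rough illustration, such a branch might look like the following; `INVERSION_THRESHOLD` and the method name are hypothetical, with the cutoff taken from the benchmarks above:

```ruby
# Hypothetical branching around the inversion point (~4-5k syncable patients).
# INVERSION_THRESHOLD and this method are illustrative assumptions, not the
# project's actual code.
INVERSION_THRESHOLD = 5_000

def other_facility_patient_ids
  if current_sync_region.syncable_patients.count < INVERSION_THRESHOLD
    # Small regions: plucking both id sets and subtracting in Ruby wins.
    current_sync_region.syncable_patients.pluck(:id) -
      current_facility.prioritized_patients.pluck(:id)
  else
    # Large regions (e.g. FacilityGroups): let the database do the exclusion.
    current_sync_region.syncable_patients
                       .where.not(id: current_facility.prioritized_patients.select(:id))
                       .pluck(:id)
  end
end
```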

To understand precisely why the original query is slow, it's worth looking at its bottlenecks in different scenarios. From further benchmarking, we find that this query,

```ruby
region_records
  .where.not(registration_facility: current_facility)
  .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
```

is actually blazing fast for a large number of region_records (or a large number of syncable patients), but extremely slow when that number is very small. This is quite unintuitive behavior, but looking at the EXPLAIN and the query, it's clear that the primary limiting factor is the LIMIT.

Effectively, the probability of finding the first 1000 BPs among 30k patients is higher than finding the first 1000 BPs among 2k patients. In other words, finding the first 1000 rows matching a condition X becomes slower as X becomes more restrictive, because the scan has to walk over many more non-matching rows before the LIMIT is satisfied.
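A hedged sketch of how to observe this in an EXPLAIN; `wide_patient_ids` and `narrow_patient_ids` are hypothetical stand-ins, and the explicit order/limit only approximates what `updated_on_server_since` does:

```ruby
# Illustrative only: same query shape, different selectivity.
[wide_patient_ids, narrow_patient_ids].each do |patient_ids|
  scope = BloodPressure
            .where(patient_id: patient_ids)
            .order(:updated_at)
            .limit(1_000)
  # With ~30k ids, the updated_at index scan fills LIMIT 1000 quickly;
  # with ~2k ids, most scanned rows fail the patient filter, so the scan
  # walks much further before the limit is satisfied.
  puts scope.explain
end
```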

### any

If we look at the slower EXPLAIN, we notice that most of the time is spent in the loops. We're looping over each blood pressure and checking whether there's a matching patient for it. As explained earlier, this is generally very inefficient.

Changing this query to an ANY(array()) reaps benefits: EXPLAIN. Notice how the loops have dramatically reduced. What we've done is effectively an in-built ANY check (on an array) rather than iterating over each BP.
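A hedged sketch of the rewrite, assuming Postgres; the `to_sql` interpolation is purely for illustration and the names follow the snippets above, not the project's exact code:

```ruby
# Force `patient_id = ANY (ARRAY(subquery))` instead of the
# `patient_id IN (subquery)` that where(patient: ...) generates by default.
# ARRAY(...) makes Postgres materialize the patient ids once, so each blood
# pressure row is checked against an in-memory array instead of being joined
# row-by-row in a nested loop.
patient_ids_sql = region.syncable_patients.select(:id).to_sql

BloodPressure
  .with_discarded
  .where("blood_pressures.patient_id = ANY (ARRAY(#{patient_ids_sql}))")
  .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
```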

This isn't as fast as the fastest regular query, but it gives better performance overall, normalized across the different cases[1].

However, if we take a step back, we aren't really optimizing for mass-resyncing a FacilityGroup; all we really need is that:

- Normal FG syncs finish in reasonable time
- Bulk block re-syncs happen fast

The query itself is unnecessarily complicated, since it combines two different where and where.not clauses:

```ruby
BloodPressure
  .with_discarded
  .where(patient: Patient.syncable_to_region(region))
  .where.not(patient: prioritized_patients)
  .updated_on_server_since(current_facility_processed_since, limit)
```

can be rewritten as,

```ruby
BloodPressure
  .with_discarded
  .where(patient: region.syncable_patients)
  .where.not(patient: prioritized_patients)
  .updated_on_server_since(current_facility_processed_since, limit)
```

can be rewritten as,

```ruby
BloodPressure
  .with_discarded
  .where(patient: region.syncable_patients.where.not(facility: current_facility))
  .updated_on_server_since(current_facility_processed_since, limit)
```

which both involves fewer indirections and is faster, since we avoid an entire separate where.not clause.

### memoization

We end up calling records_to_sync / current_facility_records / other_facility_records multiple times in a single request. Rails helps with query caching, but a fair few of these queries carry added modifiers like count, which don't really hit the per-request Rails query cache. Memoizing these queries gives us a ton of gains. It also has a nice side-effect: benchmarking stays on an even keel between the different queries, since all the reprocessing noise is eliminated.
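A minimal sketch of the memoization, assuming instance-variable caching inside the sync controller/concern; the method name follows the ones above, and the body is a placeholder:

```ruby
# Hedged sketch of per-request memoization. `.load` materializes the
# relation once, so later enumerations (and .size) hit memory instead
# of re-running the query.
def other_facility_records
  @other_facility_records ||=
    region_records
      .where(patient: current_sync_region.syncable_patients.where.not(facility: current_facility))
      .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
      .load
end
```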

### what indexes are we using now?

With the final set of queries, we tend to use only the following (sketched as a migration below):

- the updated_at index on all the models except Patient
- a compound index of id and updated_at on Patient
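A hedged sketch of those index shapes as a Rails migration; the table names and migration class are assumptions, and the real schema may differ:

```ruby
# Illustrative only: the two index shapes described above.
class SketchSyncIndexes < ActiveRecord::Migration[5.2]
  def change
    # updated_at index on every synced model except Patient, e.g.:
    add_index :blood_pressures, :updated_at

    # Compound (id, updated_at) index on Patient:
    add_index :patients, [:id, :updated_at]
  end
end
```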

My recommendation would be to keep the existing index infrastructure, monitor index usage periodically, and remove the non-essential ones.

### does this work?

Before and after graphs of perf for both Block and FG:

master code with 5% of users resyncing FacilityGroup
[screenshot]

sync-perf-fixes code with 5% of users resyncing Block
[screenshot]

master code: DB CPU utilization with 5% of users resyncing FacilityGroup
[screenshot]

sync-perf-fixes code: DB CPU utilization with 5% of users resyncing Block
[screenshot]

master code with 1 instance running (8 vCPUs)
[screenshot]

sync-perf-fixes code with 1 instance running (8 vCPUs)
[screenshot]

I'm confident that these changes should work for a 5% rollout of block-sync alongside the existing regular FG syncs as they happen.

### references

[1] The cases we've been testing are: fetch 1000 BPs and Patients from the beginning; fetch 1000 BPs and Patients from different time segments between 2018 and 2020.
