Skip to content
This repository has been archived by the owner on Aug 26, 2022. It is now read-only.

Address MDN performance issue #6606

Closed
atopal opened this issue Feb 27, 2020 · 13 comments
Closed

Address MDN performance issue #6606

atopal opened this issue Feb 27, 2020 · 13 comments
Labels
p2 Medium Priority: One of the next sprints

Comments

@atopal
Copy link
Contributor

atopal commented Feb 27, 2020

Summary
time to interactive examples for 90th percentile has regressed by .3s some time after February 7.

Steps To Reproduce (STR)
[see 90th percentile data](MDN time to interactive examples for 90th percentile has regressed by .3s)

Actual behavior
90th percentile is at 3.7s

Expected behavior
90th percentile should be at 3.4s

Additional context
Is there anything else we should know?

@atopal atopal added the status: needs triage Status: Untriaged label Feb 27, 2020
@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

Here are the deployments that happened around the time that this issue cropped up:

The issue definitely was not happening prior to February 6th, and definitely was happening by February 13th. If the regression was caused by a change to our code, it must have been a change in this range: f48f3fd...7778679

@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

That range covers the following individual commits. I've struck through ones that absolutely could not have contributed to this regression.

Of particular note, we know that 669ec0a (Switch to async GA tag) cannot be at issue because we reverted it and re-deployed last Wednesday, and the performance regression is still present in our data.

@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

That leaves us with the much more manageable list of the following 9 commits that we can't trivially exclude:

@atopal
Copy link
Contributor Author

atopal commented Mar 2, 2020

Confirming that yes, #6601 didn't have a performance impact. Thanks Dan for divining into this and making this issue much more actionable.

What's the suggestion for next steps here? Do we pick the things that that are easily revertible and try them first? We need at least 1 full day of data to verify the impact.

@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

There's a pretty good chance we could exclude half of the commits above, too, but they just require a slightly closer read than that first pass :)

I think the next step is to have two devs hop on Zoom, synchronously read those commits, and come up with a prioritized list of ones we should try reverting.

Unfortunately, this is unlikely to happen this week: Django 2 has been merged and we need to get it deployed ASAP so we're ready for the security patch that's landing Wednesday morning. And then we'll probably need a few days to see if Django 2 itself changes our baseline performance before we can start testing those commits 🙁

@atopal
Copy link
Contributor Author

atopal commented Mar 2, 2020

That schedule works for me. It's important we find the issue and fix it, so we don't end up compounding performance regressions, but it's not time sensitive.

@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

Gregor and I spent half an hour reading each commit line-by-line, and there's nothing that we can conceive of impacting frontend performance in any way.

The Stripe (d477a40) and jQuery UI (ec74256) commits were the only ones to really touch the frontend in any meaningful way, but both are constrained to the Wiki.

At this point, we think the regression is unlikely to have been caused by code changes in Kuma.

@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

I don't know where else to look, but perhaps existing Waffle flags or settings were changed in the Django Admin, maybe there was a new deploy of the Interactive Examples service? Or something else entirely?

@atopal
Copy link
Contributor Author

atopal commented Mar 2, 2020

Hmm, we've had something in the soapbox for the last 2 weeks roughly. Maybe that?

@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

Could be! Is that scheduled to end any time soon?

@tobinmori tobinmori removed the status: needs triage Status: Untriaged label Mar 2, 2020
@callahad
Copy link
Contributor

callahad commented Mar 2, 2020

NOTE: This is measuring Time to Interactive Examples, a custom metric that we track, not the more generic "TTI" metric in, say, Lighthouse.

@tobinmori
Copy link

tobinmori commented Mar 2, 2020

@atopal will check on soapbox and turn it off wed 3/11.

@tobinmori tobinmori added the p2 Medium Priority: One of the next sprints label Mar 2, 2020
@atopal
Copy link
Contributor Author

atopal commented Mar 23, 2020

okay, we have an alternative explanation: Corona

@atopal atopal closed this as completed Mar 23, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
p2 Medium Priority: One of the next sprints
Projects
None yet
Development

No branches or pull requests

3 participants