Profile memory usage from processing HTML contents #2088

Open · wants to merge 1 commit into main

Conversation

@ashmaroli (Contributor) commented Sep 30, 2020

What problem is this PR intended to solve?

This pull request adds a GitHub Actions Workflow that profiles memory usage from using Nokogiri to parse and serialize HTML content.

The core of the workflow is split into two steps:

  • Parsing an entire HTML page with Nokogiri::HTML::Document and dumping it back into an HTML string.
  • Parsing just the <body> tag contents with Nokogiri::HTML::DocumentFragment and dumping it back into an HTML string.

(The profiling involves parsing and serializing the same input 1000 times.)

The source material for both steps comes from a static HTML file (in turn sourced from the current state of https://nokogiri.org/).
Having a static fixture ensures a consistent sample across multiple runs of the workflow.
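
For illustration, a memory_profiler-based script along these lines might look like the sketch below (this is not the exact script added by the PR; the fixture filename is an assumption):

```ruby
# profile_memory.rb -- illustrative sketch, not the workflow's actual script.
require "nokogiri"
require "memory_profiler"

ITERATIONS = 1000
html = File.read("nokogiri_org.html") # hypothetical static fixture
body = Nokogiri::HTML::Document.parse(html).at_css("body").inner_html

# Step 1: parse the entire page as a Document and serialize it back to HTML.
full_report = MemoryProfiler.report do
  ITERATIONS.times { Nokogiri::HTML::Document.parse(html).to_html }
end
full_report.pretty_print

# Step 2: parse only the <body> contents as a DocumentFragment and serialize.
fragment_report = MemoryProfiler.report do
  ITERATIONS.times { Nokogiri::HTML::DocumentFragment.parse(body).to_html }
end
fragment_report.pretty_print
```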

@codeclimate (bot) commented Sep 30, 2020

Code Climate has analyzed commit 2213ba1 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 94.3% (0.0% change).

View more on Code Climate.

@flavorjones (Member) commented

Please ignore the failed CI status; it was due to my config mistake related to #2089. This PR was green and should be considered green.

@ilyazub (Contributor) commented Nov 27, 2020

@ashmaroli How do we evaluate the memory profiling results? I mean, you wouldn't go checking CI logs on every commit and manually comparing memory usage between commits.

@ashmaroli (Contributor, Author) commented

you wouldn't go checking CI logs on every commit and manually comparing memory usage between commits.

@ilyazub Honestly, I was planning on doing just that — compare manually between commits of interest — say, the current master stats vs stats for a proposed memory optimization.

Theoretically, if there were a way to get the log for an action that ran on master, then one could just execute a script to extract the necessary numbers from that log, compare them with the numbers from an action run in a pull request of interest, and post the result as a comment on said pull request (much like how AppVeyorBot did on this PR).

@flavorjones (Member) commented

@ashmaroli Thanks for taking the time to put this together, and sorry for not commenting before now.

I think there's value in being able to look at data like this, but (as @ilyazub is implying) I worry that flagging memory leaks requires more rigor than just emitting a log file.

I'd ideally like to embed this approach in a CI test that emits pass/fail status. Can you think of a way to integrate this data better in the testing cycle?

I'm also curious about combining this approach with the long-defunct memory test suite (see #1603) and also incorporating testing with compaction and either valgrind or ASAN. Any thoughts about that?
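
One possible way to fold compaction into such a stress loop (a minimal sketch, assuming Ruby 2.7+ where GC.compact is available; the fixture path and iteration count are illustrative, and this is not the #1603 suite itself):

```ruby
require "nokogiri"

# Repeatedly parse/serialize and force compaction between iterations, so that
# any objects the C extension references incorrectly across a compaction (if
# any) surface as crashes or corruption rather than as silent memory growth.
html = File.read("nokogiri_org.html") # hypothetical static fixture
500.times do
  doc = Nokogiri::HTML::Document.parse(html)
  doc.to_html
  GC.compact
end
```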

@flavorjones added the meta/discussion and topic/memory labels on Nov 29, 2020
@ilyazub (Contributor) commented Nov 30, 2020

For the CI integration:

  1. The simplest thing is to check against fixed memory usage limits: fail if the script allocated more than allowed, pass otherwise.
  2. A more advanced option may be to store historical memory usage data in a file (e.g., JSON or YAML) and fail the CI job if the script used more than prev_run + N% of memory; see the sketch after this list. To store that file we could use build artifacts in Concourse CI (not sure), a separate repository with a single file (the deno.land approach), a GitHub gist, etc.
  3. Same as the second option, but use a server to write, read, and compare the historical memory usage data (similar to the Lighthouse CI Server approach). This seems like overkill.
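
A minimal sketch of what the check in option 2 could look like, assuming a hypothetical baseline.json artifact carried over from the previous run and a 5% allowance (the file names, fixture, and threshold are illustrative, not part of this PR):

```ruby
# check_memory_baseline.rb -- illustrative sketch of option 2.
require "json"
require "nokogiri"
require "memory_profiler"

ALLOWED_GROWTH = 0.05            # fail if allocations exceed the baseline by more than 5%
BASELINE_FILE  = "baseline.json" # hypothetical artifact from the previous run

html    = File.read("nokogiri_org.html") # hypothetical static fixture
report  = MemoryProfiler.report { 1000.times { Nokogiri::HTML::Document.parse(html).to_html } }
current = report.total_allocated_memsize

if File.exist?(BASELINE_FILE)
  baseline = JSON.parse(File.read(BASELINE_FILE)).fetch("total_allocated_memsize")
  limit    = baseline * (1 + ALLOWED_GROWTH)
  puts format("allocated %d bytes (baseline %d, limit %d)", current, baseline, limit)
  exit 1 if current > limit
end

# Persist the current numbers so the CI job can upload them as the next run's baseline.
File.write(BASELINE_FILE, JSON.generate("total_allocated_memsize" => current))
```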

Regarding #1603, we could store historical data for each test in test/test_memory_leak.rb (the second approach). The output of each test case could be stored in an array (like for test_leaking_namespace_node_strings). But I don't know what to do with the endless loops (e.g., test_leak_on_node_replace).

@flavorjones @ashmaroli What do you think?

@ashmaroli (Contributor, Author) commented

Thanks for the feedback on this idea, @flavorjones and @ilyazub.
So, the first step is to handle passing data across CI builds. I'm not familiar with Concourse CI, but if suggestion #2 from @ilyazub is implementable, that'd be nice to have.

Regarding increasing the rigor of the memory profiling, we may need to look into other options.
To my knowledge, memory_profiler only measures Ruby object allocations; it won't detect allocations made by native extensions.
Once we have a satisfactory way to pass data across builds, we could modify the profiling script to parse the artifact data, compare the numbers, and exit with an appropriate status.

Base automatically changed from master to main January 17, 2021 21:53
Labels: meta/discussion, topic/memory
Projects: none yet
Linked issues: none yet
4 participants