Profile memory usage from processing HTML contents #2088

Open · wants to merge 1 commit into main

Conversation

@ashmaroli (Contributor) commented Sep 30, 2020

What problem is this PR intended to solve?

This pull request adds a GitHub Actions Workflow that profiles memory usage from using Nokogiri to parse and serialize HTML content.

The core of the workflow is split into two steps:

  • Parsing an entire HTML page with Nokogiri::HTML::Document and dumping it back into an HTML string.
  • Parsing just the <body> tag contents with Nokogiri::HTML::DocumentFragment and dumping it back into an HTML string.

(The profiling involves parsing and serializing the same input 1000 times.)

The source material for both steps comes from a static HTML file (in turn sourced from the current state of https://nokogiri.org/).
Having a static fixture ensures a consistent sample across multiple runs of the workflow.
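
For illustration, a memory_profiler-based script along these lines might look like the sketch below (this is not the exact script added by the PR; the fixture filename is an assumption):

```ruby
# profile_memory.rb -- illustrative sketch, not the workflow's actual script.
require "nokogiri"
require "memory_profiler"

ITERATIONS = 1000
html = File.read("nokogiri_org.html") # hypothetical static fixture
body = Nokogiri::HTML::Document.parse(html).at_css("body").inner_html

# Step 1: parse the entire page as a Document and serialize it back to HTML.
full_report = MemoryProfiler.report do
  ITERATIONS.times { Nokogiri::HTML::Document.parse(html).to_html }
end
full_report.pretty_print

# Step 2: parse only the <body> contents as a DocumentFragment and serialize.
fragment_report = MemoryProfiler.report do
  ITERATIONS.times { Nokogiri::HTML::DocumentFragment.parse(body).to_html }
end
fragment_report.pretty_print
```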

@codeclimate (bot) commented Sep 30, 2020

Code Climate has analyzed commit 2213ba1 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 94.3% (0.0% change).

View more on Code Climate.

@flavorjones (Member) commented

Please ignore the failed CI status; it was due to my config mistake related to #2089. This PR was green and should be considered green.

@ilyazub (Contributor) commented Nov 27, 2020

@ashmaroli How do we evaluate the memory profiling results? I mean, you wouldn't go checking CI logs on every commit and manually comparing memory usage between commits.

@ashmaroli (Contributor, Author) commented

you wouldn't go checking CI logs on every commit and manually comparing memory usage between commits.

@ilyazub Honestly, I was planning on doing just that — compare manually between commits of interest — say, the current master stats vs stats for a proposed memory optimization.

Theoretically, if there were a way to get the log for an action that ran on master, then one could just execute a script to extract the necessary numbers from that log, compare them with the numbers from an action run in a pull request of interest, and post the result as a comment on said pull request (much like how AppVeyorBot did on this PR).

@flavorjones (Member) commented

@ashmaroli Thanks for taking the time to put this together, and sorry for not commenting before now.

I think there's value in being able to look at data like this, but (as @ilyazub is implying) I worry that flagging memory leaks requires more rigor than just emitting a log file.

I'd ideally like to embed this approach in a CI test that emits pass/fail status. Can you think of a way to integrate this data better in the testing cycle?

I'm also curious about combining this approach with the long-defunct memory test suite (see #1603) and also incorporating testing with compaction and either valgrind or ASAN. Any thoughts about that?
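
One possible way to fold compaction into such a stress loop (a minimal sketch, assuming Ruby 2.7+ where GC.compact is available; the fixture path and iteration count are illustrative, and this is not the #1603 suite itself):

```ruby
require "nokogiri"

# Repeatedly parse/serialize and force compaction between iterations, so that
# any objects the C extension references incorrectly across a compaction (if
# any) surface as crashes or corruption rather than as silent memory growth.
html = File.read("nokogiri_org.html") # hypothetical static fixture
500.times do
  doc = Nokogiri::HTML::Document.parse(html)
  doc.to_html
  GC.compact
end
```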

@flavorjones added the meta/discussion and topic/memory labels on Nov 29, 2020
@ilyazub (Contributor) commented Nov 30, 2020

For the CI integration:

  1. The simplest thing is to check against fixed memory usage limits: fail if the script allocated more than allowed, pass otherwise.
  2. A more advanced option may be to store historical memory usage data in a file (e.g., JSON or YAML) and fail the CI job if the script used more than prev_run + N% of memory; see the sketch after this list. To store that file we could use build artifacts in Concourse CI (not sure), a separate repository with a single file (the deno.land approach), a GitHub gist, etc.
  3. Same as the second option, but use a server to write, read, and compare the historical memory usage data (similar to the Lighthouse CI Server approach). This seems like overkill.
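
A minimal sketch of what the check in option 2 could look like, assuming a hypothetical baseline.json artifact carried over from the previous run and a 5% allowance (the file names, fixture, and threshold are illustrative, not part of this PR):

```ruby
# check_memory_baseline.rb -- illustrative sketch of option 2.
require "json"
require "nokogiri"
require "memory_profiler"

ALLOWED_GROWTH = 0.05            # fail if allocations exceed the baseline by more than 5%
BASELINE_FILE  = "baseline.json" # hypothetical artifact from the previous run

html    = File.read("nokogiri_org.html") # hypothetical static fixture
report  = MemoryProfiler.report { 1000.times { Nokogiri::HTML::Document.parse(html).to_html } }
current = report.total_allocated_memsize

if File.exist?(BASELINE_FILE)
  baseline = JSON.parse(File.read(BASELINE_FILE)).fetch("total_allocated_memsize")
  limit    = baseline * (1 + ALLOWED_GROWTH)
  puts format("allocated %d bytes (baseline %d, limit %d)", current, baseline, limit)
  exit 1 if current > limit
end

# Persist the current numbers so the CI job can upload them as the next run's baseline.
File.write(BASELINE_FILE, JSON.generate("total_allocated_memsize" => current))
```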

Regarding #1603, we could store historical data for each test in test/test_memory_leak.rb (the second approach). The output of each test case could be stored in an array (like for test_leaking_namespace_node_strings). But I don't know what to do with the endless loops (e.g., test_leak_on_node_replace).

@flavorjones @ashmaroli What do you think?

@ashmaroli (Contributor, Author) commented

Thanks for the feedback on this idea, @flavorjones and @ilyazub.
So, the first step is to handle passing data across CI builds. I'm not familiar with Concourse CI, but if suggestion #2 from @ilyazub is implementable, that'd be nice to have.

Regarding increasing the rigor of the memory profiling, we may need to look into other options.
To my knowledge, memory_profiler only measures Ruby object allocations; it won't detect allocations made by native extensions.
Once we have a satisfactory way to pass data across builds, we could modify the profiling script to parse the artifact data, compare the numbers, and exit with an appropriate status.

Base automatically changed from master to main January 17, 2021 21:53
Labels: meta/discussion, topic/memory
Projects: none yet
Linked issues: none yet
4 participants