
feat(robot-server, api, performance-metrics): get performance metrics to work on robot #15008

Closed
wants to merge 23 commits into edge

Conversation

@DerekMaggio (Member) commented Apr 25, 2024

Overview

Will close https://opentrons.atlassian.net/browse/EXEC-413

Various fixes to get performance metrics working on the robot

Test Plan

Test Instructions

  1. Factory reset your robot with only the "runs and protocols" setting. (This clears any saved protocol analysis results.)
  2. Pull this branch.
  3. Go to the performance-metrics dir.
  4. Set an env variable for your host (it makes the following steps easier): `export host=<your-robot-ip>`
  5. Run the following to enable the performance-metrics feature flag: `make add-performance-metrics-ff`
  6. Push the new code to your robot: `make push-no-restart-ot3 && (cd ../api && make push-no-restart-ot3) && (cd ../robot_server && make push-ot3)`
  7. Once the robot finishes restarting, ssh into the robot with `(cd ../api && make term)`
  8. Go to the /data directory: `cd /data`
  9. Verify the performance_metrics_data directory exists: `ls -l`
  10. Go to the directory: `cd performance_metrics_data`
  11. Check that the directory is empty.
  12. From the app, load a protocol and go to start setup.
  13. Wait for the robot to finish analyzing the protocol.
  14. In performance_metrics_data there should be 2 files:
     1. robot_context_data
     2. robot_context_data_headers
  15. Check the content of the files: `cat robot_context_data` and `cat robot_context_data_headers`
  16. robot_context_data should contain a single line, something like `2, <a big number>, <another big number>`.
  17. robot_context_data_headers should contain a single line: `state_id,function_start_time,duration` (see the illustrative check after this list).
  18. Cancel the protocol.
  19. Create another run for the same protocol and wait for it to finish analyzing.
  20. Check the 2 files again. They should NOT have changed. (This is because we are loading the cached analysis rather than rerunning it.)
  21. Cancel the protocol.
  22. Create another run with a different protocol and wait for it to finish analyzing.
  23. Recheck the 2 files. robot_context_data_headers should be unchanged; robot_context_data should have another line, something like `2, <a big number>, <another big number>`.
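
For convenience, here is a small, illustrative check of the file shapes described in steps 16–17. It is not part of this PR; the directory and file names come from the test plan above, and the numeric values are simply whatever the tracker recorded.

```python
# Illustrative only (not part of this PR): sanity-check the two files the
# tracker writes, per the test plan above. Run on the robot with python3.
from pathlib import Path

DATA_DIR = Path("/data/performance_metrics_data")

headers = (DATA_DIR / "robot_context_data_headers").read_text().strip()
assert headers == "state_id,function_start_time,duration", f"unexpected headers: {headers}"

for line in (DATA_DIR / "robot_context_data").read_text().splitlines():
    # Each row should be three comma-separated integers; per the test plan,
    # the first column (state_id) is 2 for analysis rows and the other two
    # are "big numbers" (a start timestamp and a duration).
    state_id, start_time, duration = (int(field) for field in line.split(","))
    print(state_id, start_time, duration)
```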

TESTS

  • test on the OT-2
  • test on the Flex

Changelog

Fixes

These are a number of small issues I found while trying to get this working:

  • api/src/opentrons/util/performance_helpers.py
    • Calling tracker.store() inside track_analysis cannot work, because the underlying decorated function has not finished running when store() is called. Storage has to be deferred to RobotContextTracker; that is why the conditional store was removed and the STORE_EACH var is now passed to RobotContextTracker instead.
  • performance-metrics/src/performance_metrics/robot_context_tracker.py
    • There are a couple of things here:
    • At some point I removed asynchronous function support. No tests failed because the calls were still being timed, but the coroutine was never awaited, which made this painful to find. Support for async functions has been added back.
    • Added the aforementioned deferred storage. (The sketch after this list illustrates both the async fix and the deferred store.)
  • I also found that I was wrapping the incorrect call to analysis. This is why robot_server/protocols/protocol_analyzer is now wrapped.
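
To make the two robot_context_tracker fixes above concrete, here is a minimal sketch, assuming a simplified tracker class. It is not the actual RobotContextTracker code; the names (SketchTracker, state_id, _store_each) are illustrative. It shows (a) why the tracked coroutine must actually be awaited inside the wrapper, and (b) how storage is deferred to the tracker, either after every tracked call (store_each) or only when store() is called explicitly.

```python
# Illustrative sketch only; simplified from the ideas in
# performance-metrics/src/performance_metrics/robot_context_tracker.py.
import functools
import inspect
from time import perf_counter_ns


class SketchTracker:
    def __init__(self, should_track: bool, store_each: bool = False) -> None:
        self._should_track = should_track
        self._store_each = store_each
        self._rows: list[tuple[int, int, int]] = []

    def track(self, state_id: int):
        """Decorate a sync or async function; record (state_id, start, duration)."""

        def decorator(func):
            if not self._should_track:
                return func  # tracking disabled: return the function untouched

            if inspect.iscoroutinefunction(func):

                @functools.wraps(func)
                async def async_wrapper(*args, **kwargs):
                    start = perf_counter_ns()
                    try:
                        # Awaiting here is the fix: without it, only coroutine
                        # creation gets timed, not the actual work.
                        return await func(*args, **kwargs)
                    finally:
                        self._record(state_id, start)

                return async_wrapper

            @functools.wraps(func)
            def sync_wrapper(*args, **kwargs):
                start = perf_counter_ns()
                try:
                    return func(*args, **kwargs)
                finally:
                    self._record(state_id, start)

            return sync_wrapper

        return decorator

    def _record(self, state_id: int, start: int) -> None:
        # Runs only after the decorated function has finished, which is why
        # storage has to live here rather than in track_analysis itself.
        self._rows.append((state_id, start, perf_counter_ns() - start))
        if self._store_each:
            self.store()

    def store(self) -> None:
        # The real tracker writes rows under /data/performance_metrics_data;
        # this sketch just drops the in-memory buffer.
        self._rows.clear()
```

As a usage sketch, the analysis entry point would be decorated with something like `tracker.track(state_id=2)`; again, this only illustrates the pattern, not the real decorator name.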

Testing Changes

  • test_robot_context_tracker.py
    • Add testing for the store_each functionality
  • test_cli.py
    • Remove the context_tracker.store() case. This is outside the scope of the cli module.
  • test_protocol_analyzer.py
    • Add validation that analysis results are stored after each call to analyze.

Utility Changes

  • Pulled boilerplate Makefile content from other projects to use for building and pushing the package
  • Added a few utility targets to the Makefile to assist with debugging/testing

Review requests

  • Can someone verify this for me on the Flex?
  • Any other ideas for testing that performance-metrics doesn't affect other code when it is supposed to be disabled?
  • Would like at least a few reviews, specifically looking for edge cases where this could fail and halt robot execution.

Risk assessment

Medium to High: This is the first time performance-metrics is going to actually be able to interact with a robot and its file system.

@DerekMaggio DerekMaggio changed the base branch from edge to EXEC-408-add-storing-analysis-context-on-track April 25, 2024 20:29
@DerekMaggio DerekMaggio marked this pull request as ready for review April 26, 2024 15:43
@DerekMaggio DerekMaggio requested review from a team as code owners April 26, 2024 15:43
@DerekMaggio DerekMaggio requested a review from mjhuff April 26, 2024 15:46
@DerekMaggio DerekMaggio changed the title Getting performance metrics to work on robot feat(robot-server, api, performance-metrics): get performance metrics to work on robot Apr 26, 2024
@y3rsh (Collaborator) commented Apr 26, 2024

As this is high risk, I would like it to not merge to edge until 5/2, after the code freeze for 7.3.0.
If there is a counterargument that this will in some way enhance the 7.3.0 release, please let me know.

@DerekMaggio (Member Author) commented:

Nope, will mark do not merge.

@DerekMaggio DerekMaggio marked this pull request as draft April 26, 2024 15:53
@DerekMaggio DerekMaggio self-assigned this Apr 26, 2024
@DerekMaggio DerekMaggio added the DO NOT MERGE Indicates a PR should not be merged, even if there's a shiny green merge button available label Apr 26, 2024
@DerekMaggio DerekMaggio changed the base branch from EXEC-408-add-storing-analysis-context-on-track to edge April 26, 2024 16:01
@sfoster1 (Member) left a comment:

Okay, I see what we're going for here and accept that you want to change it soon, but I really think we should start as we mean to go on: just don't have store_each, and have the analysis tracker explicitly store once in a finally after the analysis ends. store_each would mean a disk hit after every tracked function call and totally kill perf, and I think it's equally clear to just have it at the end of the analyze call.
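
For reference, here is a minimal sketch of the alternative suggested above: store exactly once, in a finally, when the analysis ends. The function and parameter names are illustrative, not the actual protocol_analyzer API.

```python
# Illustrative only: one disk write per analysis instead of one per tracked call.
async def analyze_and_store(analyzer, protocol_resource, run_id, tracker) -> None:
    try:
        await analyzer.analyze(protocol_resource, run_id)
    finally:
        # Runs whether analysis succeeds or raises, so metrics are never lost
        # and there is no per-call disk hit.
        tracker.store()
```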

An excerpt of performance-metrics/Makefile (the target discussed in the comment below):

    http://$(host):31950/settings/log_level/local | jq

    .PHONY: add-performance-metrics-ff
    add-performance-metrics-ff:
A contributor commented:

Nitpick, feel free to ignore: I might call this set-performance-metrics-ff or enable-performance-metrics-ff.

Comment on lines +36 to +41
# Enable tracking for the RobotContextTracker
# This must come before the import of the analyze CLI
context_tracker = _get_robot_context_tracker()

# Ignore the type error for the next line, as we're setting a private attribute for testing purposes
context_tracker._should_track = True # type: ignore[attr-defined]
A contributor commented:

Okay, so as far as I can tell:

  • Our performance tracking code is global, for the same reason that logging is global: ease of access.
  • Our performance tracking code has import-time logic, for performance reasons. (To avoid having to check if running_on_robot: ... every time it's called.)
  • As we're beginning to see here, both of the above are going to make it very annoying and icky to test that the code under measurement calls the performance tracking code correctly.

To "fix" this, I would support just not having unit tests like this. We just won't check that the code under measurement emits metrics. By analogy, we're also okay with not checking that the code under test emits logs. It's not worthwhile to test because it's noncritical and also not error-prone. And we want it to be cheap and easy to move measurements around as we hone in on good targets.

@y3rsh y3rsh removed the DO NOT MERGE Indicates a PR should not be merged, even if there's a shiny green merge button available label May 31, 2024
@DerekMaggio (Member Author) commented:

Closing in favor of #15289.
