Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-559: use LookupAndDelete to read maps #283

Merged
merged 1 commit into from Mar 18, 2024

Conversation

jotak
Copy link
Member

@jotak jotak commented Feb 29, 2024

Description

  • 2-passes LookupAndDelete: first to get ids, then for atomic Lookup+Delete.
  • Keep legacy version for old kernels (<4.20)
  • The terrible performance of the previous implementation seems due to deleting while iterating

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Feb 29, 2024
Copy link

New image:
quay.io/netobserv/netobserv-ebpf-agent:c5b2f78

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=c5b2f78 make set-agent-image

@jotak
Copy link
Member Author

jotak commented Feb 29, 2024

my perf numbers look unsanely awesome I need to check with NDH
https://docs.google.com/spreadsheets/d/1qakBaK1dk_rERO30k1cSR4W-Nn0SXW4A3lqQ1sZC4rE/edit#gid=1192055209
image

Copy link

codecov bot commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 5.26316% with 108 lines in your changes are missing coverage. Please review.

Project coverage is 36.32%. Comparing base (58d01d9) to head (da2f5e5).

Files Patch % Lines
pkg/ebpf/tracer.go 0.00% 69 Missing ⚠️
pkg/ebpf/tracer_legacy.go 0.00% 36 Missing ⚠️
pkg/utils/utils.go 66.66% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #283      +/-   ##
==========================================
- Coverage   36.74%   36.32%   -0.42%     
==========================================
  Files          42       43       +1     
  Lines        3813     3879      +66     
==========================================
+ Hits         1401     1409       +8     
- Misses       2334     2391      +57     
- Partials       78       79       +1     
Flag Coverage Δ
unittests 36.32% <5.26%> (-0.42%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jotak
Copy link
Member Author

jotak commented Feb 29, 2024

@msherif1234 we really really really need to look at this. The perfs are insane, or I should say, our current version is really wrong, much more than I thought, with LookupAndDelete.
This PR must not be merged as is (flows could be missed between the lookup and the delete) but we need to find a solution to get this level of performance without affecting correctness

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 1, 2024
@jotak jotak changed the title Tweak LookupAndDelete - do not merge NETOBSERV-559: use LookupAndDelete to read maps Mar 1, 2024
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Mar 1, 2024

@jotak: This pull request references NETOBSERV-559 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

This is for performance testing only. Flows might be dropped due to not locking map between lookup and delete

Description

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Mar 1, 2024

@jotak: This pull request references NETOBSERV-559 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Description

  • 2-passes LookupAndDelete: first to get ids, then for atomic Lookup+Delete.
  • Keep legacy version for old kernels (<4.20)
  • The terrible performance of the previous implementation seems due to deleting while iterating

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Copy link
Member Author

jotak commented Mar 1, 2024

Another set of tests still shows much much improved performances: https://docs.google.com/spreadsheets/d/1qakBaK1dk_rERO30k1cSR4W-Nn0SXW4A3lqQ1sZC4rE/edit#gid=807334756

image

@netobserv netobserv deleted a comment from openshift-ci-robot Mar 1, 2024
@netobserv netobserv deleted a comment from openshift-ci-robot Mar 1, 2024
@netobserv netobserv deleted a comment from openshift-ci-robot Mar 1, 2024
@netobserv netobserv deleted a comment from openshift-ci-robot Mar 1, 2024
@jotak jotak requested a review from msherif1234 March 1, 2024 09:34
Comment on lines 833 to 842
ids = append(ids, id)
}

// Run the atomic Lookup+Delete; if new ids have been inserted in the meantime, they'll be fetched next time
for _, id = range ids {
if err := packetMap.LookupAndDelete(&id, &packet); err != nil {
log.WithError(err).WithField("packetID", id).Warnf("couldn't delete entry")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using BatchLookupAndDelete directly here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because of:
cilium/ebpf#1078
cilium/ebpf#1080

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes there's a follow-up for the batches: https://issues.redhat.com/browse/NETOBSERV-1550

// Run the atomic Lookup+Delete; if new ids have been inserted in the meantime, they'll be fetched next time
for _, id = range ids {
if err := flowMap.LookupAndDelete(&id, &metrics); err != nil {
log.WithError(err).WithField("flowId", id).Warnf("couldn't delete flow entry")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if u do it like this is bad ?

for iterator.Next(&id, &metrics) {
   if err := flowMap.LookupAndDelete(&id, &metrics); err != nil {
      log.WithError(err).WithField("flowId", id).Warnf("couldn't delete flow entry")
    }
    flows[id] = append(flows[id], metrics...)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried that one in particular but I think the same issue will come up as there's still a delete within the iteration. In the previous code, doing the Delete within the iteration resulted in screwed up keys ending up in iterating over the full 100K map entries

@msherif1234
Copy link
Contributor

Thanks @jotak changes looks good to me pls run with large file with sampling of 1 to be certain we don't miss any flows
/lgtm

@openshift-ci openshift-ci bot added lgtm and removed lgtm labels Mar 4, 2024
@msherif1234
Copy link
Contributor

/LGTM

@openshift-ci openshift-ci bot added the lgtm label Mar 4, 2024
@jotak
Copy link
Member Author

jotak commented Mar 4, 2024

@msherif1234 I addressed the feedback:

  • switching to legacy mode on error rather than based on kernel version (tested on a 4.12 cluster where I confirm it switches to legacy)
  • tested large file download: a ~600MB file, correctly reported:

Capture d’écran du 2024-03-04 16-26-19

Also I wanted to make sure that there was no dup id anymore in the map when iterating in LookupAndDelete, so this is confirmed with the new metrics where hasmap-unique equals hashmap-total
Capture d’écran du 2024-03-04 16-27-10

@msherif1234
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 4, 2024
@jotak
Copy link
Member Author

jotak commented Mar 4, 2024

/approve

@openshift-ci openshift-ci bot added the approved label Mar 4, 2024
@jotak jotak removed the approved label Mar 4, 2024
@msherif1234
Copy link
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 4, 2024
Copy link

github-actions bot commented Mar 4, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:94cfb8c

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=94cfb8c make set-agent-image

Keep legacy code for old kernels

Do not base support detection on kernel version

Instead, just try and fallback to legacy when relevant
Copy link

openshift-ci bot commented Mar 15, 2024

New changes are detected. LGTM label has been removed.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 15, 2024
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 15, 2024
@jotak
Copy link
Member Author

jotak commented Mar 15, 2024

(rebased)

Copy link

New image:
quay.io/netobserv/netobserv-ebpf-agent:a0238fe

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=a0238fe make set-agent-image

@jotak jotak removed the approved label Mar 15, 2024
@jotak
Copy link
Member Author

jotak commented Mar 18, 2024

/approve

Copy link

openshift-ci bot commented Mar 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jotak jotak merged commit 45926b9 into netobserv:main Mar 18, 2024
11 of 12 checks passed
@nathan-weinberg
Copy link

Note this was merged despite regression test failures without QE approval

@jotak jotak deleted the tweak-lad branch March 21, 2024 07:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved jira/valid-reference ok-to-test To set manually when a PR is safe to test. Triggers image build on PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants