Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: Implement models-as-data #15371

Merged
merged 147 commits into from
Apr 16, 2024
Merged

C++: Implement models-as-data #15371

merged 147 commits into from
Apr 16, 2024

Conversation

geoffw0
Copy link
Contributor

@geoffw0 geoffw0 commented Jan 18, 2024

Implement models-as-data for C++. That is, support for CSV formatted flow sources, sinks and summaries that look something like this:

    ";;false;getc;;;ReturnValue;remote",

The implementation is ported from Swift, and uses the shared MAD library to do the heavy lifting. I've created a range of "synthetic" tests (that is, tests that use models defined in the tests), and also created "real" sources for getc and friends (which have corresponding "real" tests).

This is currently a draft PR, and is missing most of the results it should find in tests. Things I need to do:

  • address TODO: fix this, there's no good reason for it..
  • finish implementation of encodeContent.
  • get sinks working
    • and add / convert a query sink to MAD?
  • get summaries working
    • and add / convert a summary model to MAD?
  • get indirection working
    • syntax TBD
  • address any other missing test cases / features?
    • union content --- created follow-up issue
    • post-update nodes? --- created follow-up issue
    • indirect global variables --- created follow-up issue
    • (all other missing test cases have follow-up issues now as well)
  • check performance on DCA
    • there are a couple of “pre-requisites” we expect to need to fix here
  • add change note(s)

@MathiasVP I would appreciate an early review and constructive feedback, when you're available.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments. Looking good so far! I think starting simply with sources and sinks was a great idea as it avoids you having to deal with a large part of the dataflow library right away!

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 23, 2024

Thanks for the early review @MathiasVP . It's going to take me a bit of time to address everything, along with the known issues in this PR.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Jan 24, 2024

I’ve fixed TInterpretNode by removing the TDataFlowCall_ case - which is redundant as the Element case can already represent a call, and that ambiguity was causing stuff to get lost. Fixed up asCall and getCalltarget in that class. We get more results now, and we don’t need the bodge in shared/dataflow any more.

We do, however, introduce a new bodge in TReturnKind, which I believe will is a symptom of not having addressed the ‘pre-requisite’ tasks yet, so I’ll be addressing that next.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Feb 1, 2024

Just added four commits. Broadly:

  • incorporated changes by Mathias which make dataflow ArgumentPosition/ParameterPosition and ReturnKind no longer depend on Node. This was referred to in the PR description above as "pre-requisites", and we expect to need it here soon.
  • fixed the issue with missing TReturnKinds in one of the tests.
  • tried to document a few things better.

@MathiasVP I'd like your opinion on the last commit in particular, which modifies TReturnKind again. I think this is what I want but I'm not entirely confident with the predicates it's built atop, nor whether using Function.getUnspecifiedType() (even via Ssa::Function) is OK in data flow. If it's not, I suppose I could create an SSA wrapper / abstraction for what I need?

Comment on lines 445 to 449
indirectionIndex =
[0 .. max(Ssa::Function f |
|
Ssa::getMaxIndirectionsForType(f.getUnspecifiedType()) - 1 // -1 because a returned value is a prvalue not a glvalue
)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So instead of relying on the presence of a return value instruction we're now relying on the function's return type? I can see how that also works, but I have some concerns:

  • This makes dataflow more dependent on the surface syntax in a way that it wasn't before. I'd prefer to keep dataflow completely unaware of the C/C++ AST (and with a very limited set of cases it is) since it means we can freely change AST classes in the future, and as long as the IR stays correct dataflow will do the right thing.
    An example of where this may bite us: In the IR we synthesize function bodies for some things that are not Functions. In particular, we do so for the initializer of a global variable and for static local variables.

  • Eventually, I'd like to make the construction of indirection nodes smarter to handle things like:

    unsigned long get_data_as_long() {
      char* my_data = source(); // *my_data is tainted
      unsigned long my_data_as_an_unsigned_long = (unsigned long)my_data; // *my_data_as_an_unsigned_long should be tainted, but unsigned longs have no indirection nodes. So flow is lost!
      return my_data_as_an_unsigned_long;
    }
    
    unsigned long data = my_data_as_an_unsigned_long();
    char* my_data2 = (char*)data;
    sink(my_data2);

    The solution to the above will be to be smarter about how many dataflow nodes we allocate (see this internal link for more information on this).

    And since the current logic in the TNormalReturnKind branch only talks about the presence of an indirection operand that logic will continue to work once allocate indirection nodes in a smarter way. In contrast, the change here hardwires that we create return kinds based on the return type of a function.

Instead, I think a better solution would be:

  • Keep the existing logic
  • Replace the indirectionIndex = 0 // TODO: very much a bodge so that it works on the test that has no return statements disjunct with a call into the flow summary library (i.e., the library you're filling in in this PR) that finds the set of indirections necessary for the modelled functions that exists as MaD rows. That is, we should have a predicate in FlowSummaryImpl that reads something like:
/**
 * Gets the maximum number of indirections that can be returned by a the function
 * modelled using the MaD row `package;type;subtypes;name;signature;ext`.
 */
int maxIndirectionForModelledFunction(string package, string type, boolean subtypes, string name, string signature, string ext) {
  exists(interpretElement(package, type, subtypes, name, signature, ext)) and
  result = /* Extract the number of stars in the return type specified by signature */
}

and then we could call this predicate instead of hardcoding the indirectionIndex = 0 case. i.e., we'd do something like:

indirectionIndex = maxIndirectionForModelledFunction(_, _, _, _, _, _)

this would ensure that we have ReturnKinds for all the MaD rows we have, and we keep the nice property that we don't depend on the C++ surface AST for most of dataflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, though notably the original code doesn't even work any more - I'm trying to debug a monotonic recursion error with that, then wire up the new case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's odd. It worked after 2bea0ad, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, I've rebased this PR a couple of times so the history is quite confusing.

The issue appears to be that TNormalReturnKind depends on DataFlow::Node in order to find all the return kinds / indirection levels; and DataFlow::Node now includes TFlowSummaryNode, i.e. depends on the flow summary library, and that of course depends on return kinds. Theoretically we can break the loop by having TNormalReturnKind only depend on the non-summary DataFlow::Nodes (and having a separate check of the flow summaries themselves as you suggest) ... but simply adding not return instanceof TFlowSummaryNode in TNormalReturnKind does not actually break the dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I'm coming back to the idea that return kinds should be lower level than data flow nodes, and thus should not depend on them. This way its easy to reason about them and it doesn't produce any dependency cycles. If we do smarter things with indirection nodes in the future, we'll just have to extend the ReturnKind type manually as part of that process.

The above means leaving the code more-or-less how it currently is in this PR. It might be worth adding the maxIndirectionForModelledFunction thing as well. Is there actually any alternative design?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but simply adding not return instanceof TFlowSummaryNode in TNormalReturnKind does not actually break the dependency.

Indeed, that won't break any dependencies.

I'll pull your PR down locally and see if I can see what's wrong. The changes we did very early in this PR should've prevented exactly this non-monotonic recursion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you've just gotten a bit lost in the commit history 😂 Doing the changes in 2bea0ad solves the non-monotonic recursion problem. When you said:

Makes sense, though notably the original code doesn't even work any more - I'm trying to debug a monotonic recursion error with that, then wire up the new case.

I guess you mean that the code on main doesn't work any more. That's indeed expected - and that's why we did 2bea0ad. Reintroducing those changes fixed the non-motonic recursion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I'm coming back to the idea that return kinds should be lower level than data flow nodes, and thus should not depend on them. This way its easy to reason about them and it doesn't produce any dependency cycles.

I totally agree. That's what 2bea0ad did. It ensured that ReturnKinds didn't depend on dataflow nodes, but only on SSA.

Copy link
Contributor

@MathiasVP MathiasVP Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed two commits now:

  • 2159256 reintroduces the changes from 2bea0ad (This is now slightly simpler because we don't need the parameterIsRedefined predicate since it was deleted from the code on main 🎉). The diff is slightly longer than intended because I autoformatted the code by mistake as I was writing it 😅. The only change is the one in TReturnKind, I promise!
  • 448a901 adds a predicate to compute the indirections necessary from all the MaD models like I described in C++: Implement models-as-data #15371 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(following 1:1 meeting and a few more commits) I'm not cautiously optimistic that this issue is resolved.

@geoffw0
Copy link
Contributor Author

geoffw0 commented Apr 10, 2024

I've just added the change notes and fixed the other minor issues. This PR is now ready for review. I will shortly begin a second DCA run.

@geoffw0 geoffw0 marked this pull request as ready for review April 10, 2024 14:21
@geoffw0 geoffw0 requested review from a team as code owners April 10, 2024 14:21
Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going through all the changes one last time. There are some small nits here and there, so I may add more comments throughout today

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final comment, but otherwise I think we're good to go!

@geoffw0
Copy link
Contributor Author

geoffw0 commented Apr 11, 2024

The second DCA run:

  • was technically a failure because two projects failed to build; caused by problems outside of this PR.
  • overall analysis time is 2.8% slower, a little more than 0.8% on the first run; probably a mix of wobble and genuinely doing more work.
  • string cache size increase of 37.4% (same as the first run). We agreed on the previous run that it probably comes from the shared library and was not too concerning.
  • 4 new query results (same as the first run, all good results).

So there's probably a small performance loss, not unacceptable I'd say, but if there are any leads for possible improvements I would like to follow them. We may well be better off doing general performance work in future rather than pursuing tiny gains here.


I also ran a Swift DCA experiment, which was uneventful, unsurprisingly as the Swift changes in this PR are very minor (they just mirror CPP changes in a few places, mostly comments).

@geoffw0
Copy link
Contributor Author

geoffw0 commented Apr 12, 2024

I've fixed the CI issues (hopefully) and the merge, though alert provenance is not properly computed (I've created another follow-up issue to cover that).

@geoffw0
Copy link
Contributor Author

geoffw0 commented Apr 15, 2024

Ready for final approval + merge.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's Get This Merged 🎉

@MathiasVP MathiasVP merged commit 2627a3d into github:main Apr 16, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants