Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Customizing string representation of data flow nodes in SARIF or CSV results for Taint Tracking #16143

Open
saikatG opened this issue Apr 7, 2024 · 10 comments
Labels
question Further information is requested

Comments

@saikatG
Copy link

saikatG commented Apr 7, 2024

Hi,

Currently, when running a cwe query such as TaintedPath (cwe 22) on a java project, I retrieve the CodeFlow for each result in the SARIF files as shown below. Is there an easy way to customize the string representation of nodes in the output, such as, in the message["text"] part for the node? For instance, for a method call I would like the string to have the format "package:class:methodname" instead of "methodname(...)" -- which is the default. Would I need to override the data flow node for this?

Thanks in advance!

...
 "codeFlows" : [ {
        "threadFlows" : [ {
          "locations" : [ {
            "location" : {
              "physicalLocation" : {
                "artifactLocation" : {
                  "uri" : "one-java-agent/src/main/java/com/alibaba/oneagent/AgentImpl.java",
                  "uriBaseId" : "%SRCROOT%",
                  "index" : 0
                },
                "region" : {
                  "startLine" : 81,
                  "startColumn" : 35,
                  "endColumn" : 69
                }
              },
              "message" : {
                "text" : "getFile(...) : String"
              }
            }
          },
....
@saikatG saikatG added the question Further information is requested label Apr 7, 2024
@mbg
Copy link
Member

mbg commented Apr 8, 2024

Hi @saikatG 👋

I am not sure there is a super easy way of doing that. Overriding the toString predicate may be possible here to make this work. However, since you have the location information and, presumably, also have the source code, you could retrieve the information you are after directly from the source code with the help of a Java parser. This is probably the easiest.

@saikatG
Copy link
Author

saikatG commented Apr 9, 2024

Thanks @mbg. Overriding the toString of DataFlow Node works well!

@saikatG saikatG closed this as completed Apr 9, 2024
@saikatG
Copy link
Author

saikatG commented Apr 15, 2024

Hi CodeQL devs, re-opening this issue because I ran into a problem.

I tried something like this:

bindingset[n]
string getFullName(DataFlow::Node n){
   
  result = (n.asExpr().(Call)).getCallee().getDeclaringType().getQualifiedName() + "." + (n.asExpr().(Call)).getCallee().getName()
  or 
  result = (n.asExpr().(Argument)).getCall().getCallee().getDeclaringType().getAnAncestor().getQualifiedName() + "." +  (n.asExpr().(Argument)).getCall().getCallee().getName() + "::arg::" + (n.asExpr().(Argument)).getPosition()
  or
  result = (n.(DataFlow::ParameterNode).asParameter().getCallable().getDeclaringType().getQualifiedName() + "." + (n.(DataFlow::ParameterNode).asParameter().getCallable().getDeclaringType().getName()) + "::par::" + (n.(DataFlow::ParameterNode).asParameter().getPosition())) 
  or
  result = n.toString()
}
 class MyNode extends DataFlow::Node {
   override string toString() {
     result = getFullName(this)
   }
 }

It works well, but its turning out to be too expensive to compute in some cases. Any suggestions on how to optimize this?

@saikatG saikatG reopened this Apr 15, 2024
@mbg
Copy link
Member

mbg commented Apr 16, 2024

its turning out to be too expensive to compute in some cases

Could you provide more information? How does this manifest for you (long analysis time / out of memory errors)? How big is the Java project you are running this against?

@saikatG
Copy link
Author

saikatG commented Apr 16, 2024

So codeql just takes a very long time to generate the csv/sarif files after the "Interpreting results" stage. I waited for 10-15 mins and then killed it. Ofc, it finishes fast w/o the string conversion code above. The project has 97k lines of code. I am running the XSS query, but that shouldn't be an important factor i think?

Running queries.
Compiling query plan for .../XSS.ql.
[1/1 comp 33.6s] Compiled .../XSS.ql.
Starting evaluation of .../XSS.ql.
[1/1 eval 10s] Evaluation done; writing results to .../XSS.bqrs.
Shutting down query evaluator.
Interpreting results.

@mbg
Copy link
Member

mbg commented Apr 16, 2024

I assume you're running the CodeQL CLI directly on the command-line. What is the command you used and which version of CodeQL are you using?

I tried running the XSS query with your additions on the database for the repository you linked to and this worked fine for me using the VSCode extension (although I note that there were no results).

@saikatG
Copy link
Author

saikatG commented Apr 16, 2024

Sorry about the missing details.

Yes, I am using the codeql cli 2.15.3. I use database analyze --rerun [db_dir] --format=sarif-latest --output=results.sarif [query_path]

I am using a custom version of the XSS query that uses my own list of 100 sources and 38 sinks. I think this as well may be causing the difference in execution. So, I have a MySources.qll and MySinks.qll file that looks like this:

predicate isMySourceMethod(Method m)
{
(
    m.getName() = "get" and
    m.getDeclaringType().getSourceDeclaration().hasQualifiedName("java.util", "Map<String,String>")
)
or
(
    m.getName() = "put" and
    m.getDeclaringType().getSourceDeclaration().hasQualifiedName("java.util", "Map<String,String>")
)
or 
...

and I have a MyXSSQuery.qll file that uses it

module MyXssConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) { 
    isMySourceMethod(source.asExpr().(MethodCall).getMethod()) 
  }
...

(Same for sinks)

I can't share the full files right now. But I hope this gives you an idea. Do you think this may be causing the slowdown? Interestingly, this still works ok when I don't use "toString". Note that I put the "toString" code in this MyXSSQuery.qll file and not the XSS.ql file.

@mbg
Copy link
Member

mbg commented Apr 16, 2024

I am using the codeql cli 2.15.3.

This is a fairly old version. It may be worth upgrading to the latest version to see if that resolves the issue for you. (We release new versions every two weeks.)

I am using a custom version of the XSS query that uses my own list of 100 sources and 38 sinks. I think this as well may be causing the difference in execution.

This is very much possible of course. You could try to run the XSS query without your custom sources and sinks, but still with the overriden toString predicate, and see if that works OK. If that does work, I'd introduce a source/sink that you know will lead to a result being found and try it with just those.

Let me know what the outcomes of those experiments are and we can go from there.

@saikatG
Copy link
Author

saikatG commented Apr 16, 2024

Thanks a lot @mbg! I will try it out.

While we are on this, another question for you: Is there an easy way to specify summaries with a custom tag? For instance, i see yml files in ext folder that contain specs for many libraries, but they only have "taint" or "value" as tags. Is it possible to use my own custom tag and only consider them as summaries during execution? I see that it can be done for sources and sinks with isSinkNode(node, tag) and isSourceNode(node, tag) predicates. Anything similar for summaries?

@mbg
Copy link
Member

mbg commented Apr 17, 2024

While we are on this, another question for you: Is there an easy way to specify summaries with a custom tag? For instance, i see yml files in ext folder that contain specs for many libraries, but they only have "taint" or "value" as tags. Is it possible to use my own custom tag and only consider them as summaries during execution? I see that it can be done for sources and sinks with isSinkNode(node, tag) and isSourceNode(node, tag) predicates. Anything similar for summaries?

There is some documentation in the code which comments on this:

  1. The kind column is a tag that can be referenced from QL to determine to
    which classes the interpreted elements should be added. For example, for
    sources "remote" indicates a default remote flow source, and for summaries
    "taint" indicates a default additional taint step and "value" indicates a
    globally applicable value-preserving step. For neutrals the kind can be summary,
    source or sink to indicate that the neutral is neutral with respect to
    flow (no summary), source (is not a source) or sink (is not a sink).

So in other words, taint and value have special meanings that affect how the models are used.

What would you hope to accomplish with a custom tag?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants