Introduction to data flow

Finding SPARQL injection vulnerabilities in Java

Setup

For this example you should download:

Note

For this example, we will be analyzing VIVO Vitro.

You can also query the project in the query console on LGTM.com.

Note that results generated in the query console are likely to differ to those generated in the QL plugin as LGTM.com analyzes the most recent revisions of each project that has been added–the CodeQL database available to download above is based on an historical version of the codebase.

Agenda

  • SPARQL injection
  • Data flow
  • Modules and libraries
  • Local data flow
  • Local taint tracking

Motivation

SPARQL is a language for querying key-value databases in RDF format, which can suffer from SQL injection-like vulnerabilities:

sparqlAskQuery("ASK { <" + individualURI + "> ?p ?o }")

individualURI is provided by a user, allowing an attacker to prematurely close the >, and provide additional content.

Goal: Find query strings that are created by concatenation.

Note

If you have completed the “Example: Query injection” slide deck which was part of the previous course, this example will look familiar to you.

To understand the scope of this vulnerability, consider what would happen if a malicious user could provide the following as the content of the individualURI variable:

“http://vivoweb.org/ontology/core#FacultyMember> ?p ?o . FILTER regex("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!", "(.*a){50}") } #

Example: SPARQL injection

We can write a simple query that finds string concatenations that occur in calls to SPARQL query APIs.

import java

class StringConcat extends AddExpr {
  StringConcat() { getType() instanceof TypeString }
}

from MethodAccess ma
where
  ma.getMethod().getName().matches("sparql%Query") and
  ma.getArgument(0) instanceof StringConcat
select ma, "SPARQL query vulnerable to injection."

Note

This is similar, but not identical, to the formulation we had in the previous training deck. It has been rewritten to make it easier for the next step.

Success! But also missing results…

Query finds a CVE reported by Semmle (CVE-2019-6986), plus one other result, but misses other opportunities where:

  • String concatenation occurs on a different line in the same method.
  • String concatenation occurs in a different method.
  • String concatenation occurs through StringBuilders or similar.
  • Entirety of user input is provided as the query.

We want to improve our query to catch more of these cases.

Note

For more details of the CVE, see: https://github.com/Semmle/SecurityExploits/tree/master/vivo-project/CVE-2019-6986

As an example, consider this SPARQL query call:

String queryString = "ASK { <" + individualURI + "> ?p ?o }";
 sparqlAskQuery(queryString);

Here the concatenation occurs before the call, so the existing query would miss this - the string concatenation does not occur directly as the first argument of the call.

Data flow analysis

  • Models flow of data through the program.
  • Implemented in the module semmle.code.<lang>.dataflow.DataFlow.
  • Class DataFlow::Node represents program elements that have a value, such as expressions and function parameters.
    • Nodes of the data flow graph.
  • Various predicates represent flow between these nodes.
    • Edges of the data flow graph.

Note

The solution here is to use data flow. Data flow is, as the name suggests, about tracking the flow of data through the program. It helps answers questions like: does this expression ever hold a value that originates from a particular other place in the program?

We can visualize the data flow problem as one of finding paths through a directed graph, where the nodes of the graph are elements in program, and the edges represent the flow of data between those elements. If a path exists, then the data flows between those two edges.

Data flow graphs

Example:

int func(int tainted) {
   int x = tainted;
   if (someCondition) {
     int y = x;
     callFoo(y);
   } else {
     return x;
   }
   return -1;
}

Data flow graph:

digraph {
   graph [ dpi = 1000 ]
   node [shape=polygon,sides=4,color=blue4,style="filled,rounded",   fontname=consolas,fontcolor=white]
   a [label=<tainted<BR /><FONT POINT-SIZE="10">ParameterNode</FONT>>]
   b [label=<tainted<BR /><FONT POINT-SIZE="10">ExprNode</FONT>>]
   c [label=<x<BR /><FONT POINT-SIZE="10">ExprNode</FONT>>]
   d [label=<x<BR /><FONT POINT-SIZE="10">ExprNode</FONT>>]
   e [label=<y<BR /><FONT POINT-SIZE="10">ExprNode</FONT>>]

   a -> b
   b -> {c, d}
   c -> e

}

Local vs global data flow

  • Local (“intra-procedural”) data flow models flow within one function; feasible to compute for all functions in a CodeQL database
  • Global (“inter-procedural”) data flow models flow across function calls; not feasible to compute for all functions in a CodeQL database
  • Different APIs, so discussed separately
  • This slide deck focuses on the former

Note

For further information, see:

Local data flow

Importing data flow

To use the data flow library, add the following import:

import semmle.code.<language>.dataflow.DataFlow

Note: this library contains an explicit “module” declaration:

module DataFlow {
  class Node extends ... { ... }
  predicate localFlow(Node source, Node sink) {
            localFlowStep*(source, sink)
         }
  ...
}

So all references will need to be qualified (that is, DataFlow::Node)

Note

A query library is file with the extension .qll. Query libraries do not contain a query clause, but may contain modules, classes, and predicates. For further information on the data flow libraries, see the following links:

A module is a way of organizing QL code by grouping together related predicates, classes, and (sub-)modules. They can be either explicitly declared or implicit. A query library implicitly declares a module with the same name as the QLL file.

For further information on libraries and modules in QL, see the chapter on Modules in the QL language handbook. For further information on importing QL libraries and modules, see the chapter on Name resolution in the QL language handbook.

Data flow graph

  • Class DataFlow::Node represents data flow graph nodes
  • Predicate DataFlow::localFlowStep represents local data flow graph edges, DataFlow::localFlow is its transitive closure
  • Data flow graph nodes are not AST nodes, but they correspond to AST nodes, and there are predicates for mapping between them:
    • Expr Node.asExpr()
    • Parameter Node.asParameter()
    • DataFlow::Node DataFlow::exprNode(Expr e)
    • DataFlow::Node DataFlow::parameterNode(Parameter p)
    • etc.

Note

The DataFlow::Node class is shared between both the local and global data flow graphs–the primary difference is the edges, which in the “global” case can link different functions.

localFlowStep is the “single step” flow relation–that is, it describes single edges in the local data flow graph. localFlow represents the transitive closure of this relation–in other words, it contains every pair of nodes where the second node is reachable from the first in the data flow graph.

The data flow graph is separate from the AST, to allow for flexibility in how data flow is modeled. There are a small number of data flow node types–expression nodes, parameter nodes, uninitialized variable nodes, and definition by reference nodes. Each node provides mapping functions to and from the relevant AST (for example Expr, Parameter etc.) or symbol table (for example Variable) classes.

Taint tracking

  • Usually, we want to generalise slightly by not only considering plain data flow, but also “taint” propagation, that is, whether a value is influenced by or derived from another.

  • Examples:

    sink = source;        // source -> sink: data and taint
    strcat(sink, source); // source -> sink: taint, not data
    
  • Library semmle.code.<language>.dataflow.TaintTracking provides predicates for tracking taint; TaintTracking::localTaintStep represents one (local) taint step, TaintTracking::localTaint is its transitive closure.

Note

Taint tracking can be thought of as another type of data flow graph. It usually extends the standard data flow graph for a problem by adding edges between nodes where one one node influences or taints another.

The taint-tracking API is almost identical to that of the local data flow. All we need to do to switch to taint tracking is import semmle.code.<language>.dataflow.TaintTracking instead of semmle.code.<language>.dataflow.DataFlow, and instead of using localFlow, we use localTaint.

Exercise: revisiting SPARQL injection

Refine the query to find string concatenation that occurs in the same method, but a different line.

Hint: Use DataFlow::localFlow to assert that the result flows to the SPARQL call argument, using DataFlow::exprNode to get the data flow nodes for the relevant expression nodes.

import java
import semmle.code.java.dataflow.DataFlow::DataFlow

from MethodAccess ma, StringConcat stringConcat
where
  ma.getMethod().getName().matches("sparql%Query") and
  localFlow(exprNode(stringConcat), exprNode(ma.getArgument(0)))
select ma, "SPARQL query vulnerable to injection."

Refinements (take home exercise)

In Java, strings are often created using StringBuilder and StringBuffer classes. For example:

StringBuilder queryBuilder = new StringBuilder();
queryBuilder.add("ASK { <");
queryBuilder.add(individualURI);
queryBuilder.add("> ?p ?o }");
sparqlAskQuery(queryBuilder);

Exercise: Refine the query to consider strings created from StringBuilder and StringBuffer classes as sources of concatenation.

Beyond local data flow

  • We are still missing possible results.
    • Concatenation that occurs outside the enclosing method.
  • Instead, let’s turn the problem around and find user-controlled data that flows into a printf format argument, potentially through calls.
  • This needs global data flow.