Introduction to data flow analysis in QL

Overview

Data flow analysis computes the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used. Many of Semmle’s built-in security queries implement data flow analysis, which can highlight the fate of potentially malicious or insecure data that can cause vulnerabilities in your code base. These queries help you understand if data is used in an insecure way, whether dangerous arguments are passed to functions, or whether sensitive data can leak. As well as highlighting potential security issues, you can also use data flow analysis to understand other aspects of how a program behaves, by finding, for example, uses of unititialised variables and resource leaks.

The following sections provide a brief introduction to data flow analysis in QL.

See the following tutorials for more information about analyzing data flow in specific languages:

Note

Data flow analysis is used extensively in path queries. To learn more about path queries, see Constructing path queries.

Analyzing data flow with QL

The QL data flow libraries implement data flow analysis on a program or function by modeling its data flow graph. The graph is computed using classes to model the program elements that represent the graph’s nodes. The flow of data between the nodes is modeled using predicates to compute the graph’s edges.

Computing an accurate and complete data flow graph presents several challenges:

  • It isn’t possible to compute data flow through standard library functions, where the source code is unavailable.
  • Some behavior isn’t determined until run time, which means that the data flow library must take extra steps to find potential call targets.
  • Aliasing between variables can result in a single write changing the value that multiple pointers point to.
  • The data flow graph can be very large and slow to compute.

To overcome these potential problems, two kinds of data flow are modeled in the QL libraries:

  • Local data flow, concerning the data flow within a single function. It is generally sufficiently fast, efficient and precise for many queries, and it is usually possible to compute the local data flow for all functions in a snapshot.
  • Global data flow, concerning the data flow within an entire program. Computing global data flow is typically more time and energy intensive than local data flow, therefore queries should be refined to look for more specific sources and sinks.

Many of the built-in queries included in the latest Semmle release contain examples of both local and global data flow analysis. See the built-in queries for details.

Normal data flow vs taint tracking

In the QL standard libraries, we make a distinction between ‘normal’ data flow and taint tracking. The normal data flow libraries are used to analyze the information flow in which data values are preserved at each step.

For example, if you are tracking an insecure object x (which might be some untrusted or potentially malicious data), a step in the program may ‘change’ its value. So, in a simple process such as y = x + 1, a normal data flow analysis will highlight the use of x, but not y. However, since y is derived from x, it is influenced by the untrusted or ‘tainted’ information, and therefore it is also tainted. Analyzing the flow of the taint from x to y is known as taint tracking.

In QL, taint tracking extends data flow analysis by including steps in which the data values are not necessarily preserved, but the potentially insecure object is still propagated. These flow steps are modeled in the taint-tracking library using predicates that hold if taint is propagated between nodes.

What next?

  • Search for DataFlow and TaintTracking in the QL standard libraries to learn more about the technical implementation of data flow analysis in QL for specific programming languages.
  • Visit Learning QL to find language-specific QL tutorials on data flow and other topics.