Introduction to variant analysis

CodeQL for Java

Setup

For this example you should download:

Note

For this example, we will be analyzing Apache Struts.

You can also query the project in the query console on LGTM.com.

Note that results generated in the query console are likely to differ to those generated in the QL plugin as LGTM.com analyzes the most recent revisions of each project that has been added–the CodeQL database available to download above is based on an historical version of the codebase.

A few years ago, somewhere between Earth and Mars…

Crashing bug found in Curiosity’s landing module through routine testing.

Patching is still possible mid-flight, but what if there are more such issues?

../_images/curiosity21.png

Note

When the Curiosity Rover was on its way to Mars in 2012, a flight software developer at NASA JPL discovered a mission-critical bug through manual code review. The problem occurred in Curiosity’s Entry, Descent, and Landing software–the software responsible for the Rover’s descent through the Martian atmosphere and landing it safely on the surface. of Mars.

The bug, which had gone undetected by traditional solutions, was likely to prevent the capsule’s parachutes from opening, resulting in the Rover crashing onto the red planet’s rocky surface.

Zoom in on the code…

(For illustration only, not actually NASA code!)

void fire_thrusters(double vectors[12]) {
  for (int i = 0; i < 12 i++) {
    ... vectors[i] ...
  }
}
double thruster[3] = ... ;
fire_thrusters(thruster);
  • In C, array types of parameters degrade to pointer types.
  • The size is ignored!
  • No protection from passing a mismatched array.

Note

The problem stems from a peculiarity of the C language. You can declare parameters to have array types (with stated sizes), but that means nothing: the parameter type “degrades” to a raw pointer type with no size information.

The pseudocode in the slide illustrates this. The function is declared to take an array of length 12 (presumably three data points for each thruster). However, there’s no sanity checking, and a developer might call it with an array that’s too short, holding direction information for only one of the thrusters. The function will then read past the end of the array, and unpredictable results occur.

Write a query…

…to find all instances of the problem.

Complete text of the analysis (nothing left out!):

import cpp

from Function f, FunctionCall c, int i, int a, int b
where f = c.getTarget()
  and a = c.getArgument(i).getType().(ArrayType).getArraySize()
  and b = f.getParameter(i).getType().(ArrayType).getArraySize()
  and a < b
select c.getArgument(i), "Array of size " + a
       + " passed to $@, which expects an array of size " + b + ".",
       f, f.getName()

Note

Once the mission critical bug was discovered on Curiosity, JPL contacted Semmle for help discovering whether variants of the problem might exist elsewhere in the Curiosity control software. In 20 minutes, research engineers from Semmle produced a CodeQL query and shared it with the JPL team. It finds all functions that are passed an array as an argument whose size is smaller than expected.

(The goal here is not to fully understand the query, but to illustrate the power of the language and its standard libraries.)

Find all instances!

  • When applied to the code, the query found the original bug and around 30 others.
  • Three of those were in the same entry, descent and landing module.
  • All were fixed with a mid-flight patch.
  • For more detail on the collaboration between Semmle and NASA, see our case study: Semmle at NASA: Landing Curiosity safely on Mars.

Note

The JPL team ran the query across the full Curiosity control software–it identified the original problem, and more than 30 other variants, of which three were in the critical Entry, Descent, and Landing module.

The team addressed all issues, and patched the firmware remotely. Not long after, the Curiosity Rover landed safely on Mars.

How it all works

Analysis overview

../_images/analysis-overview1.png

Note

Semmle’s analysis works by extracting a queryable database from your project. For compiled languages, Semmle’s tools observe an ordinary build of the source code. Each time a compiler is invoked to process a source file, a copy of that file is made, and all relevant information about the source code (syntactic data about the abstract syntax tree, semantic data like name binding and type information, data on the operation of the C preprocessor, etc.) is collected. For interpreted languages, the extractor gathers similar information by running directly on the source code. Multi-language code bases are analyzed one language at a time.

Once the extraction finishes, all this information is collected into a single CodeQL database, which is then ready to query, possibly on a different machine. A copy of the source files, made at the time the database was created, is also included in the CodeQL database so analysis results can be displayed at the correct location in the code. The database schema is (source) language specific.

Queries are written in QL and usually depend on one or more of the standard CodeQL libraries (and of course you can write your own custom libraries). They are compiled into an efficiently executable format by the QL compiler and then run on a CodeQL database by the QL evaluator, either on a remote worker machine or locally on a developer’s machine.

Query results can be interpreted and presented in a variety of ways, including displaying them in an IDE plugin such as QL for Eclipse, or in a web dashboard as on LGTM.

Introducing QL

QL is the query language running all CodeQL analysis.

QL is:

  • a logic language based on first-order logic
  • a declarative language without side effects
  • an object-oriented language
  • a query language working on a read-only CodeQL database
  • equipped with rich standard libraries for program analysis

Note

QL is the high-level, object-oriented logic language that underpins all CodeQL libraries and analyses. You can learn lots more about QL by visiting Introduction to the QL language and About QL. The key features of QL are:

  • All common logic connectives are available, including quantifiers like exist, which can also introduce new variables.
  • The language is declarative–the user focuses on stating what they would like to find, and leaves the details of how to evaluate the query to the engine.
  • The object-oriented layer allows Semmle to distribute rich standard libraries for program analysis. These model the common AST node types, control flow and name lookup, and define further layers on top–for example control flow or data flow analysis. The standard CodeQL libraries and queries ship as source and can be inspected by the user, and new abstractions are readily defined.
  • The database generated by Semmle’s tools is treated as read-only; queries cannot insert new data into it, though they can inspect its contents in various ways.

You can start writing running queries on open source projects in the query console on LGTM.com. You can also download CodeQL databases from LGTM.com to query locally, by running queries in your IDE.

Oops

int write(int[] buf, int size, int loc, int val) {
    if (loc >= size) {
       // return -1;
    }

    buf[loc] = val;

    return 0;
}
  • The return statement has been commented out (during debugging?)
  • The if statement is now dead code
  • No explicit bounds checking, will throw ArrayIndexOutOfbounds

Note

Here’s a simple (artificial) bug, which we’ll develop a query to catch.

This function writes a value to a given location in an array, first trying to do a bounds check to validate that the location is within bounds. However, the return statement has been commented out, leaving a redundant if statement and no bounds checking.

This case can act as our “patient zero” in the variant analysis game.

A simple CodeQL query

import java

from IfStmt ifstmt, Block block
where
  block = ifstmt.getThen() and
  block.getNumStmt() = 0
select ifstmt, "This if-statement is redundant."

Note

We are going to write a simple query which finds “if statements” with empty “then” blocks, so we can highlight the results like those on the previous slide. The query can be run in the query console on LGTM, or in your IDE.

A query consists of a “select” clause that indicates what results should be returned. Typically it will also provide a “from” clause to declare some variables, and a “where” clause to state conditions over those variables. For more information on the structure of query files (including links to useful topics in the QL language handbook), see Introduction to query files.

In our example here, the first line of the query imports the CodeQL library for Java, which defines concepts like IfStmt and Block. The query proper starts by declaring two variables–ifStmt and block. These variables represent sets of values in the database, according to the type of each of the variables. For example, ifStmt has the type IfStmt, which means it represents the set of all if statements in the program.

If we simply selected these two variables:

from IfStmt ifStmt, Block block
select ifStmt, block

We would get a result row for every combination of blocks and if statements in the program. This is known as a cross-product, because there is no logical condition linking the two variables. We can use the where clause to specify the condition that we are only interested in rows where the “block” is the “then” part of the if statement. We do this by specifying:

block = ifStmt.getThen()

This states that the block is equal to (not assigned!) the “then” part of the ifStmt. getThen() is an operation which is available on any IfStmt. One way to interpret this is as a filtering operation – starting with every pair of block and if statements, check each one to whether the logical condition holds, and only keep the row if that is the case. We can add a second condition that specifies the block must be empty:

and block.isEmpty()

The isEmpty() operation is available on any Block, and is only true if the “block” has no children.

Finally, we select a location, at which to report the problem, and a message, to explain what the problem is.

Structure of a query

A query file has the extension .ql and contains a query clause, and optionally predicates, classes, and modules.

A query library has the extension .qll and does not contain a query clause, but may contain modules, classes, and predicates.

Each query library also implicitly defines a module.

Import statements allow the classes and predicates defined in one module to be used in another.

Note

Queries are always contained in query files with the file extension .ql. Quick queries, run in QL for Eclipse, are no exception: the quick query window maintains a temporary QL file in the background.

Parts of queries can be lifted into library files with the extension .qll. Definitions within such libraries can be brought into scope using “import” statements, and similarly QLL files can import each other’s definitions using “import” statements.

Logic can be encapsulated as user-defined predicates and classes, and organized into modules. Each QLL file implicitly defines a module, but QL and QLL files can also contain explicit module definitions, as we will see later.

Predicates

A predicate allows you to pull out and name parts of a query.

import java

from IfStmt ifstmt, Block block
where
  block = ifstmt.getThen() and
  block.getNumStmt() = 0
select ifstmt, "This if-statement is redundant."
import java

predicate isEmpty(Block block) {
  block.getNumStmt() = 0
}

from IfStmt ifstmt
where isEmpty(ifstmt.getThen())
select ifstmt

Note

A predicate takes zero or more parameters, and its body is a condition on those parameters. The predicate may (or may not) hold. Predicates may also be recursive, simply by referring to themselves (directly or indirectly).

You can imagine a predicate to be a self-contained from-where-select statement, that produces an intermediate relation, or table. In this case, the isEmpty predicate will be the set of all blocks which are empty.

Classes in QL

A QL class allows you to name a set of values and define (member) predicates on them.

A class has at least one supertype and optionally a characteristic predicate; it contains the values that belong to all supertypes and satisfy the characteristic predicate, if provided.

Member predicates are inherited and can be overridden.

class EmptyBlock extends Block {
  EmptyBlock() {
    this.getNumStmt() = 0
  }
}

Note

Classes model sets of values from the database. A class has one or more supertypes, and inherits member predicates (methods) from each of them. Each value in a class must be in every supertype, but additional conditions can be stated in a so-called characteristic predicate, which looks a bit like a zero-argument constructor.

In the example, declaring a variable “EmptyBlock e” will allow it to range over only those blocks that have zero statements.

Classes in QL continued

import java

predicate isEmpty(Block block) {
  block.getNumStmt() = 0
}

from IfStmt ifstmt
where isEmpty(ifstmt.getThen())
select ifstmt
import java

class EmptyBlock extends Block {
  EmptyBlock() {
    this.getNumStmt() = 0
  
}

from IfStmt ifstmt
where ifstmt.getThen() instanceof
      EmptyBlock
select ifstmt

Note

As shown in this example, classes behave much like unary predicates, but with instanceof instead of predicate calls to check membership. Later on, we will see how to define member predicates on classes.

Iterative query refinement

  • Common workflow: Start with a simple query, inspect a few results, refine, repeat.
  • For example, empty then branches are not a problem if there is an else.
  • Exercise: How can we refine the query to take this into account?

Hints:

  • Use member predicate IfStmt.getElse()
  • Use not exists(...)

Note

CodeQL makes it very easy to experiment with analysis ideas. A common workflow is to start with a simple query (like our “redundant if-statement” example), examine a few results, refine the query based on any patterns that emerge and repeat.

As an exercise, refine the redundant-if query based on the observation that if the if-statement has an “else” clause, then even if the body of the “then” clause is empty, it’s not actually redundant.

Model answer: redundant if-statement

import java

class EmptyBlock extends Block {
  EmptyBlock() { this.getNumStmt() = 0 }
}

from IfStmt ifstmt
where
  ifstmt.getThen() instanceof EmptyBlock and
  not exists(ifstmt.getElse())
select ifstmt, "This if-statement is redundant."

Note

You can explore the results generated when this query is run on apache/struts in LGTM here.