Tutorial: Points-to analysis and type inference

This topic contains worked examples of how to write queries using the standard QL library classes for Python type inference.

The Value class

The Value class and its subclasses FunctionValue, ClassValue and ModuleValue represent the values an expression may hold at runtime.

Summary

Class hierarchy for Value:

  • Value
    • ClassValue
    • FunctionValue
    • ModuleValue

Points-to analysis and type inference

Points-to analysis, sometimes known as pointer analysis, allows us to determine which objects an expression may “point to” at runtime.

Type inference allows us to infer what the types (classes) of an expression may be at runtime.

The predicate ControlFlowNode.pointsTo(...) shows which object a control flow node may “point to” at runtime.

ControlFlowNode.pointsTo(...) has three variants:

predicate pointsTo(Value object)
predicate pointsTo(Value object, ControlFlowNode origin)
predicate pointsTo(Context context, Value object, ControlFlowNode origin)
object is an object that the control flow node refers to, and origin is where the object comes from, which is useful for displaying meaningful results.
The third form includes the context in which the control flow node refers to the object. This form can usually be ignored.

Note

ControlFlowNode.pointsTo() cannot find all objects that a control flow node might point to as it is impossible to be accurate and to find all possible values. We prefer precision (no incorrect values) over recall (finding as many values as possible). We do this so that queries based on points-to analysis have fewer false positive results and are thus more useful.

For complex data flow analyses, involving multiple stages, the ControlFlowNode version is more precise, but for simple use cases the Expr based version is easier to use. For convenience, the Expr class also has the same three predicates. Expr.pointsTo(...) also has three variants:

predicate pointsTo(Value object)
predicate pointsTo(Value object, AstNode origin)
predicate pointsTo(Context context, Value object, AstNode origin)

Using points-to analysis

In this example we use points-to analysis to build a more complex query. This query is included in the standard query set.

We want to find except blocks in a try statement that are in the wrong order. That is, where a more general exception type precedes a more specific one, which is a problem as the second except handler will never be executed.

First we can write a query to find ordered pairs of except blocks for a try statement.

Ordered except blocks in same ``try`` statement

import python

from Try t, ExceptStmt ex1, ExceptStmt ex2
where
exists(int i, int j |
    ex1 = t.getHandler(i) and ex2 = t.getHandler(j) and i < j
)
select t, ex1, ex2

See this in the query console. Many projects contain ordered except blocks in a try statement.

Here ex1 and ex2 are both except handlers in the try statement t. By using the indices i and j we can also ensure that ex1 precedes ex2.

The results of this query need to be filtered to return only results where ex1 is more general than ex2. We can use the fact that an except block is more general than another block if the class it handles is a superclass of the other.

More general ``except`` block

exists(ClassValue cls1, ClassValue cls2 |
    ex1.getType().pointsTo(cls1) and
    ex2.getType().pointsTo(cls2) |
    not cls1 = cls2 and
    cls1 = cls2.getASuperType()
)

The line:

ex1.getType().pointsTo(cls1)

ensures that cls1 is a ClassValue that the except block would handle.

Combining the parts of the query we get this:

More general ``except`` block precedes more specific

import python

from Try t, ExceptStmt ex1, ExceptStmt ex2
where
exists(int i, int j |
    ex1 = t.getHandler(i) and ex2 = t.getHandler(j) and i < j
)
and
exists(ClassValue cls1, ClassValue cls2 |
    ex1.getType().pointsTo(cls1) and
    ex2.getType().pointsTo(cls2) |
    not cls1 = cls2 and
    cls1 = cls2.getASuperType()
)
select t, ex1, ex2

See this in the query console. This query finds only one result in the demo projects on LGTM.com (youtube-dl). The result is also highlighted by the standard query: Unreachable ‘except’ block.

Note

If you want to submit a query for use in LGTM, then the format must be of the form select element message. For example, you might replace the select statement with: select t, "Incorrect order of except blocks; more general precedes more specific"

Using type inference

In this example we use type inference to determine when an object is used as a sequence in a for statement, but that object might not be an "iterable".

First of all find what object is used in the for loop:

from For loop, Value iter
where loop.getIter().pointsTo(iter)
select loop, iter

Then we need to determine if the object iter is iterable. We can test ClassValue to see if it has the __iter__ attribute.

Find non-iterable object used as a loop iterator

import python

from For loop, Value iter, ClassValue cls
where loop.getIter().getAFlowNode().pointsTo(iter) and
  cls = iter.getClass() and
  not exists(cls.lookup("__iter__"))
select loop, cls

See this in the query console. Many projects use a non-iterable as a loop iterator.

Many of the results shown will have cls as NoneType. It is more informative to show where these None values may come from. To do this we use the final field of pointsTo, as follows:

Find non-iterable object used as a loop iterator 2

import python

from For loop, Value iter, ClassValue cls, AstNode origin
where loop.getIter().pointsTo(iter, origin) and
  cls = iter.getClass() and
  not cls.hasAttribute("__iter__")
select loop, cls, origin

See this in the query console. This reports the same results, but with a third column showing the source of the None values.

Finding calls using call-graph analysis

The Value class has a method getACall() which allows us to find calls to a particular function (including builtin functions).

If we wish to restrict the callables to actual functions we can use the FunctionValue class, which is a subclass of Value and corresponds to function objects in Python, in much the same way as the ClassValue class corresponds to class objects in Python.

Returning to an example from Tutorial: Functions, we wish to find calls to the eval function.

The original query looked this:

import python

from Call call, Name name
where call.getFunc() = name and name.getId() = "eval"
select call, "call to 'eval'."

See this in the query console. Some of the demo projects on LGTM.com have calls that match this pattern.

There are two problems with this query:

  • It assumes that any call to something named “eval” is a call to the builtin eval function, which may result in some false positive results.
  • It assumes that eval cannot be referred to by any other name, which may result in some false negative results.

We can get much more accurate results using call-graph analysis. First, we can precisely identify the FunctionValue for the eval function, by using the Value::named QL predicate as follows:

import python

from Value eval
where eval = Value::named("eval")
select eval

Then we can use Value.getACall() to identify calls to the eval function, as follows:

import python

from ControlFlowNode call, Value eval
where eval = Value::named("eval") and
      call = eval.getACall()
select call, "call to 'eval'."

See this in the query console. This accurately identifies calls to the builtin eval function even when they are referred to using an alternative name. Any false positive results with calls to other eval functions, reported by the original query, have been eliminated.

What next?

For more information on writing QL, see: