Introducing the QL libraries for Python

These libraries have been created to help you analyze Python code, providing an object-oriented layer on top of the raw data in the snapshot database. They are written in standard QL.

The QL libraries all have a .qll extension, to signify that they contain QL library code but no actual queries. Each file contains a QL class or hierarchy of classes.

You can include all of the standard libraries by beginning each query with this statement:

import python

The rest of this tutorial summarizes the contents of the standard QL libraries. We recommend that you read this and then work through the practical examples in the Python tutorials shown at the end of the page.

Overview of the library

The QL Python library incorporates a large number of classes, each class corresponding either to one kind of entity in Python source code or to an entity that can be derived form the source code using static analysis. These classes can be divided into four categories:

  • Syntactic - classes that represent entities in the Python source code.
  • Control flow - classes that represent entities from the control flow graphs.
  • Data flow - classes that assist in performing data flow analyses on Python source code.
  • Type inference - classes that represent the inferred types of entities in the Python source code.

Syntactic classes

This part of the library represents the Python source code. The Module, Class and Function classes correspond to Python modules, classes and functions respectively, collectively these are known as Scope classes. Each Scope contains a list of statements each of which is represented by a subclass of the class Stmt. Statements themselves can contain other statements or expressions which are represented by subclasses of Expr. Finally, there are a few additional classes for the parts of more complex expressions such as list comprehensions. Collectively these classes are subclasses of AstNode and form an Abstract syntax tree (AST). The root of each AST is a Module.

Symbolic information is attached to the AST in the form of variables (represented by the class Variable).

Scope

A Python program is a group of modules. Technically a module is just a list of statements, but we often think of it as composed of classes and functions. These top-level entities, the module, class and function are represented by the three classes (Module, Class and Function which are all subclasses of Scope.

  • Scope
    • Module
    • Class
    • Function

All scopes are basically a list of statements, although Scope classes have additional attributes such as names. For example, the following query finds all functions whose scope (the scope in which they are declared) is also a function:

import python

from Function f
where f.getScope() instanceof Function
select f

See this in the query console. Many projects have nested functions.

Statement

A statement is represented by the Stmt class which has about 20 subclasses representing the various kinds of statements, such as the Pass statement, the Return statement or the For statement. Statements are usually made up of parts. The most common of these is the expression, represented by the Expr class. For example, take the following Python for statement:

for var in seq:
    pass
else:
    return 0

The QL For class representing the for statement has a number of member predicates to access its parts:

  • getTarget() returns the Expr representing the variable var.
  • getIter() returns the Expr resenting the variable seq.
  • getBody() returns the statement list body.
  • getStmt(0) returns the pass Stmt.
  • getOrElse() returns the StmtList containing the return statement.

Expression

Most statements are made up of expressions. The Expr class is the superclass of all expression classes, of which there are about 30 including calls, comprehensions, tuples, lists and arithmetic operations. For example, the Python expression a+2 is represented by the BinaryExpr class:

  • getLeft() returns the Expr representing the a.
  • getRight() returns the Expr representing the 2.

As an example, to find expressions of the form a+2 where the left is a simple name and the right is a numeric constant we can use the following query:

Finding expressions of the form “a+2”

import python

from BinaryExpr bin
where bin.getLeft() instanceof Name and bin.getRight() instanceof Num
select bin

See this in the query console. Many projects include examples of this pattern.

Variable

Variables are represented by the Variable class in the Python QL library. There are two subclasses, LocalVariable for function-level and class-level variables and GlobalVariable for module-level variables.

Other source code elements

Although the meaning of the program is encoded by the syntactic elements, Scope, Stmt and Expr there are some parts of the source code not covered by the abstract syntax tree. The most useful of these is the Comment class which describes comments in the source code.

Examples

Each syntactic element in Python source is recorded in the snapshot. These can be queried via the corresponding class. Let us start with a couple of simple examples.

1. Finding all finally blocks

For our first example, we can find all finally blocks by using the Try class:

Find all ``finally`` blocks

import python

from Try t
select t.getFinalbody()

See this in the query console. Many projects include examples of this pattern.

2. Finding ‘except’ blocks that do nothing

For our second example, we can use a simplified version of a query from the standard query set. We look for all except blocks that do nothing.

A block that does nothing is one that contains no statements except pass statements. We can encode this as:

not exists(Stmt s | s = ex.getAStmt() | not s instanceof Pass)

where ex is an ExceptStmt and Pass is the class representing pass statements. Instead of using the double negative, “nostatements that arenotpass statements”, this can also be expressed positively, “all statements must be pass statements”. The positive form is expressed in QL using the forall quantifier:

forall(Stmt s | s = ex.getAStmt() | s instanceof Pass)

Both forms are equivalent. Using the positive QL expression, the whole query looks like this:

Find pass-only ``except`` blocks

import python

from ExceptStmt ex
where forall(Stmt s | s = ex.getAStmt() | s instanceof Pass)
select ex

See this in the query console. Many projects include pass-only except blocks.

Summary

The most commonly used standard QL library classes in the syntactic part of the library are organized as follows:

Module, Class, Function, Stmt and Expr - they are all subclasses of AstNode.

Abstract Syntax Tree
  • AstNode
    • Module – A Python module
    • Class – The body of a class definition
    • Function – The body of a function definition
    • Stmt – A statement
      • Assert – An assert statement
      • Assign – An assignment
        • AssignStmt – An assignment statement, x = y
        • ClassDef – A class definition statement
        • FunctionDef – A function definition statement
      • AugAssign – An augmented assignment, x += y
      • Break – A break statement
      • Continue – A continue statement
      • Delete – A del statement
      • ExceptStmt – The except part of a try statement
      • Exec – An exec statement
      • For – A for statement
      • If – An if statement
      • Pass – A pass statement
      • Print – A print statement (Python 2 only)
      • Raise – A raise statement
      • Return – A return statement
      • Try – A try statement
      • While – A while statement
      • With – A with statement
    • Expr – An expression
      • Attribute – An attribute, obj.attr
      • Call – A function call, f(arg)
      • IfExp – A conditional expression, x if cond else y
      • Lambda A lambda expression
      • Yield – A yield expression
      • Bytes – A bytes literal, b"x" or (in Python 2) "x"
      • Unicode – A unicode literal, u"x" or (in Python 3) "x"
      • Num – A numeric literal, 3 or 4.2
        • IntegerLiteral
        • FloatLiteral
        • ImaginaryLiteral
      • Dict – A dictionary literal, {'a': 2}
      • Set – A set literal, {'a', 'b'}
      • List – A list literal, ['a', 'b']
      • Tuple – A tuple literal, ('a', 'b')
      • DictComp – A dictionary comprehension, {k: v for ...}
      • SetComp – A set comprehension, {x for ...}
      • ListComp – A list comprehension, [x for ...]
      • GenExpr – A generator expression, (x for ...)
      • Subscript – A subscript operation, seq[index]
      • Name – A reference to a variable, var
      • UnaryExpr – A unary operation, -x
      • BinaryExpr – A binary operation, x+y
      • Compare – A comparison operation, 0 < x < 10
      • BoolExpr – Short circuit logical operations, x and y, x or y
Variables
  • Variable – A variable
    • LocalVariable – A variable local to a function or a class
    • GlobalVariable – A module level variable
Other
  • Comment – A comment

Control flow classes

This part of the library represents the control flow graph of each Scope (classes, functions and modules). Each Scope contains a graph of ControlFlowNode elements. Each scope has a single entry point and at least one (potentially many) exit points. To speed up control and data flow analysis, control flow nodes are grouped into basic blocks.

As an example, we might want to find the longest sequence of code without any branches. A BasicBlock is, by definition, a sequence of code without any branches, so we just need to find the longest BasicBlock.

First of all we introduce a simple predicate bb_length() which relates BasicBlocks to their length.

int bb_length(BasicBlock b) {
    result = max(int i | exists(b.getNode(i))) + 1
}

Each ControlFlowNode within a BasicBlock is numbered consecutively, starting from zero, therefore the length of a BasicBlock is equal to one more than the largest index within that BasicBlock.

Using this predicate we can select the longest BasicBlock by selecting the BasicBlock whose length is equal to the maximum length of any BasicBlock:

Find the longest sequence of code without branches

import python

int bb_length(BasicBlock b) {
    result = max(int i | exists(b.getNode(i)) | i) + 1
}

from BasicBlock b
where bb_length(b) = max(bb_length(_))
select b

See this in the query console. When we ran it on the LGTM.com demo projects, the openstack/nova and rg3/youtube-dl projects both contained source code results for this query.

Note

The special underscore variable _ means any value; so bb_length(_) is the length of any block.

Summary

The classes in the control-flow part of the library are:

  • ControlFlowNode – A control-flow node. There is a one-to-many relation between AST nodes and control-flow nodes.
  • BasicBlock – A non branching list of control-flow nodes.

Data flow

The SsaVariable class represents static single assignment form variables (SSA variables). There is a one-to-many relation between variables and SSA variables. The SsaVariable class provides an accurate and fast means of tracking data flow from definition to use; the SsaVariable class is an important element for building data flow analyses, including type inference.

Type-inference classes

The QL library for Python also supplies some classes for accessing the inferred types of values. The classes Object and ClassObject allow you to query the possible classes that an expression may have at runtime. For example, which ClassObjects are iterable can be determined using the query:

Find iterable ``ClassObjects``

import python

from ClassObject cls
where cls.hasAttribute("__iter__")
select cls

See this in the query console This query returns a list of classes for the projects analyzed. If you want to include the results for builtin classes, which do not have any Python source code, show the non-source results.

Summary

  • Object
    • ClassObject
    • FunctionObject
    • ModuleObject

These classes are explained in more detail in Tutorial: Points-to analysis and type inference.

What next?