Semmle 1.18

This topic describes how Semmle Core analysis works.

Data collection

Semmle code analysis works by monitoring the build process for compiled languages: it listens to system calls and detects calls to the compiler. The interception is transparent: you make your normal build command available for monitoring by Semmle Core, and the rest happens automatically. You don't need to change your build scripts in any way. For every source file that the intercepted build compiles, we run an "extractor" that converts the source to a relational representation, which we call a "trap" file. For instance, for Java source code, we produce a .trap file for every .java file, in the same way that the normal compiler produces a .class file.

The above description of build interception applies to compiled languages (such as Java). For other languages (such as Python and JavaScript), the extractor runs directly on the source code, resolving dependencies where possible to improve accuracy. The resulting .trap files are treated in exactly the same way as for compiled languages. XML and CSV files are also extracted directly in this way. These files are typically analyzed to obtain further configuration information relevant to the code base, for example from Spring XML configuration files.
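To make the idea of a relational representation more concrete, the following minimal Java sketch turns one already-parsed class declaration into rows of named tables. It is an illustration only: the table names (classes, extends, methods), the row syntax, and the RelationalSketch class are assumptions made for this example, and they do not reflect the actual .trap file format or any Semmle schema.

```java
import java.util.ArrayList;
import java.util.List;

class RelationalSketch {
    // Converts one class declaration into rows of hypothetical tables.
    // The table names and the textual row format are invented for this sketch.
    static List<String> toRows(String className, String superName, List<String> methodNames) {
        List<String> rows = new ArrayList<>();
        rows.add("classes(" + className + ")");
        rows.add("extends(" + className + ", " + superName + ")");
        for (String method : methodNames) {
            rows.add("methods(" + className + ", " + method + ")");
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> methods = new ArrayList<>();
        methods.add("deposit");
        methods.add("withdraw");
        // Prints one row per line for a hypothetical SavingsAccount class.
        for (String row : toRows("SavingsAccount", "Account", methods)) {
            System.out.println(row);
        }
    }
}
```

A real extractor records far more detail (expressions, types, source locations), but the principle is the same: program structure becomes rows that a database can query.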

Advantages of this approach

Because compiler calls are intercepted, the extractor has all the information available to the compiler, such as the class path, and can bind symbols to their definitions accurately. This accuracy is critical for deep non-local analyses that cross the boundaries of compilation units, for example, propagating tainted data to find security vulnerabilities such as cross-site scripting and SQL injection.

As a simple concrete example, consider the following rule for Java: when you call x.equals(y), the types of x (say T1) and y (say T2) should be compatible. That is, T1 and T2 should have a common subtype in the inheritance hierarchy. If they do not, no object can be an instance of both types, so the comparison is almost certainly a mistake. This is a very effective check, finding real bugs on almost every large Java code base. To find code that breaks this rule you need to run a build, as the analysis needs to properly understand the inheritance hierarchy. One could try to get away with partial processing of the inheritance hierarchy, but at that point we'd be re-implementing a Java compiler.
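The following Java sketch shows the kind of code this rule is meant to catch. The CustomerId and EqualsMismatchDemo classes are hypothetical and exist only for this illustration. Both CustomerId and String are final, so they have no common subtype, and the first comparison can never succeed.

```java
import java.util.Objects;

class EqualsMismatchDemo {
    // Bug: the receiver is a CustomerId and the argument is a String.
    // The two types have no common subtype, so this is always false.
    static boolean isSameCustomerBuggy(CustomerId id, String raw) {
        return id.equals(raw);
    }

    // Fix: wrap the raw value so that both sides have the same type.
    static boolean isSameCustomerFixed(CustomerId id, String raw) {
        return id.equals(new CustomerId(raw));
    }

    public static void main(String[] args) {
        CustomerId id = new CustomerId("c-42");
        System.out.println(isSameCustomerBuggy(id, "c-42")); // prints false
        System.out.println(isSameCustomerFixed(id, "c-42")); // prints true
    }
}

// Hypothetical value class, used only for this example.
final class CustomerId {
    private final String value;

    CustomerId(String value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CustomerId
                && ((CustomerId) other).value.equals(value);
    }

    @Override
    public int hashCode() {
        return Objects.hash(value);
    }
}
```

The buggy version compiles without complaint because equals accepts any Object, which is why the check has to reason about the inheritance hierarchy rather than rely on the compiler's type checking alone.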

For some languages, there is already a compiler available that is extremely robust against errors in the input. Roslyn for C# is one example. For these languages, we can still produce a reasonable relational representation even when the standard build method fails.

Technology

There is one extractor for each supported language: we do not have a universal representation that is used for all languages. This ensures that analysis is as accurate as possible, accounting for the fact that:

  • Languages can be vastly different and require very different processing. For example, XML is processed very differently from Python.

  • Subtle differences must be accommodated. For example, the notion of inheritance differs between C++ and Java.

Database creation

When data collection is complete, the extractor takes all the .trap files and imports them into a database that represents the "whole" program. This is analogous to the creation of a .jar file from .class files. On completion, the extractor creates a snapshot containing the database and a copy of each source file analyzed. This allows results to be shown directly in the source code file.

Each language has its own unique database schema. The schema specifies, for instance, that there are tables of methods, of expressions, and so forth—a table for every language construct. Typically we also create libraries of common idioms that make it easy to ask questions about a particular language, and indeed there are standard query libraries for each language supported.

The underlying database platform and query language are the same no matter what data is processed. It's a general relational database, which is optimized for the type of data found in software engineering, in two ways:

  1. It provides an efficient implementation of recursion, which is needed if you want to ask deep questions of hierarchies and graphs.

  2. It offers an object-oriented mechanism for views that abstract from the physical data layout. Without this, it would be painful to write queries on the raw data (the C++ database schema specifies over 160 tables). Such extensive use of views would be prohibitively expensive on traditional databases, but we've invented ways of having the luxury of abstraction without the runtime cost.
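To illustrate why recursion matters for questions about hierarchies, the sketch below computes every supertype of a type by repeatedly following a direct "extends" relation until nothing new is found. It is written in Java rather than in the query language, and the SupertypeClosure class and the shape of the input table are assumptions made for this example, not a description of how the Semmle database evaluates recursive queries.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SupertypeClosure {
    // directSupertypes plays the role of a hypothetical "extends(sub, super)"
    // table: each key maps to the types it directly extends or implements.
    static Set<String> allSupertypes(Map<String, List<String>> directSupertypes, String type) {
        Set<String> found = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(type);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            List<String> parents =
                    directSupertypes.getOrDefault(current, Collections.emptyList());
            for (String parent : parents) {
                // Set.add returns false for a type that was already visited,
                // so a type reachable along two paths is processed only once.
                if (found.add(parent)) {
                    pending.push(parent);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new HashMap<>();
        table.put("ArrayList", Arrays.asList("AbstractList", "List"));
        table.put("AbstractList", Arrays.asList("AbstractCollection", "List"));
        table.put("AbstractCollection", Arrays.asList("Collection"));
        table.put("List", Arrays.asList("Collection"));
        table.put("Collection", Arrays.asList("Iterable"));
        System.out.println(allSupertypes(table, "ArrayList"));
    }
}
```

A single non-recursive join over the same table would only reach direct supertypes. Answering "all supertypes" needs the fixed-point iteration shown here, which is the kind of question the database is designed to answer efficiently.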

Analysis

When the snapshot is complete, it is analyzed by running your chosen set of queries on the database. Typically each query identifies code that breaks best coding practices or calculates metrics for the code base. Examples of customer queries include queries to find code that fails to implement internal frameworks correctly, to flag coding patterns which have caused past production incidents, and to highlight code which needs to be upgraded before the application can support a new technology.

More information

For more information about the query language used to write queries, see QL resources. For details of some of the standard rules and metrics available in Semmle analysis, see: