
This topic describes how Semmle Core analysis works.

Data collection

Semmle Core analysis works by creating a 'snapshot' of your project. A snapshot contains a copy of your source files, which is used to generate a relational database that represents the whole code base at the time the copy was made. The database is generated by a language-specific extraction process. For compiled languages, such as Java, the analysis monitors the build process by intercepting system calls and detecting calls to the compiler. The interception is completely transparent: you make your normal build command available for monitoring by Semmle Core, and everything else happens automatically. You don't need to change your build scripts in any way. Through this interception, an extractor runs on every source file and converts it to a relational representation, which we call a 'trap' file. For instance, for Java source code, we produce a .trap file for every .java file, in the same way that the normal compiler produces a .class file.

For languages that are not compiled (such as Python and JavaScript), the extractor runs directly on the source code, resolving dependencies for accuracy where possible. The resulting .trap files are treated in exactly the same way as those for compiled languages. Other data formats extracted directly in this way include XML and CSV files. These files are typically analyzed to obtain further configuration information relevant to the code base, for example, Spring XML configuration files.

Advantages of this approach

Through the interception of compiler calls, the extractor has all the information available to the compiler (such as the class path), so it can accurately bind symbols to their definitions. This accuracy is critical for deep, non-local analyses that cross the boundaries of compilation units, for example, propagating tainted data to find security vulnerabilities such as cross-site scripting and SQL injection.
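
To illustrate, here is a minimal sketch of how such a taint-tracking query can be written in QL. It assumes the Java data-flow libraries found in recent QL distributions (the library paths and class names, such as RemoteFlowSource and TaintTracking::Configuration, may vary between versions); it flags untrusted input that reaches the SQL string of an executeQuery call:

    import java
    import semmle.code.java.dataflow.DataFlow
    import semmle.code.java.dataflow.FlowSources
    import semmle.code.java.dataflow.TaintTracking

    // Sketch: track untrusted ("remote") input into SQL query execution.
    class SqlInjectionConfig extends TaintTracking::Configuration {
      SqlInjectionConfig() { this = "SqlInjectionConfig" }

      // Sources: data arriving from outside, such as HTTP request parameters.
      override predicate isSource(DataFlow::Node source) {
        source instanceof RemoteFlowSource
      }

      // Sinks: the SQL string passed to a method named executeQuery.
      override predicate isSink(DataFlow::Node sink) {
        exists(MethodAccess ma |
          ma.getMethod().hasName("executeQuery") and
          sink.asExpr() = ma.getArgument(0)
        )
      }
    }

    from SqlInjectionConfig config, DataFlow::Node source, DataFlow::Node sink
    where config.hasFlow(source, sink)
    select sink, "Possible SQL injection from untrusted input."

Because a query like this follows calls between methods and classes, it only gives trustworthy results when the database was built from a complete, compiled view of the program.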

As a simple concrete example, consider the following rule for Java: when you call x.equals(y), the type of x (say T1) and the type of y (say T2) should be compatible. That is, T1 and T2 should have a common subtype in the inheritance hierarchy. This is a very effective check, finding real bugs in almost every large Java code base. To find code that breaks this rule, you need to run a build, because the analysis needs to understand the full inheritance hierarchy. One could try to get away with partial processing of the inheritance hierarchy, but at that point we would be re-implementing a Java compiler.
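
In QL, this rule can be expressed along the following lines. This is a rough sketch rather than the shipped query; it assumes the standard Java library, where getASupertype relates a type to its direct supertypes:

    import java

    // Sketch: find calls x.equals(y) where the static types of x and y
    // have no common subtype, so the comparison can never succeed.
    from MethodAccess ma, RefType t1, RefType t2
    where
      ma.getMethod().hasName("equals") and
      ma.getNumArgument() = 1 and
      t1 = ma.getQualifier().getType() and
      t2 = ma.getArgument(0).getType() and
      // No type is a subtype of both T1 and T2; getASupertype*() is the
      // reflexive transitive closure of the direct-supertype relation.
      not exists(RefType common |
        common.getASupertype*() = t1 and
        common.getASupertype*() = t2
      )
    select ma, "Call to equals() on incompatible types."

The not exists(...) condition is exactly the common-subtype test described above, and it can only be answered correctly when the whole inheritance hierarchy is in the database.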

For some languages, a compiler is available that is extremely robust against errors in its input (Roslyn for C# has this property). For these languages, we can still produce a reasonable relational representation even when the standard build method fails.

Technology

There is one extractor for each supported language: we do not have a universal representation that is used for all languages. This ensures that analysis is as accurate as possible, accounting for the fact that:

  • Languages can be vastly different and require very different processing. For example, XML is processed very differently from Python.

  • Subtle differences must be accommodated. For example, the notion of inheritance differs between C++ and Java.

Database creation

Analogous to the creation of a .jar file from .class files, when data collection is complete the extractor takes all the .trap files and imports them into a database that represents the whole program. On completion, the snapshot contains the database alongside a copy of each source file analyzed. This allows the results of any further analysis of the database to be shown directly in the appropriate source code file.

Each language has its own database schema. The schema specifies a table for every language construct: methods, expressions, and so forth. Typically, we also create libraries of common idioms and queries, written in QL, that make it easy to ask questions about projects written in a particular language. Indeed, there are standard query libraries for each language supported by Semmle Core.
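
As a flavor of what these libraries make possible, a query over a Java project can be as short as the following sketch (the threshold and message are invented for this example; getNumberOfParameters is assumed from the standard Java library):

    import java

    // Sketch: use the standard Java query library to flag methods
    // with long parameter lists.
    from Method m
    where m.getNumberOfParameters() > 7
    select m, "This method has more than 7 parameters."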

The underlying database platform and query language are the same no matter what data is processed. It is a general relational database, optimized in two ways for the kind of data found in software engineering:

  1. It provides an efficient implementation of recursion, which is needed if you want to ask deep questions about hierarchies and graphs (see the sketch after this list).

  2. It would be painful to write queries against the raw data (the C++ database schema specifies over 160 tables), so we offer an object-oriented mechanism for views that abstracts away from the physical data layout. Such extensive use of views would be prohibitively expensive on traditional databases, but we have developed techniques that provide the luxury of abstraction without the runtime cost.
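
As an example of point 1, QL's + operator applies a relation one or more times, giving transitive closure directly in a query. The sketch below (again assuming the standard Java library) finds every class that directly or indirectly implements java.util.Collection:

    import java

    // Sketch: getASupertype+() recursively walks the inheritance
    // hierarchy, however deep it is.
    from Class c
    where c.getASupertype+().hasQualifiedName("java.util", "Collection")
    select c, "A (possibly indirect) implementation of Collection."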

Analysis

When snapshot creation is complete, the snapshot is analyzed by running your chosen set of queries against the database. Typically, each query identifies code that breaks best coding practices or calculates metrics for the code base. Examples of customer queries include queries that find code that fails to implement internal frameworks correctly, flag coding patterns that have caused past production incidents, and highlight code that needs to be upgraded before the application can support a new technology.
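
A metric query computes a value rather than flagging a violation. For instance, the following sketch (assuming the metrics classes of the standard Java library) reports the cyclomatic complexity of every method:

    import java

    // Sketch: report a metric (cyclomatic complexity) per method.
    from Method m
    select m, m.getMetrics().getCyclomaticComplexity()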

More information

For more information about QL, the language used to write queries, see QL resources. For details of some of the standard rules and metrics available in Semmle analysis, see: