Semmle 1.22

Overview

In this tutorial, we explore how to incorporate external data into a snapshot database and define custom queries to analyze it.

This tutorial covers:

  • Advanced project setup with custom build commands.
  • Custom data sources in the form of comma-separated value (CSV) files.
  • Custom QL queries ranging over the new data.

When you have completed this tutorial, the warnings issued by the compiler during the build will be included in the results of analysis. This tutorial uses the open-source IRC client, irssi, as the running example.

The examples in this tutorial have been tested on Unix systems. A similar approach can be used on Windows servers, but you may need to make a few adjustments. For example, you will need to change the Perl script to use Windows paths and, if bash is not available, you may need to use an alternative scripting language to define the custom build command.

Prerequisites

This page covers an advanced topic, and we will assume familiarity with the QL language and the basics of managing Semmle Core. In particular, you will need to know how to set up a project configuration.

Getting started: creating the project configuration

For the purposes of this tutorial, we include information on creating a new project configuration from scratch. If you already have a project configuration and want to extend it to include external data, then you can use the existing configuration – skip this section.

An easy way to get started is to use the bootstrap command to create the skeleton project configuration. Follow the usual steps, selecting a C/C++ project and defining a checkout from Subversion (http://svn.irssi.org/repos/irssi/trunk/). This project can be built using gcc (the default path on Linux computers is /usr/bin/gcc). When prompted, define a clean command of ./autogen.sh and a build command of make. Allow bootstrap to add the first snapshot and build it. The generated project file is given below for reference.

The project configuration for irssi
<project language="cpp">
  <ram>2048</ram>
  <timeout>600</timeout>
  <autoupdate>
    <checkout>svn checkout "http://svn.irssi.org/repos/irssi/trunk/" ${src}</checkout>
    <build>./autogen.sh</build>
    <build index="true">make -j4</build>
    <build>odasa duplicateCode --ram 2048 --minimum-tokens 100</build>
    <days-between-updates>1</days-between-updates>
  </autoupdate>
  <snapshot-policy>
    <max>15</max>
    <include>
      <recurrent kind="daily"/>
      <max>5</max>
    </include>
    <include>
      <recurrent kind="weekly">
        <day>Monday</day>
      </recurrent>
      <max>4</max>
    </include>
    <include>
      <recurrent kind="monthly">
        <day>1</day>
      </recurrent>
    </include>
  </snapshot-policy>
</project> 

Collecting external data

While this example shows how to incorporate compiler warnings into the Semmle database, it does not work for all setups. In particular, when make is run with multiple jobs (as with the make -j4 command in the configuration above), output from different directories can interleave in the build log, so the parsed paths may not match up.

The easiest way to collect and integrate external data into the Semmle database is to add one or more build commands that create comma-separated value (CSV) files. As long as these files are stored under the directory ${snapshot}/external/data, they will be picked up and imported into the Semmle database automatically by the buildSnapshot process. This makes the data available for analysis by custom queries (run using either the command-line tools or one of the QL plugins and extensions).
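
Before turning to the concrete example, the following minimal sketch shows how imported rows surface on the QL side: each row of an imported CSV file becomes an ExternalData entity (from the external.ExternalArtifact library), and its columns are accessed by zero-based index. The file name gcc-warnings.csv anticipates the example developed below.

Accessing imported rows from QL (sketch)
import default
import external.ExternalArtifact

// Select the first two columns (here: path and line number) of every
// row imported from gcc-warnings.csv.
from ExternalData d
where d.getDataPath() = "gcc-warnings.csv"
select d.getField(0), d.getField(1)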

In the present example, we want to parse the compiler output for any warnings (or errors), and add the relevant information to the database. One approach is to write a Perl script to parse the compiler output and record information for each warning in a CSV file. For example:

parse-gcc-warnings.pl
#!/usr/bin/perl
use strict;
use warnings;

# The last directory entered by make.
my $curDir = ".";

while (<>) {
    if (m/make\[\d+\]: Entering directory `(.*)'/) {
        # (A) Make is entering a new directory.
        $curDir = $1;
    } elsif (m/(\S+):(\d+):(\d+): (error|warning): (.*)/) {
        # (B) This is an error or warning message.
        my ($file, $line, $col, $type, $msg) = ($1, $2, $3, $4, $5);

        # (C) If the file in the message is not absolute, prepend the cur dir.
        $file = "$curDir/$file" unless $file =~ m|^/|;

        # (D) Escape values for CSV
        $file =~ s/"/""/g;
        $msg =~ s/"/""/g;

        # (E) Write CSV data about the error or warning.
        print qq/"$file","$line","$col","$type","$msg"\n/;
    }
}

Reviewing the script, we can see that it reads standard input and attempts to parse it:

  • (A) If the line contains a log message from make starting with "Entering directory", then the current directory is stored in the $curDir variable – this is needed because gcc may include only the file name, without the absolute path, in its warning messages.
  • (B) If the current line is an error or warning message, we parse out the file name, line number, column number, the kind and the actual message. An example line from the log output might be this: fe-windows.c:314:4: warning: ‘g_strcasecmp’ is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]. In this situation, the variables would have the following values:
    • $file: fe-windows.c
    • $line: 314
    • $col: 4
    • $type: warning
    • $msg: ‘g_strcasecmp’ is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]
  • (C) Depending on the invocation, gcc may have included the absolute path or only a relative path. Consequently, if the captured file name does not start with a slash, we prepend the current directory.
  • (D) The message, and – in theory – the file name, might contain double quotes. Since we want to create a double-quote delimited CSV, such characters must be escaped by doubling them. For example, the message say "hi" would be written as "say ""hi""" in the CSV.
  • (E) Finally, we print a line of CSV data. Each line has five columns, namely the five pieces of information we have collected.

As an example, here are the first few lines of the generated CSV file:

"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/lib-config/get.c","34","3","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/lib-config/parse.c","25","2","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/commands.c","50","3","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/commands.c","68","3","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/channels.c","113","3","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/commands.c","113","3","warning","'g_strncasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:203) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/channels-setup.c","105","3","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/channels-setup.c","106","7","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/channels.c","202","7","warning","'g_strcasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:200) [-Wdeprecated-declarations]"
"/.../odasa/projects/irssi/revision-2013-July-17--15-59-33/src/src/core/commands.c","268","3","warning","'g_strncasecmp' is deprecated (declared at /usr/include/glib-2.0/glib/gstrfuncs.h:203) [-Wdeprecated-declarations]"

The Perl script above is an example – any script, command or program that produces comma-separated values is suitable for use in this situation.

Save the script as ${project}/parse-gcc-warnings.pl.

Now, you need to update the project configuration to call the script during every snapshot build. Add the following build command to the project configuration, after the line that invokes make:

Build command for collecting gcc warnings data
<build>bash -c "perl ${project}/parse-gcc-warnings.pl &lt; ${snapshot}/log/build.log &gt; ${snapshot}/external/data/gcc-warnings.csv"</build>

This uses the redirection operators < and >, implemented by bash, to feed the current build log to our Perl script, and to save the output as ${snapshot}/external/data/gcc-warnings.csv, thus fulfilling the contract for providing external data.

Create a new snapshot and build it to ensure you have a database that has the extra information.

If you have already created a snapshot using the bootstrap command, then addLatestSnapshot will not create a new snapshot because there is already a snapshot with today's date. You can force addLatestSnapshot to discard the existing snapshot for today and create a new one (using the updated project configuration) by running: odasa addLatestSnapshot --overwrite.

Custom analyses

The final step is to write a query to process the new data to produce violations that can be displayed in client applications. The following QL file demonstrates the key requirements for writing queries to process external data. Save a copy of this query in a custom query folder, for example: odasa/queries/local-c/GccWarnings.ql

GccWarnings.ql
/**
 * @name GCC warnings
 * @description The warnings produced by GCC during the build.
 * @kind problem
 * @problem.severity warning
 */
import default
import external.ExternalArtifact
/**
 * A particular type of external data, namely a GCC warning.
 */
class GccWarning extends ExternalData {
    /**
     * This class represents rows from the 'gcc-warnings.csv' file.
     */
    GccWarning() {
        this.getDataPath() = "gcc-warnings.csv"
    }
    
    /**
     * The absolute path of the file in which the warning occurs.
     */
    string getPath() {
        result = getField(0)
    }
    
    /**
     * The reported line of the warning.
     */
    int getLine() {
        result = getFieldAsInt(1)
    }
    
    /**
     * The reported column of the warning.
     */
    int getCol() {
        result = getFieldAsInt(2)
    }
    
    /**
     * The "kind" of the warning -- typically, this will be the string "warning".
     */
    string getKind() {
        result = getField(3)
    }
    
    /**
     * The warning message.
     */
    string getMessage() {
        result = getField(4)
    }
	
    /**
     * The file associated with this warning.
     */
    File getFile() {
        result.getFullName() = this.getPath()
    }
    
    /**
     * The URL associated with this warning.
     */
    string getURL() {
        exists(string path, int line, int col |
            path = this.getPath() and
            line = this.getLine() and
            col = this.getCol() and
            toUrl(path, line, col, line, col, result)
        )
    }
}

from GccWarning w
select w, "gcc " + w.getKind() + ": " + w.getMessage()

As always with QL, you could move the class definition into a QLL library that is then imported by any queries where you want to use the class.

By defining the getURL() method, the GccWarning class identifies where to find the code that generated the warning. This enables client applications to display the violation with the correct line of source code. This is precisely the same mechanism that is used for the default analyses, as described in the topic: Locations and strings for QL entities.

The getFile() method is of particular interest as it ties each row of the imported CSV file to a specific database element – in this case, a File object in QL. In this query, we achieve that by stating that the path reported with the gcc warning matches the full name of the associated database File, but more complex matching schemes (indeed, arbitrary predicates) are possible.
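
For instance, a more tolerant scheme might match on a path suffix rather than on the exact full name. The following hypothetical sketch (getFileBySuffix is not part of the tutorial's class) could be added to GccWarning:

A suffix-based alternative to getFile() (hypothetical sketch)
/*
 * Match any file in the database whose full name ends with the path
 * reported by gcc. Note that "%" and "_" act as wildcards in QL's
 * matches() patterns, so this assumes the reported paths contain
 * neither character.
 */
File getFileBySuffix() {
    result.getFullName().matches("%" + this.getPath())
}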

File paths imported into the database by Semmle's tools are normalized, and consequently any file paths mentioned in the external data should be normalized using the same process to avoid spurious differences. The following steps are performed during path normalization:

  • File canonicalization: This expands any symbolic links in the path name, replacing them with their targets instead.
  • Slash normalization: On Windows, backslashes are converted to forward slashes.
  • Case normalization: On Windows (and other case-insensitive file systems), each directory's "standard" case is used.

For example, the path c:\UseRS\aDmiNIStraTOr might be normalized to C:/Users/Administrator.

Putting it all together

The final step is to add the newly defined custom analysis to a query suite and run it using analyzeSnapshot. This process is described in Preparing custom queries and Grouping queries. In essence, a line like the following will need to be added to a custom query suite:

+ local-c/GccWarnings.ql

(This assumes that the query was saved at odasa/queries/local-c/GccWarnings.ql.)

Metrics based on external data

A common use case for external data is to define metrics based on the imported data. For example, if a test coverage tool reports its results in a CSV format, those reports can be included in the snapshot database using the steps outlined above, and new metrics reporting on the code coverage of various source elements can be defined.
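
As a sketch of what such a class might look like – assuming, hypothetically, that the coverage tool writes a file coverage.csv with columns for path, covered lines and total lines – the same ExternalData predicates used in this tutorial apply:

CoverageRow (hypothetical sketch)
import default
import external.ExternalArtifact

class CoverageRow extends ExternalData {
    CoverageRow() {
        this.getDataPath() = "coverage.csv"
    }

    string getPath() { result = getField(0) }

    int getCoveredLines() { result = getFieldAsInt(1) }

    int getTotalLines() { result = getFieldAsInt(2) }

    File getFile() { result.getFullName() = this.getPath() }
}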

For example, you can extend the GCC warnings query above to define a simple metric: the number of GCC warnings per function in a source file.

First, since we now want to re-use the same QL class definition in multiple queries, let us move the definition into a library file, GCC.qll. Then we can update the original query, GccWarnings.ql, replacing the QL class definition with an import statement (import GCC) that pulls in the definitions from the new library file. A sketch of this refactoring is given below.
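
GCC.qll (sketch – the class body moves verbatim from the original query)
import default
import external.ExternalArtifact

class GccWarning extends ExternalData {
    GccWarning() {
        this.getDataPath() = "gcc-warnings.csv"
    }

    // ... the remaining predicates (getPath, getLine, getCol, getKind,
    // getMessage, getFile and getURL) move here unchanged ...
}

The original query then shrinks to the following:

GccWarnings.ql (after the refactoring)
/**
 * @name GCC warnings
 * @description The warnings produced by GCC during the build.
 * @kind problem
 * @problem.severity warning
 */
import GCC

from GccWarning w
select w, "gcc " + w.getKind() + ": " + w.getMessage()

Finally, we can define our metric (remember to check the query metadata for metric queries):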

GccWarningsMetric.ql
/**
 * @name GCC warnings per function
 * @description The number of GCC warnings per function in each file
 * @kind treemap
 * @metricType file
 * @treemap.warnOn highValues
 * @metricAggregate avg max
 */
import GCC
 
from File f
select f,
    count(GccWarning warning | warning.getFile() = f) /
    count(Function func | func.getFile() = f).maximum(1.0)

There are a few things worth commenting on in the query itself:

  • Since a count aggregate normally returns an integer, we have to be careful to avoid integer division (which would throw away any fractional part of the result). We get around this by using the int.maximum(float) operation (the argument is written as 1.0 rather than just 1 to ensure the result is a floating-point number); this also avoids problems with division-by-zero.
  • The metric here is a normalized value ("warnings per function"), and so it does not make sense to sum the values for different files. This is why we specify @metricAggregate avg max – the default would be to allow avg and sum as aggregations, which is clearly not desirable.

We also need to add the metric to our custom query suite:

+ local-c/GccWarningsMetric.ql: /Metrics/GCC

After reanalyzing, we can browse the results of the new metric like any other.

Troubleshooting

The most common problem encountered during the import of external data is queries that give no results. This usually means that the format of an external data file is subtly wrong, or that some paths fail to match. Occasionally, it can take several iterations of refining the external data and re-running the associated queries before the desired results are achieved.

Normally, you can only import the corrected CSV file into the snapshot database by rebuilding the snapshot from scratch. However, you can reduce the turnaround time using the following steps:

  1. Save the corrected CSV file(s) in ${snapshot}/external/data.
  2. Update the snapshot database to include the corrected file by running the following command in the snapshot directory: odasa updateExternalData.
  3. Force re-evaluation of the queries that rely on the external data.
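
If path mismatches are suspected, a small ad hoc query can help by listing the external rows whose paths do not correspond to any file in the snapshot. The following sketch (the query name is arbitrary) reuses only predicates introduced above:

UnmatchedWarningPaths.ql (sketch)
import default
import external.ExternalArtifact

// List the rows of gcc-warnings.csv whose reported path does not match
// the full name of any file in the snapshot database.
from ExternalData d
where
    d.getDataPath() = "gcc-warnings.csv" and
    not exists(File f | f.getFullName() = d.getField(0))
select d, d.getField(0)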