Semmle 1.19

This topic describes the use of "worker nodes" for Team Insight data collection and analysis.

Overview

Typically, each instance of Team Insight executes build, analysis and attribution jobs on multiple servers, each of which operates as a "worker node" in a distributed system. The master server usually installs the Semmle Core software on each worker node and then starts one or more worker daemon processes. Once started, the worker processes connect to the master server over HTTPS and fetch details of a data collection job to perform. When a worker process completes a job, it sends the analysis data back to the master server and picks up details of the next job to work on.
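The fetch, process, and report cycle described above can be sketched as a small simulation. This is only an illustrative model: the Master class, its fetch_job and submit methods, and the job names are hypothetical stand-ins, not part of any Semmle API, and the in-memory queue takes the place of the real HTTPS connection.

```python
from collections import deque


class Master:
    """In-memory stand-in for the master server (hypothetical, not a Semmle API)."""

    def __init__(self, jobs):
        self.pending = deque(jobs)  # jobs waiting to be picked up
        self.results = {}           # job -> analysis data sent back by workers

    def fetch_job(self):
        # Hand out the next job, or None when nothing is left to do.
        return self.pending.popleft() if self.pending else None

    def submit(self, job, data):
        # Record the analysis data a worker sends back for a job.
        self.results[job] = data


def run_worker(name, master):
    """Fetch a job, process it, send the data back, and repeat until no jobs remain."""
    completed = []
    while True:
        job = master.fetch_job()
        if job is None:
            break
        data = f"analysis of {job} by {name}"  # placeholder for the real build and analysis
        master.submit(job, data)
        completed.append(job)
    return completed


master = Master(["project-a@rev1", "project-a@rev2", "project-b@rev1"])
done = run_worker("worker-1", master)
print(done)
```

In a real deployment several worker processes would run this loop concurrently against the same master server, each picking up whichever job is next.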

The workers access source code from the remote source code repository either directly or indirectly (through the master server), depending on the type of version control system being used. For Git and Mercurial repositories, the master clones the repository and the worker processes fetch a copy of this clone, from which they check out specific revisions for analysis without having to pull data from the remote repository. For other repository types, the workers each connect directly to the version control system on the remote repository server.

Any required build dependencies (for example, libraries and compilers) must be installed on the worker nodes. For a full description of the infrastructure and requirements for Team Insight, see Infrastructure requirements for Team Insight.

Types of data collection setup

There are three ways to set up Team Insight data collection:

  • Local setup—This provides a quick way to set up data collection on a single machine (that is, locally on the master server). This setup is typically only used for testing configurations or for familiarizing yourself with Team Insight before deploying it for full-scale analysis.

    For step-by-step instructions on how to configure a local setup, see Configuring a local setup of data collection.

  • Static setup—This allows worker nodes to be defined centrally on the master server and for the setup and starting of worker processes to be done automatically. This setup requires SSH to be installed on the master server and each worker node, so it may not be appropriate in some situations.

    For step-by-step instructions on how to configure a static setup, see Configuring a static setup of data collection.

  • Dynamic setup—With this setup, workers can be added and removed on an ad hoc basis. Setup and starting of the worker processes is not handled by the master server, so there is no requirement for SSH. Instead, setup and starting of the worker processes must be done individually on each worker node.

    For step-by-step instructions on how to configure a dynamic setup, see Configuring a dynamic setup of data collection.

Defining servers as worker nodes (static and local setups)

In a static setup of data collection, you define one or more worker nodes in a workers.xml file. This centralizes the maintenance of workers and reduces the manual work of setting up and starting the individual worker processes on each worker node.

For a local setup you can either define the local workers in the workers.xml file, or you can use the --local-workers flag of the attribution command to create and start local worker processes using default settings. For more details, see Configuring a local setup of data collection.

Setting up and starting remote worker processes from the master server, using a workers.xml configuration file, requires SSH and rsync to be installed on the master server and on the worker nodes referenced in the workers.xml file. If this is not possible, you can use a dynamic setup, which does not use a workers.xml file (see the following section).

To configure a static setup, in which worker nodes for Team Insight are defined explicitly, create a file called workers.xml in the team-insight/<instance> directory alongside the team-insight configuration file (for example, SEMMLE_HOME/team-insight/TI-instance/workers.xml). This file specifies the names of the servers that will be used as worker nodes, along with the connection details that allow the master server to connect to each server over SSH for the initial setup operation. After this initial automated setup operation, the worker processes connect to the master server over HTTPS. See workers.xml for details of the format of this file. For step-by-step setup instructions, see Configuring a static setup of data collection.

Configuring workers (dynamic setup)

With a dynamic setup, you must set up a credentials store on each worker node to allow the worker processes to communicate with the master server over HTTPS. A credentials store is created for you when you run the attribution tool on the master server. You can use this credentials store by copying it to each of the worker nodes.

You also need to create a workspace location for each worker process and copy the worker-daemon.jar file from the Semmle distribution to the worker node.

Finally, you need to start each worker process on each worker node.

For step-by-step setup instructions, see Configuring a dynamic setup of data collection.

Installing dependencies

The worker nodes need to be able to build the projects with which they are associated. Therefore you must make sure that all of the build dependencies of those projects (libraries, build tools, compilers, etc.) are installed on the worker nodes. 

If a project has particular build requirements that are not available on all of the worker nodes—and which you cannot, or do not want to, install on all of the workers—you can label the project in the team-insight file to ensure that data collection for that project is only handled by specific worker nodes. For details of how to set this up, see "Using labels to ensure a project is built in the correct environment" below.

Installing Semmle Core on the worker nodes

Each worker process needs to use Semmle Core to perform Semmle analysis. Semmle Core is automatically installed on the worker nodes the first time a worker process connects to the master server to fetch a job for processing (or the first time after Semmle Core has been modified on the master server—for example, after a new release has been installed on the master server).

By default, the distribution of Semmle Core used by the master server is installed on the worker nodes: one copy for each worker process. However, when the master server and worker nodes use different operating systems, you need to specify an appropriate version of Semmle Core for the worker nodes to use, and ensure that the correct distribution is stored at the specified location on the master server. For example, the master server may be running Linux but some or all of the projects you want to analyze may need to be built on Windows to satisfy their build dependencies.

Different versions of Semmle Core are available for Linux, Windows and OS X operating systems. If the worker nodes run a different operating system from the master server, you must edit the team-insight configuration file and specify the correct Semmle Core distribution for the worker nodes using the workspace element within either the defaults element (to specify a distribution to be used by all workers) or within the project element (for specific projects).

The following example, from a Linux or OS X master server, specifies a Windows distribution to be downloaded by worker nodes that analyze this project:

Example: project element in a team-insight file
<project ...>
   <workspace>/opt/semmle/distributions/release-n-n/odasa-windows64</workspace>
   ...
</project>

To check that the directory you have specified does contain a Semmle distribution, make sure the directory includes the Semmle environment setup script (setup.sh or setup.bat) and a tools subdirectory.
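The check described above can be scripted. The following is a minimal sketch that tests only the two markers named in this topic (a setup.sh or setup.bat script and a tools subdirectory). The function name is hypothetical, and passing this check does not prove that the distribution is complete or built for the right operating system.

```python
from pathlib import Path


def looks_like_semmle_distribution(directory):
    """Return True if the directory has a setup script and a tools subdirectory."""
    root = Path(directory)
    # Either environment setup script counts: setup.sh or setup.bat.
    has_setup_script = (root / "setup.sh").is_file() or (root / "setup.bat").is_file()
    has_tools_dir = (root / "tools").is_dir()
    return has_setup_script and has_tools_dir
```

For example, you could call looks_like_semmle_distribution with the path given in the workspace element before starting data collection.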

Using labels to ensure a project is built in the correct environment

Labels allow you to specify that certain projects should only be given to certain workers for data collection. 

Let's imagine that I want to analyze a mixture of projects: some can be built on Linux, while others can only be built on Windows. My Semmle master server is running Linux. I have therefore used the workspace element in the team-insight file to specify the projects that should be analyzed using the Windows 64-bit Semmle Core distribution, rather than the distribution used by my master server. Most of my worker nodes are running Linux, but I have one Windows server for running the worker processes that will analyze the Windows-based projects. In this situation I need to ensure that the Linux worker processes don't attempt to analyze the Windows-based projects, and that the Windows worker processes don't attempt to analyze the Linux-based projects. I can do this by using labels.

The worker processes automatically detect whether they are running on a Windows or a UNIX platform and take on the label windows or unix as appropriate. So all I need to do, in this case, is to apply the label windows to the projects that must be run on Windows. This is done in the team-insight file by adding a required-labels element as a child of the appropriate project elements:

Excerpt from a team-insight file
<team-insight>
   ...
   <project ...> 
      <workspace>/opt/semmle/distributions/release-n-n/odasa-windows64</workspace>
      <required-labels>windows</required-labels>
      ...
   </project>
</team-insight> 

In the example above, this project will only be processed by workers that have the "windows" label, which worker processes running on Windows are given by default.
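The effect of a required label can be modeled in a few lines. This is a sketch of the matching idea only, assuming that a worker is eligible for a project when it holds every label the project requires. The function and the worker names are hypothetical and do not describe how Team Insight implements job assignment.

```python
def eligible_workers(required_labels, workers):
    """Return the sorted names of workers that hold every required label.

    workers maps a worker name to the set of labels that worker carries.
    """
    required = set(required_labels)
    return sorted(name for name, labels in workers.items() if required <= set(labels))


# Hypothetical worker pool: each worker carries its automatically detected platform label.
workers = {
    "linux-1": {"unix"},
    "linux-2": {"unix"},
    "win-1": {"windows"},
}

print(eligible_workers({"windows"}, workers))  # ['win-1']
print(eligible_workers({"unix"}, workers))     # ['linux-1', 'linux-2']
```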

To find out more about labeling and how it can be used to ensure specific build requirements are available, see Labeling projects for specific build dependencies.

Java requirement

The worker-daemon.jar program that runs on the worker nodes requires Java 8. 

For a static setup of data collection, the master server checks for the availability of Java 8 during the initial setup of each worker node. If Java 8 is not found, and the master and worker are using the same operating system, the JRE used by the master server is copied to the worker node and used to run worker-daemon.jar.

If the master server and the worker node use different operating systems then you must make sure that Java 8 is installed on the worker node. 

Similarly, in a dynamic setup of data collection you must make sure that Java 8 is installed on every worker node.
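Where you need to confirm the Java version on a worker node yourself, a small helper can read the version from the output of java -version. The sketch below only parses a version string. The function name is hypothetical, and it relies on the general Java convention that Java 8 and earlier report versions of the form 1.x (for example, 1.8.0_181), while later releases report the major version first.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output, or None if absent."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Java 8 and earlier report "1.x" (for example "1.8.0_181");
    # Java 9 and later report the major version first (for example "11.0.2").
    if first == 1 and second is not None:
        return int(second)
    return first


print(java_major_version('java version "1.8.0_181"'))  # 8
```

To run this against an installed JVM, capture the output with subprocess.run(["java", "-version"], capture_output=True, text=True) and pass its stderr attribute to the function, because java -version writes to standard error.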

Disk space requirements

For each worker process that you intend to run on a worker node, the node typically needs a minimum of disk space equivalent to about 10 times the size of the largest project configured for analysis, plus 6 GB. Multiply this figure by the number of worker processes to get the minimum for the node.
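As a worked example of the guideline above, the following sketch reads it as (10 × largest project size + 6 GB) × number of worker processes. That grouping is an interpretation of the guideline, and the function name is hypothetical. Treat the result as a rough lower bound rather than an exact requirement.

```python
def min_worker_disk_gb(largest_project_gb, worker_processes):
    """Estimate the minimum disk space, in GB, for one worker node."""
    per_process_gb = 10 * largest_project_gb + 6  # about 10x the largest project, plus 6 GB
    return per_process_gb * worker_processes


# A node running 4 worker processes, where the largest project is 2 GB:
print(min_worker_disk_gb(2, 4))  # (10 * 2 + 6) * 4 = 104
```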

Tutorial

The topic Setting up the master server is the starting point for a three-stage tutorial that guides you through the process of:

  • Setting up a master server
  • Setting up worker processes (either on a worker node or locally on the master server)
  • Running data collection and analysis