Introducing the CodeQL libraries for COBOL

Overview

There is an extensive library for analyzing COBOL code. The classes in this library present the data from a CodeQL database in an object-oriented form and provide abstractions and predicates to help you with common analysis tasks.

The library is implemented as a set of QL modules–that is, files with the extension .qll. The module cobol.qll imports most other standard library modules, so you can include the complete library by beginning your query with:

import cobol

The rest of this tutorial briefly summarizes the most important classes and predicates provided by this library, including references to the detailed API documentation where applicable.

Introducing the library

The CodeQL library for COBOL presents information about COBOL source code at different levels:

  • Textual — classes that represent source code as unstructured text files
  • Lexical — classes that represent comments and other tokens of interest
  • Syntactic — classes that represent source code as an abstract syntax tree
  • Name binding — classes that represent data entries and data references
  • Control flow — classes that represent the flow of control during execution
  • Frameworks — classes that represent interactions via CICS and SQL

Note that representations above the textual level (for example the lexical representation or the flow graphs) are only available for COBOL code that does not contain fatal syntax errors. For code with such errors, the only information available is at the textual level, as well as information about the errors themselves.

Textual level

At its most basic level, a COBOL code base can simply be viewed as a collection of files organized into folders.

Files and folders

Files are represented as entities of class File, and folders as entities of class Folder, both of which are subclasses of class Container.

Class Container provides the following member predicates:

  • Container.getParentContainer() returns the parent folder of the file or folder.
  • Container.getAFile() returns a file within the folder.
  • Container.getAFolder() returns a folder nested within the folder.

Note that while getAFile and getAFolder are declared on class Container, they currently only have results for Folders.

Both files and folders have paths, which can be accessed by the predicate Container.getAbsolutePath(). For example, if f represents a file with the path /home/user/project/src/main.cbl, then f.getAbsolutePath() evaluates to the string "/home/user/project/src/main.cbl", while f.getParentContainer().getAbsolutePath() returns "/home/user/project/src".

These paths are absolute file system paths. If you want to obtain the path of a file relative to the source location in the CodeQL database, use Container.getRelativePath() instead. Note, however, that a database may contain files that are not located underneath the source location; for such files, getRelativePath() will not return anything.

The following member predicates of class Container provide more information about the name of a file or folder:

  • Container.getBaseName() returns the base name of a file or folder, not including its parent folder, but including its extension. In the above example, f.getBaseName() would return the string "main.cbl".
  • Container.getStem() is similar to Container.getBaseName(), but it does not include the file extension; so f.getStem() returns "main".
  • Container.getExtension() returns the file extension, not including the dot; so f.getExtension() returns "cbl".

For example, the following query computes, for each folder, the number of COBOL files (that is, files with extension cbl) contained in the folder:

import cobol

from Folder d
select d.getRelativePath(), count(File f | f = d.getAFile() and f.getExtension() = "cbl")

Locations

Most entities in a CodeQL database have an associated source location. Locations are identified by four pieces of information: a file, a start line, a start column, an end line, and an end column. Line and column counts are 1-based (so the first character of a file is at line 1, column 1), and the end position is inclusive.

All entities associated with a source location belong to the class Locatable. The location itself is modeled by the class Location and can be accessed through the member predicate Locatable.getLocation(). The Location class provides the following member predicates:

  • Location.getFile(), Location.getStartLine(), Location.getStartColumn(), Location.getEndLine(), Location.getEndColumn() return detailed information about the location.
  • Location.getNumLines() returns the number of (whole or partial) lines covered by the location.
  • Location.startsBefore(Location) and Location.endsAfter(Location) determine whether one location starts before or ends after another location.
  • Location.contains(Location) indicates whether one location completely contains another location; l1.contains(l2) holds if, and only if, l1.startsBefore(l2) and l1.endsAfter(l2).

Lexical level

At this level we represent comments through the Comment class. We do not currently retain any tokens other than scope terminators (for example END-IF), which are represented by the ScopeTerminator class.

Comments

The class Comment represents the comments that occur in COBOL programs:

The most important member predicates are as follows:

  • Comment.getText() returns the source text of the comment, not including delimiters.
  • Comment.getScope() returns the location of the source code to which the comment is bound.

Scope terminators

The class ScopeTerminator represents the scope terminators that occur in COBOL programs:

The most important member predicates are as follows:

  • ScopeTerminator.getStmt() returns the statement whose scope this terminator is closing.

Syntactic level

The majority of classes in the CodeQL library for COBOL are concerned with representing a COBOL program as a collection of abstract syntax trees (ASTs).

The class ASTNode contains all entities representing nodes in the abstract syntax trees and defines generic tree traversal predicates:

  • ASTNode.getParent(): returns the parent node of this AST node, if any.

Please note that the libraries for COBOL do not currently represent all possible parts of a COBOL program. Due to the complexity of the language, and its many dialects, this is an ongoing task. We prioritize elements that are of interest to queries, and expand this selection over time. Please check the detailed API documentation to see what is currently available.

The main structure of any COBOL program is represented by the Unit class and its subclasses. For example, each program definition has a ProgramDefinition counterpart. For each PROCEDURE DIVISION in the program, there will be a ProcedureDivision class.

All data definitions are made accessible through the DescriptionEntry class and its subclasses. In particular, you can use DataDescriptionEntry to find the typical data entries defined in a WORKING-STORAGE SECTION.

References to data items are modeled through the DataReference class. You can use DataReference.getTarget() to resolve the reference to the matching data item.

Individual statements are represented by the class Stmt and its subclasses. The name of the specific type starts with the statement’s verb. For example, OPEN statements are covered by the class Open. Unknown statement types are covered by the OtherStmt class.

Control flow

You can represent a program in terms of its control flow graph (CFG) using the AstNode.getASuccessor predicate. You can use this predicate to find possible successors to any statement, sentence, or unit in a procedure division.

Parse errors

COBOL code that contains breaking syntax errors cannot usually be analyzed. All that is available in this case is a value of class Error representing the parse error. It provides information about the syntax error location and the error message through predicates Error.getLocation and Error.getMessage respectively.

Frameworks

CICS

Calls to the CICS system through EXEC CICS are represented by the class CICS.

SQL

Calls to the SQL system through EXEC SQL are represented by the class SqlStmt and its subclasses.

What next?