Note: This documentation is for the legacy command-line tool odasa.
The final version was released in May 2020. Support for this tool expires in May 2021.

For documentation on the new-generation CodeQL CLI, see CodeQL CLI.
In particular, you may find the Notes for legacy QL CLI users useful in planning your migration.

The C/C++ security analyses are regularly evaluated against the SAMATE Juliet tests maintained by the US National Institute of Standards and Technology (NIST). This ensures that the quality and discrimination of the results are maintained as the queries are updated, for example, in response to changes in the C++ language, improvements to the CodeQL library, or enhancements to the code extraction process.

Summary of results

The following table summarizes the results for the latest release of the C/C++ security queries run against the SAMATE Juliet 1.3 tests. In the table, each row represents a weakness, and the columns show the following information:

  • TP – count of all true positive results: the code has a known security weakness, and the CodeQL analyses correctly identify this defect.
  • FP – count of all false positive results: the code has no known security weakness, but the CodeQL analyses are overcautious and report a potential problem.
  • TN – count of all true negative results: the code has no known security weakness, and the CodeQL analyses correctly pass the code as secure.
  • FN – count of all false negative results: the code has a known security weakness, but the CodeQL analyses fail to identify this defect.

In an ideal implementation of the analyses, the number of false positives (FP) and false negatives (FN) would both be zero, but in practice this cannot be achieved by static analysis. The figures for FP and FN show where the current implementation has limitations.
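To make these counts concrete, each Juliet test case pairs a flawed ("bad") function with a fixed ("good") variant. The sketch below is loosely modeled on the CWE-190 (integer overflow) tests; it is illustrative, not copied from any Juliet file. A warning on the bad function counts as a TP and a missed warning as an FN; a warning on the good function counts as an FP and its absence as a TN.

    #include <climits>
    #include <cstdlib>
    #include <iostream>

    // Flawed ("bad") variant: the increment can overflow INT_MAX (CWE-190).
    // A warning here is a true positive (TP); silence is a false negative (FN).
    void bad(int data) {
        int result = data + 1;  // potential signed overflow
        std::cout << result << '\n';
    }

    // Fixed ("good") variant: the bound check makes the increment safe.
    // A warning here is a false positive (FP); silence is a true negative (TN).
    void good(int data) {
        if (data < INT_MAX) {
            int result = data + 1;  // cannot overflow
            std::cout << result << '\n';
        }
    }

    int main() {
        // Read the value from the environment so that its range is unknown
        // to the analysis, as in the Juliet tests.
        const char* s = std::getenv("DATA");
        int data = s ? std::atoi(s) : 0;
        bad(data);
        good(data);
    }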

CWE        TP    FP    TN    FN  Count
CWE-114   310     0   576   266    576
CWE-134  1890     0  2880   990   2880
CWE-190  3276   113  3847   684   3960
CWE-191  2370     0  2952   582   2952
CWE-197   495     0   864   369    864
CWE-200    54     0    54     0     54
CWE-242    18     0    18     0     18
CWE-327    54     0    54     0     54
CWE-367    36     0    36     0     36
CWE-416   360     0   459    99    459
CWE-457   180     0   948   768    948
CWE-468    37     0    37     0     37
CWE-665   104     0   193    89    193
CWE-676    18     0    18     0     18
CWE-681    18     0    54    36     54
CWE-772  1719   468  1334    83   1802
CWE-835     3     0     6     3      6

Interpreting the results

The report CAS Static Analysis Tool Study – Methodology, published by the Center for Assured Software of the US National Security Agency, defines four ways to measure success:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F-score = 2 * (Precision * Recall) / (Precision + Recall)
  • Discrimination rate = #discriminated tests / #tests, where a test is discriminated when the analysis reports the flaw in the bad code and does not report it in the corresponding good code.

For each of these metrics, a higher score is better. There is a clear trade-off between precision and recall: tuning an analysis to increase one typically reduces the other. The F-score, as the harmonic mean of precision and recall, quantifies how well an analysis balances the two.
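
As a concrete illustration, the short C++ sketch below (not part of the odasa tooling) computes these metrics from raw counts and reproduces the precision, recall, and F-score for the CWE-190 row of the table that follows. The discrimination rate needs per-test pass/fail data rather than aggregate counts, so it is shown with hypothetical sample data.

    #include <cstdio>
    #include <vector>

    // Metric definitions from the CAS "Static Analysis Tool Study - Methodology".
    struct Counts { double tp, fp, fn; };

    double precision(const Counts& c) { return c.tp / (c.tp + c.fp); }
    double recall(const Counts& c)    { return c.tp / (c.tp + c.fn); }
    double fScore(const Counts& c) {
        double p = precision(c), r = recall(c);
        return 2 * p * r / (p + r);
    }

    // One entry per test: did the analysis flag the bad code, and did it
    // flag the good code? A test is "discriminated" when the flaw is
    // reported in the bad code and not in the good code.
    struct TestResult { bool flaggedBad, flaggedGood; };

    double discriminationRate(const std::vector<TestResult>& tests) {
        int discriminated = 0;
        for (const TestResult& t : tests)
            if (t.flaggedBad && !t.flaggedGood) ++discriminated;
        return static_cast<double>(discriminated) / tests.size();
    }

    int main() {
        // Raw counts for CWE-190, taken from the table above.
        Counts cwe190{3276, 113, 684};
        std::printf("precision = %.0f%%\n", 100 * precision(cwe190)); // 97%
        std::printf("recall    = %.0f%%\n", 100 * recall(cwe190));    // 83%
        std::printf("F-score   = %.0f%%\n", 100 * fScore(cwe190));    // 89%

        // Hypothetical per-test data: 3 of 4 tests discriminated -> 75%.
        std::vector<TestResult> sample{{true, false}, {true, false},
                                       {true, true},  {false, false}};
        std::printf("disc. rate = %.0f%%\n", 100 * discriminationRate(sample));
    }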

The following table shows these metrics calculated from the results above. These scores compare very favorably with the sample tools tested by the Center for Assured Software.

CWE      Precision  F-score  Recall  Disc. Rate
CWE-114       100%      70%     54%         54%
CWE-134       100%      79%     66%         66%
CWE-190        97%      89%     83%         80%
CWE-191       100%      89%     80%         80%
CWE-197       100%      73%     57%         57%
CWE-200       100%     100%    100%        100%
CWE-242       100%     100%    100%        100%
CWE-327       100%     100%    100%        100%
CWE-367       100%     100%    100%        100%
CWE-416       100%      88%     78%         78%
CWE-457       100%      32%     19%         19%
CWE-468       100%     100%    100%        100%
CWE-665       100%      70%     54%         54%
CWE-676       100%     100%    100%        100%
CWE-681       100%      50%     33%         33%
CWE-772        79%      86%     95%         69%
CWE-835       100%      67%     50%         50%

Conclusions

The tests suggest that judicious choices have been made to balance false positive results (an incorrect warning is issued) against false negative results (a true defect is not identified). Where comparative results are available for other tools, the CodeQL analyses stand out for their exceptional accuracy.

 
