How To Set A Benchmark Of False Positives With SAST Tools
Many Static Application Security Testing (SAST) tools struggle with false positives: they report vulnerabilities that, in reality, do not exist. This inaccuracy weighs down the engineering team, which spends productive hours triaging the false alarms.
By setting a benchmark of false positives, a limit above which the false positive rate is unacceptable, you can establish a point of reference or standard against which to measure the efficacy of your SAST tool. It also makes explicit how many false positives you are willing to tolerate from your security analysis tool.
Why benchmark false positives?
Performing application security testing is an important way to identify flaws that attackers could use to compromise the application. If a security tool can properly identify vulnerabilities, developers can fix them and thus improve the security of their applications.
SAST tools produce results that are usually grouped into four categories:
- True Positive—correctly identifying that a vulnerability exists.
- True Negative—correctly identifying that a vulnerability does not exist.
- False Positive—erroneously reporting a vulnerability that does not actually exist.
- False Negative—failing to report a vulnerability that is actually present.
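For readers who prefer code, these four categories map onto a simple truth table. The sketch below is purely illustrative; the function and parameter names are not part of any SAST tool's API.

```python
def classify_finding(tool_flagged_it: bool, actually_vulnerable: bool) -> str:
    """Map one result onto the four categories described above (illustrative only)."""
    if tool_flagged_it and actually_vulnerable:
        return "true positive"    # correctly identified a real vulnerability
    if not tool_flagged_it and not actually_vulnerable:
        return "true negative"    # correctly stayed silent on safe code
    if tool_flagged_it and not actually_vulnerable:
        return "false positive"   # flagged a vulnerability that does not exist
    return "false negative"       # missed a vulnerability that is present

print(classify_finding(True, False))  # "false positive"
```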
The objective of any SAST tool should be to maximize the number of true positives and true negatives while minimizing the false negatives and false positives. However, this is difficult to accomplish from an engineering perspective.
Thus, the engineer must make a design decision: should the SAST tool generate more true positives at the cost of also generating more false positives? Or should it be dialed down, generating fewer false positives at the expense of missing some true positives?
The most common decision is to lean towards the former design because this is considered to be a “winning” strategy in a sales situation, when a customer tests one product against another. The engineer is betting that the customer is going to simplistically choose the product that produces the most “results” without bothering to examine the validity of those results.
In reality, this might win the sale, but it hurts the customer in the long run. Why? Because a security product that produces too many false positives will overwhelm your developers and make them avoid using the tool, or at least avoid paying much attention to the results the tool produces. And when that happens, your security program suffers. Applications containing vulnerabilities are deployed to production.
To solve this problem, it's important to set a benchmark of false positives with SAST tools. This is an agreed level of false positives that your organization considers acceptable, so that teams avoid wasting time hunting vulnerabilities that do not actually exist.
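One way to make such a benchmark operational is to track what share of triaged findings turn out to be false positives and fail a pipeline step when that share exceeds the agreed limit. The following is a minimal sketch under that assumption; the 10% threshold, the findings structure, and the triage labels are all illustrative, not part of any particular SAST tool's output.

```python
import sys

FALSE_POSITIVE_BENCHMARK = 0.10  # agreed upper limit: 10% of triaged findings

def false_positive_share(triaged_findings):
    """Share of triaged findings that were dismissed as false positives."""
    if not triaged_findings:
        return 0.0
    dismissed = sum(1 for f in triaged_findings if f["triage"] == "false_positive")
    return dismissed / len(triaged_findings)

if __name__ == "__main__":
    # Hypothetical triage results exported from your SAST workflow.
    findings = [
        {"id": "SAST-101", "triage": "confirmed"},
        {"id": "SAST-102", "triage": "false_positive"},
        {"id": "SAST-103", "triage": "confirmed"},
    ]
    share = false_positive_share(findings)
    print(f"False positive share: {share:.0%} (benchmark: {FALSE_POSITIVE_BENCHMARK:.0%})")
    if share > FALSE_POSITIVE_BENCHMARK:
        sys.exit("False positive benchmark exceeded: review rules or tool configuration.")
```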
How to measure the success of SAST tools
A simple way to measure the success of a SAST tool is to subtract its false positive rate from its true positive rate. If you get a perfect accuracy score of 100%, it implies that the true positive rate for the SAST tool is 100%, and the false positive rate is 0%.
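In other words, accuracy score = true positive rate minus false positive rate. A minimal sketch of that calculation, assuming both rates are already known as fractions between 0.0 and 1.0:

```python
def accuracy_score(true_positive_rate: float, false_positive_rate: float) -> float:
    """Accuracy score as described above: TPR minus FPR, both given as fractions 0.0-1.0."""
    return true_positive_rate - false_positive_rate

# A perfect tool finds every real vulnerability and raises no false alarms.
print(f"{accuracy_score(1.0, 0.0):.0%}")  # 100%
```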
Let’s say scanning the vulnerabilities in an application with three different SAST tools generates the following results:
- Tool #1 does nothing. It does not detect any vulnerabilities and produces no false alarms. Its true positive rate is 0%, and false positive rate is also 0%. So, its accuracy score is 0%, which is worthless.
- Tool #2 reports that every line of code in the application has a vulnerability. Its true positive rate is 100% because it flags every real vulnerability. However, it also flags a large number of harmless, defunct, or unimportant findings, so its false positive rate is high. For example, if the tool reports a total of 1,000 vulnerabilities but 800 of them pose no threat, its false positive rate is 80% and its accuracy score is only 20%, which adds little value to the testing process.
- Tool #3 simply flips a coin to determine if a line of code is vulnerable or not. It randomly guesses, which results in an equal true positive rate and false positive rate of 50% each. Its accuracy score would also be 0%.
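Plugging the three hypothetical tools into that calculation makes the comparison explicit:

```python
# Accuracy score (TPR minus FPR) for each of the three hypothetical tools above.
tools = {
    "Tool #1 (does nothing)":     (0.0, 0.0),
    "Tool #2 (flags everything)": (1.0, 0.8),
    "Tool #3 (coin flip)":        (0.5, 0.5),
}
for name, (tpr, fpr) in tools.items():
    print(f"{name}: accuracy score = {tpr - fpr:.0%}")
# Tool #1: 0%, Tool #2: 20%, Tool #3: 0%
```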
The OWASP Foundation has established a free and open source Benchmark Project that assesses the speed, coverage, and accuracy of automated software vulnerability identification tools.
The Benchmark is a sample application seeded with thousands of test cases, some containing real, exploitable vulnerabilities and some deliberately constructed to look vulnerable when they are not. You can run a SAST tool against it and score the tool's results.
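The Benchmark project provides its own scoring utilities, but conceptually the scoring boils down to comparing a tool's findings against the known answer key of test cases. The sketch below is a simplified, hypothetical illustration of that comparison; the file names and CSV columns are assumptions, not the Benchmark's actual format.

```python
# Simplified, hypothetical illustration of Benchmark-style scoring: compare a
# tool's findings against an answer key of test cases. The file names and CSV
# columns are assumptions, not the OWASP Benchmark's actual format.
import csv

def score(expected_csv: str, reported_csv: str) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) as fractions 0.0-1.0."""
    with open(expected_csv, newline="") as f:
        # Answer key: test case name -> whether it is really vulnerable.
        expected = {row["test_case"]: row["vulnerable"] == "true"
                    for row in csv.DictReader(f)}
    with open(reported_csv, newline="") as f:
        # Test cases the tool flagged as vulnerable.
        reported = {row["test_case"] for row in csv.DictReader(f)}

    real = [name for name, vulnerable in expected.items() if vulnerable]
    safe = [name for name, vulnerable in expected.items() if not vulnerable]
    tpr = sum(name in reported for name in real) / len(real)
    fpr = sum(name in reported for name in safe) / len(safe)
    return tpr, fpr

# tpr, fpr = score("expected_results.csv", "tool_findings.csv")
# print(f"Accuracy score: {tpr - fpr:.0%}")
```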
On the Benchmark's results chart, which plots true positive rate against false positive rate, the best results for a security tool sit in the upper left corner, indicating maximum true positives and minimal false positives.
Going deeper with the benchmark
We mentioned above that a simple way to measure the success of a SAST tool is to subtract its false positive rate from its true positive rate. But this measure by itself is not adequate because it does not look at other important factors.
Take the example of these two different SAST tools, each of which has been scored against the OWASP project:
- SAST tool #1 identified 10 true positives and 3 false negatives, yielding an accuracy score of 70%.
- SAST tool #2 identified 100 true positives and 30 false negatives, yielding an accuracy score of 70%.
Although the simplistically derived accuracy scores for these two tools are the same, they describe very different tools, and you may strongly prefer one over the other. So we need to introduce some additional metrics, as follows:
- Completeness — Completeness measures the number of real vulnerabilities detected (true positives) compared to the total number of real vulnerabilities present (see the sketch after this list).
A higher completeness score (theoretical maximum is 1) indicates that a SAST tool identifies more of the existing issues in the application. A complete tool offers you better visibility into your code. This could lead to identifying more vulnerabilities and shipping more secure products.
- Depth — In SAST testing, depth refers to the ability of a tool to detect a wide range of vulnerabilities. A tool that supports a large number of programming languages and has a comprehensive database of vulnerabilities sourced from multiple, up-to-date outlets has the necessary depth to help you find security vulnerabilities.
If a tool's depth of coverage is limited, it is likely to generate far more false positives and produce inconsistent results across different vulnerability classes and languages. So, you should consider this factor when setting a benchmark of false positives with your SAST tool.
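As a concrete illustration of the completeness metric described in the list above, here is a minimal sketch; the counts used in the example are hypothetical.

```python
def completeness(true_positives: int, false_negatives: int) -> float:
    """Real vulnerabilities found divided by all real vulnerabilities present."""
    total_real = true_positives + false_negatives
    return true_positives / total_real if total_real else 0.0

# Hypothetical counts: a tool that finds 100 real issues and misses 30 others.
print(f"Completeness: {completeness(100, 30):.2f}")  # 0.77
```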
Conclusion
Setting appropriate benchmarks for your application testing program needs to be done collaboratively, because different teams have different goals. The security team naturally wants every application to introduce the lowest possible security risk, which means they want security tools that score very high on the completeness scale regardless of the number of false positives they produce. The development team has almost opposite goals. They want to spend their time developing new features, and they don’t want to be slowed down by unproductive work such as dealing with false positives.
Furthermore, your benchmarks might also differ depending on which application you are testing. Some applications might be higher value than others, or more exposed to attack. For these sensitive applications, you might accept a higher rate of false positives in order to obtain higher completeness.
And all of these accuracy scores are just one dimension of a SAST tool. Other important dimensions include how fast the tool runs, how conveniently the results can be consumed by developers, and how easily the tool can be deployed and automated as part of your workflow.
Based on all of these considerations, Mend SAST has proven to be an extremely effective and efficient tool for modern organizations that are striving for both speed and security. If you aren't yet familiar with Mend SAST, check it out!