Next-Gen Vulnerability Assessment: AWS Bedrock Claude in CVE Data Classification

Large language models are fascinating tools for cybersecurity. They can analyze large quantities of text and are excellent for data extraction. One application is researching and analyzing vulnerability data, specifically Common Vulnerabilities and Exposures (CVE) information. As an application security company with roots in open source software vulnerability detection and remediation, the research team at Mend.io found this a particularly relevant area of exploration.

Introduction

In recent years, we’ve seen a significant shift in how we approach and handle cybersecurity threats. With the emergence of advanced AI and LLMs, a new space has opened for us to explore and enhance our understanding of vulnerability data. But here’s the catch: this data isn’t always ready to use off the shelf. It often requires some elbow grease, manual cleanup, and alignment to make it valuable. That’s where the power of LLMs comes into play.

You see, CVE data is predominantly text-based. Its descriptions, reports, and details are all written so that humans can read and understand them. But when dealing with thousands of records, reading through each one isn’t just impractical; it’s impossible. That’s the beauty of using LLMs in this context. These models are not just good at understanding and generating text; they’re fantastic at sifting through vast amounts of it to find relevant details, patterns, and insights.

Best of all, you don’t need to be an AI expert to understand how this works. Whether you’re a seasoned cybersecurity professional, a budding data scientist, or just someone with a keen interest in the field, the advances in LLMs and their application in CVE data analysis are something to be excited about. So, let’s dive in and explore how these technological marvels are changing the game in cybersecurity research and what that means for the future of digital safety.

Overview of CVE data classification

CVE Data Classification is a process in cybersecurity where Common Vulnerabilities and Exposures (CVEs) are categorized and analyzed for better understanding, management, and mitigation.

Each CVE entry contains a unique identifier, a standard description, and a reference to publicly known information about the vulnerability.

The need for classification

Imagine a library with thousands of books randomly scattered around. Finding a specific book in this chaos would be a nightmare.

Similarly, with thousands of CVEs reported each year, the cybersecurity community needs a systematic way to sort through them. Classification helps by organizing these vulnerabilities into categories based on severity, affected software, and attack type.

When dealing with CVEs, it’s essential to recognize that the initial reports of these vulnerabilities are often not curated or well-written. The contributors of these reports range from developers and system administrators to end-users, many of whom are not security experts or professional writers. Their primary goal is to flag an issue, not necessarily to provide a polished, comprehensive analysis. This results in significant variation in reports, some terse and cryptic, others verbose but lacking clarity or structure.

Understanding the complexity of CVE reports

This diversity and the often unrefined nature of CVE reports present a challenge in extracting critical information. These complex narratives can bury crucial details, such as the type of vulnerability, the affected software or hardware, potential impact, attack requirements, and suggested mitigation steps. Navigating this information maze requires a keen eye and a deep understanding of what to look for.

The trouble with unstructured data

The inconsistency and sometimes poor quality of these reports make it difficult for automated systems to parse and understand the text accurately. When the input data is unclear, incomplete, or inconsistent, the risk of misunderstandings or oversights increases, which can have severe implications.

Human intervention often becomes necessary to review, interpret, and refine raw vulnerability reports. Security experts are crucial in translating these initial submissions into actionable insights. However, LLMs can also be invaluable in this context. With their advanced natural language capabilities, they can sift through the text to identify and extract key information from poorly written reports. Let’s take a look. 

The challenge

Out of nearly 70,000 selected vulnerabilities, our mission was to isolate and identify those whose descriptions contain specific Attack Requirements details. Before we dive into the task itself, it is good to understand what Attack Requirements are in the context of CVEs.

Understanding CVE attack requirements

Attack Requirements (AT) is a metric introduced in CVSS v4.0 for scoring Common Vulnerabilities and Exposures (CVEs). It captures the prerequisite conditions necessary for an exploit to be successful. These conditions could include specific system configurations, user actions, or any other state the vulnerable component must be in for the attack to occur. Understanding these requirements is vital because it helps security teams assess risk and develop mitigation mechanisms. It’s about knowing not just how an attack happens, but what specific circumstances must exist for it to occur.

To better understand the AT metric, let’s look at the well-known Spring4Shell (CVE-2022-22965) description:

“A Spring MVC or Spring WebFlux application running on JDK 9+ may be vulnerable to remote code execution (RCE) via data binding. The specific exploit requires the application to run on Tomcat as a WAR deployment. If the application is deployed as a Spring Boot executable jar, i.e. the default, it is not vulnerable to the exploit. However, the nature of the vulnerability is more general, and there may be other ways to exploit it.” 

We can see that specific conditions must be met for an environment using that package to be exploitable: the application has to run on JDK 9+ and be deployed on Tomcat as a WAR. In that case, the AT metric is set to ‘Present’.
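
For orientation, AT appears directly in the CVSS v4.0 vector string. The vector below is a purely illustrative example (not an official score for this CVE), showing where the metric sits:

CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:N/VC:H/VI:H/VA:H/SC:N/SI:N/SA:N

Here, AT:P marks Attack Requirements as Present, while a vulnerability with no such prerequisites would carry AT:N.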

Variability in CVE descriptions

CVE descriptions are the textual narratives that come with each reported vulnerability. They average around 43 words but display a significant range in length and detail. Some are concise, offering just a glimpse of the issue, while others provide an in-depth analysis. These descriptions are contributed by a diverse global community, leading to a wide spectrum of quality and clarity. This variability adds a layer of complexity to the task of accurately identifying and extracting attack requirement details.

Below, you can find some basic statistics about the descriptions of the CVEs we have analyzed:

Metric                  Value
Total words             2,936,951
Unique words            170,216
Average words           42.85
Stdev in words          30.76
P95 in words            94
P99 in words            164
Total characters        20,051,586
Stdev in characters     206.7
P95 in characters       642
P99 in characters       1,143

The statistical analysis of those CVE descriptions, showing high variability and diverse content complexity, strongly supports using an LLM for data classification. The significant range in description lengths and the depth of content underscore the challenge of manual analysis and the necessity for sophisticated solutions.
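
As a rough illustration of how such statistics can be gathered, here is a minimal Ruby sketch. The descriptions array is placeholder data standing in for however you load your CVE corpus, and the percentile calculation is deliberately simplistic:

# Word-count statistics over an array of CVE description strings.
descriptions = ["A buffer overflow in ExampleApp 1.2 allows remote attackers to execute code."] # placeholder data

word_counts = descriptions.map { |d| d.split.size }
mean  = word_counts.sum.to_f / word_counts.size
stdev = Math.sqrt(word_counts.map { |w| (w - mean)**2 }.sum / word_counts.size)
p95   = word_counts.sort[(0.95 * (word_counts.size - 1)).round]

puts "Total words:   #{word_counts.sum}"
puts "Average words: #{mean.round(2)}"
puts "Stdev:         #{stdev.round(2)}"
puts "P95 (words):   #{p95}"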

Our goal and tolerance for errors

We aimed to filter through these 70,000 vulnerabilities and accurately identify those with specific details about Attack Requirements. In pursuing this goal, we recognized that there would be false positives: instances where the system incorrectly flags a CVE as containing Attack Requirements details when it does not. While not ideal, these false positives are tolerable to an extent; they ensure we cast a wide enough net to capture all relevant data. However, what we strove to avoid at all costs were false negatives. These occur when a CVE contains details of attack requirements, but the system fails to identify them. Missing these crucial details is not an option, as it could potentially leave systems vulnerable to attacks that could have been prevented.

Model selection: Claude v2.1 vs. GPT-4

When selecting a suitable LLM for classifying vulnerabilities, we faced a critical decision: choosing between Claude v2.1 and GPT-4. This choice wasn’t just about picking a tool but about aligning our objectives with these advanced technologies’ capabilities, integration, and support systems.

Outcome quality. Our initial experiment involved analyzing a sample of 100 vulnerabilities, focusing specifically on understanding attack vectors. After fine-tuning the prompts for both models, we reached similar outcomes from each. It’s worth noting that our team was initially unfamiliar with Claude, and a learning curve was involved.

False positives and suitability. Both models required prompt tuning to reduce false positives, but Claude v2.1 emerged as the better-suited option for our specific needs. The tag-based model of Claude, which allows for the recognition of XML tags within prompts, provided a structured way to organize and refine our queries. This significantly enhanced our ability to delineate different subsections of a prompt, leading to more precise and valuable outcomes.

Claude prompt with XML tags:

<Instructions>
    You are designed to help analyze CVE entries for calculating a CVSS score. Users should share a link to a CVE entry in NVD; you should check the link and all the attached references and provide a CVSS 4.0 score and vector.
</Instructions>
<AnalysisProcess>
    <StepOne>
        <Title>Analyze CVE Description</Title>
        <Description>Read the description, and try to extract all relevant vulnerable components and understand the vulnerability impact.</Description>
    </StepOne>
    <StepTwo>
        <Title>Examine attached references</Title>
        <Description>Analyze all the references attached to the CVE entry, and try to find more relevant information that will help you assess the CVSS vector and score.</Description>
    </StepTwo>
</AnalysisProcess>
<Conclusion>
    At the conclusion of these steps, you should provide an estimation of the CVSS score and vector of the supplied CVE.
</Conclusion>
<DataHandling>
    Refer to uploaded documents as 'knowledge source'. Adhere strictly to facts provided, avoiding speculation. Favor documented information before using baseline knowledge or external sources. If no answer is found within the documents, state this explicitly.
</DataHandling>

GPT prompt:

You are designed to help analyze CVE entries for calculating a CVSS score. Users should share a link to a CVE entry in NVD; you should check the link and all the attached references and provide a CVSS 4.0 score and vector.

Upon receiving the information, you should perform a structured analysis comprising two critical steps:

1. **Analyze CVE Description:** Read the description, and try to extract all relevant vulnerable components and understand the vulnerability impact.

2. **Examine attached references:** Analyze all the references attached to the CVE entry, and try to find more relevant information that will help you assess the CVSS vector and score.

At the conclusion of these steps, you should provide an estimation of the CVSS score and vector of the supplied CVE.

You have files uploaded as knowledge to pull from. Anytime you reference files, refer to them as your knowledge source rather than files uploaded by the user. You should adhere to the facts in the provided materials. Avoid speculations or information not contained in the documents. Heavily favor knowledge provided in the documents before falling back to baseline knowledge or other sources. If searching the documents didn’t yield any answer, just say that.

Data privacy and security. While data privacy and security are always critical, their importance multiplies manifold when dealing with customer data, making AWS’s proven security infrastructure a significant factor in our decision-making process.

Support and integration. Throughout the initial phases of our research, the support we received from AWS was instrumental. Not only did it improve our learning and adaptation process, but it also enhanced our overall experience with Claude v2.1. Furthermore, our existing infrastructure heavily relies on AWS services, making the integration of Claude via Bedrock a logical step.

Programmatic access

For my analysis, I utilized the Ruby AWS Bedrock SDK, which offers a straightforward and user-friendly interface. Beyond the initial credentials setup, the primary step is the #invoke_model method, which lets you execute your prompt and gather the results:

require 'aws-sdk-bedrockruntime'

client = Aws::BedrockRuntime::Client.new(
  region: 'us-east-1',
  credentials: Aws::Credentials.new(
    '', # access key ID
    '', # secret access key
    ''  # session token (optional)
  )
)

# Claude v2.x on Bedrock expects the "\n\nHuman: ... \n\nAssistant:" prompt format.
resp = client.invoke_model(
  body: {
    prompt: "\n\nHuman: Who are you?\n\nAssistant:",
    max_tokens_to_sample: 100,
    temperature: 0.5,
    top_k: 250,
    top_p: 0.999,
    stop_sequences: ["\n\nHuman:"],
    anthropic_version: "bedrock-2023-05-31"
  }.to_json,
  model_id: "anthropic.claude-v2:1",
  content_type: "application/json",
  accept: "*/*"
)

response = resp.body.read
puts response
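
The body comes back as JSON. Assuming the Claude v2.x text-completion response shape on Bedrock, the generated text is returned in a completion field, so a minimal way to read it looks like this:

require 'json'

parsed = JSON.parse(response)
# For Claude v2.x text completions, the generated text is in "completion";
# "stop_reason" indicates why generation ended.
puts parsed["completion"]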

Crafting the prompt

The prompt we used with Claude v2.1 for analyzing CVE descriptions, specifically focusing on “Attack Requirements” (AT) as introduced in CVSS v4.0, is a crucial piece of our intellectual property. Due to its detailed and customized nature, tailored to our specific system needs, we’ve chosen not to disclose its content at this time.

What we can say is that the prompt was crafted with structured XML tags and rich contextual information. It differentiated between “Attack Complexity” (AC) and AT, and included examples and hypothetical scenarios to enhance the model’s understanding of attack prerequisites. This careful design ensured the model’s analysis was precise and grounded in practical applications, making the prompt a crucial asset in our CVE data classification efforts.

Regarding differences between prompts designed for Claude and those intended for GPT models, the primary one lies in the structural and contextual specificity required by Claude, especially when utilizing AWS Bedrock’s capabilities. Claude prompts often necessitate a more detailed and structured format, leveraging XML tags to guide the model’s focus and improve response accuracy. As highlighted in the preceding section, AWS played an important role in assisting us with crafting and fine-tuning this prompt.
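
For readers unfamiliar with this style of prompting, the skeleton below is a purely generic illustration of an XML-structured classification prompt. It is not the prompt we used; the tag names, the wording, and the {{description}} placeholder are invented for this example:

<Instructions>
    You classify a single CVE description. Answer only YES or NO inside an <Answer> tag:
    YES if the description states prerequisite conditions required for a successful attack, NO otherwise.
</Instructions>
<Definitions>
    <AttackRequirements>Deployment or configuration conditions that must already exist for the exploit to succeed.</AttackRequirements>
    <AttackComplexity>Effort the attacker must invest to overcome built-in defenses; not to be confused with Attack Requirements.</AttackComplexity>
</Definitions>
<Input>
    <CVEDescription>{{description}}</CVEDescription>
</Input>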

Challenges encountered

We encountered a few notable challenges during this project. Below, you can find some details on each of the obstacles we had to deal with.

Quota limitations. When attempting to process approximately 70,000 vulnerabilities, each requiring between 2 and 10 seconds to complete, it became clear that the job would take around four days. The initial strategy to expedite this process involved executing requests in parallel to save time. However, while AWS publishes its model quota limits (https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html), providing a theoretical understanding of capacity, the practical implications of these limits become apparent only through direct application. Our parallel processing approach quickly exhausted our quota allocation, resulting in AWS quota exceeded errors. This real-world experience underscored the difficulty of estimating the impact of quota limits, demonstrating that even a single-threaded approach, with cautious pacing, would occasionally hit the daily limits imposed by AWS, leading to unexpected halts in our data processing efforts. Despite these challenges, the situation was manageable in this particular case. However, it raises critical considerations for high-scale usage of Bedrock in similar contexts.
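
One mitigation worth sketching is wrapping each call with retries and exponential backoff. The snippet below is illustrative only; it rescues the SDK’s generic service error rather than a specific throttling class, and the backoff values are not tuned recommendations:

# Retry an invoke_model call with exponential backoff when the service
# rejects it (for example, due to throttling or quota errors).
def invoke_with_retries(client, body, max_attempts: 5)
  attempts = 0
  begin
    client.invoke_model(
      body: body,
      model_id: "anthropic.claude-v2:1",
      content_type: "application/json",
      accept: "*/*"
    )
  rescue Aws::Errors::ServiceError
    attempts += 1
    raise if attempts >= max_attempts
    sleep(2**attempts) # back off: 2, 4, 8, 16 seconds
    retry
  end
end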

Cost estimation. The initial estimated budget for this project was approximately $400, based on preliminary data and expected model interactions. However, the project’s final cost escalated to around $1,600. This discrepancy arose primarily due to underestimating the complexity and length of CVE descriptions and details, resulting in more detailed responses than anticipated. Moreover, the initial data sample used for cost prediction was not fully representative of the broader data set, leading to inaccuracies in our budget forecast.

Response characteristics. We expected the model to provide YES/NO answers to our prompts, which were necessary for further data curation steps. However, we often received responses with additional justifications or explanations beyond our anticipated straightforward answers. While these expanded responses provided deeper insights, they also introduced unexpected complexity into our data processing workflow. This characteristic highlighted the need to refine our prompts further to align more closely with our desired response format, optimizing the model’s output for our specific use case.
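
One way to cope with the extra text is a thin post-processing layer like the sketch below. It is a simplified illustration; the <Answer> tag is hypothetical and not the tag used in our actual prompt:

# Pull a YES/NO verdict out of a response that may also contain justification.
def extract_verdict(text)
  match = text.match(/<Answer>\s*(YES|NO)\s*<\/Answer>/i) || text.match(/\b(YES|NO)\b/i)
  match && match[1].upcase
end

extract_verdict("<Answer>NO</Answer> The description mentions no prerequisite conditions.")
# => "NO"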

Results analysis

In our task, Claude v2.1 did a really good job. Ignoring the quota limits for a moment, we sent out 68,378 requests, and almost all of them came back with the YES/NO answers we were looking for. Only eight requests didn’t give us the straightforward answers we expected. That’s a success rate of 99.9883%, which is impressive. Even for those eight times when we didn’t get a simple YES or NO, Claude still provided enough info for us to figure out the answer.

Metric                                                          Value
Character count of the prompt (without CVE-specific details)    13,935
Number of tokens for the prompt (without CVE-specific details)  2,733
Total requests                                                  68,378
Unexpected answers                                              8
Failures (quota limitations excluded)                           0
Answer quality success rate                                     99.9883%

While the high success rate with Claude v2.1 is undoubtedly impressive, another aspect we shouldn’t overlook is that Claude often provided us with additional information beyond what we specifically asked for. Overall, this didn’t cause significant issues on our end, since the YES/NO answers we needed were identified correctly. However, this extra information impacted our analysis’s overall cost and speed.

To put it into perspective, out of 68,370 valid responses, 11,636 included more information than requested. This accounts for about 17% of all responses. At first glance, 17% might not seem like a huge number, but it’s important to note the difference in response length. We expected responses to be around 23 or 24 characters long, just enough for a YES or NO answer wrapped with a tag. However, the average length for these more detailed responses was 477.4 characters, with the longest being 607 characters. This means our anticipated response size was only about 5% of the average length for these more detailed replies, highlighting a significant discrepancy contributing to increased processing time and cost.

Adding another layer to this analysis, when we summed the total number of characters for the straightforward short responses, it amounted to 1,343,688 characters. In stark contrast, the number of characters for the long responses was 5,554,849. This data reveals that while the 17% of responses with additional information only make up a minor fraction of the total count, they accounted for over 80% of the total characters present in all responses, which is a significant number. This disproportion further underscores the impact of the longer responses on the overall analysis—dramatically increasing the volume of data processed and, consequently, the associated costs, especially since around 75% of the total cost of using Bedrock Claude comes from the output tokens. This statistic vividly illustrates how a relatively small proportion of responses can disproportionately contribute to the workload and expense of handling and analyzing the data.

Metric                                        Value
Total requests                                68,378
Requests with extensive answers               11,636
% of requests with extensive answers          17%
Total number of characters                    6,898,537
Total number of characters for short answers  1,343,688
Total number of characters for long answers   5,554,849
% of characters in short answers              19.5%
% of characters in long answers               80.5%

The additional details observed in 17% of responses likely result from the model’s inherent design to generate thorough, contextually rich text. These models, including both Claude and GPT-4, aim to deliver the most relevant and helpful information based on the vast data sets they’ve been trained on. When provided with a prompt, they often produce more comprehensive responses than strictly requested, especially if the prompt’s phrasing leaves room for interpretation. This behavior reflects the models’ training to prioritize completeness and clarity in their outputs, attempting to cover any potential ambiguities in the query. It’s important to note that this tendency is not unique to Claude; similar behavior has been observed with GPT-4, indicating a broader pattern among LLMs to provide extensive information to be as helpful as possible. This underlines the challenge of fine-tuning prompts across different models to elicit the desired level of specificity in responses.

While we guided the model to restrain its responses to the essentials by explicitly stating the need for brief, binary answers, that was not enough. We might also have leveraged Bedrock API settings to limit response length further or to adjust other parameters. This would, however, have required another round of prompt refinement, which would have carried costs of its own.
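
As an illustration of what such settings could look like, here is a sketch of a tighter request body. The parameter names follow the Claude text-completion API on Bedrock, while the values and the </Answer> stop sequence are illustrative rather than tuned recommendations:

# A tighter request body that discourages long answers.
body = {
  prompt: "\n\nHuman: ...classification prompt...\n\nAssistant:",
  max_tokens_to_sample: 10,                    # hard cap on output length
  temperature: 0.0,                            # deterministic, terse completions
  stop_sequences: ["</Answer>", "\n\nHuman:"], # cut off right after the tagged verdict
  anthropic_version: "bedrock-2023-05-31"
}.to_json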

Conclusions

Diving into CVE data with AWS Bedrock and Claude taught us several big lessons:

  • Keep an eye on costs
  • Make sure your prompts are spot on
  • Don’t let the answers get too long, or things can get pricey

Even though we ran into some bumps, Claude v2.1 proved to be really useful. The trick is to tweak the prompts to get more short answers that don’t cost as much.

Through this project, we gained invaluable hands-on experience with AWS Bedrock, significantly enhancing our understanding of prompt construction and operational capabilities, and giving us clear insight into its limitations.

This endeavor enabled us to automate a process that, if done manually, would have been nearly impossible. Analyzing a single vulnerability to discern AT, including potential further research, would take an analyst approximately 2-5 minutes. Given the sheer volume of data involved, this translates to roughly 200 days of continuous work for a single person, highlighting a massive time-saving and addressing the reliability issues stemming from the specialized knowledge and cross-referencing required in manual analysis.

Recommendations

Below are key takeaways from our experiment worth considering when working with any LLMs.

  1. Thorough cost estimation. Estimate costs carefully using a representative sample of all expected data. Conduct a detailed pre-analysis to account for the variability in data complexity and response detail levels.
  2. Consideration of time. Factor in the time required for processing requests, especially when dealing with large datasets. Implement strategies to optimize request handling, such as batching or spreading out processing to avoid quota limitations.
  3. Tune model responses. Fine-tune prompts to elicit responses that are as detailed as necessary for your specific use case. This might involve iterative testing and refinement of prompts to balance detail richness and processing efficiency. While Claude can justify its verdicts amazingly, this is not without a cost.
  4. Use caution with parallel processing. While parallel processing can speed up data handling, balancing this with quota limitations is essential (see the pacing sketch after this list).
  5. Error handling and retry logic. Incorporate robust error handling and retry logic in your integration code to manage intermittent issues gracefully.
  6. Security and data privacy. Prioritize security and data privacy in all interactions with AI services. Ensure that data handling complies with relevant regulations and best practices, particularly when processing sensitive or proprietary information.
  7. Collaboration with AI providers. Foster a collaborative relationship with AI service providers for support and guidance. Leverage their expertise to optimize your use of their technologies, from model selection to integration best practices.
  8. Documentation and knowledge sharing. Document your processes, challenges encountered, and solutions implemented. Share knowledge within your team and the broader community to foster best practices and collaborative problem-solving.
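
To make recommendation 4 concrete, here is a minimal pacing sketch. The batch size, the sleep intervals, and the classify helper are all assumptions used to illustrate the idea, not values we endorse:

# Process CVEs in small batches with pauses to stay within account quotas.
cves.each_slice(50) do |batch|
  batch.each do |cve|
    classify(cve)  # hypothetical helper wrapping client.invoke_model for one description
    sleep 1        # crude pacing between individual requests
  end
  sleep 30         # longer pause between batches
end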

Future outlook in AI and cybersecurity

At Mend.io, we’re already seeing how AI can change cybersecurity. We started using AI to make sense of messy data a while ago, but the rise of LLMs made it more accessible. This success has shown us that LLMs can do wonders in sorting through data to find the significant bits, which is super helpful for keeping things secure. We’re looking to take things up a notch by further integrating those AI capabilities within our day-to-day operations. This means our security experts and data analysts will get their jobs done faster and better, thanks to AI helping them out. We are also working on integrating LLMs to better handle legal issues like license compliance and copyrights. This way, we can solve these tricky issues more accurately and quickly. This is merely a glimpse into our broader initiatives, with only a select few ready for public unveiling.

The future of AI in cybersecurity is promising, especially as we aim to harness the industry’s vast but unorganized knowledge. With the advancement of AI technologies, we’re now equipped to sift through and extract valuable insights from this data, turning what was once a challenge into a significant asset. AI’s ability to analyze and organize complex information sets the stage for more proactive and intelligent cybersecurity measures. This evolution suggests a shift towards responding to threats and predicting and preventing them, leveraging the deep knowledge we’ve accumulated over the years.
