LLMDA

Just-in-Time Detection of Silent Security Patches 

Open-source code is pervasive. In this setting, embedded vulnerabilities are spreading to downstream software at an alarming rate. While such vulnerabilities are generally identified and addressed rapidly, inconsistent maintenance policies may lead security patches to go unnoticed. Indeed, security patches can be "silent", i.e., they do not always come with comprehensive advisories such as CVEs. This lack of transparency leaves users oblivious to available security updates, providing ample opportunity for attackers to exploit unpatched vulnerabilities. Consequently, identifying silent security patches just in time when they are released is essential for preventing n-day attacks, and for ensuring robust and secure maintenance practices. With LLMDA we propose to (1) leverage large language models (LLMs) to augment patch information with generated code change explanations, (2) design a representation learning approach that explores code-text alignment methodologies for feature combination, (3) implement a label-wise training with labeled instructions for guiding the embedding based on security relevance, and (4) rely on a probabilistic batch contrastive learning mechanism for building a high-precision identifier of security patches. We evaluate LLMDA on the PatchDB and SPI-DB literature datasets and show that our approach substantially improves over the state-of-the-art, notably GraphSPD by 20% in terms of F-Measure on the SPI-DB benchmark.


LLMDA Method

Figure 1 depicts the overview of the different steps of LLMDA. First, representations of multi-modal inputs (code and texts) are obtained using LLMs. Then, the obtained representations are aligned within a unique embedding space and fused into a single comprehensive representation by the PT-Former module. Finally, a stochastic batch contrastive learning (SBCL) mechanism is deployed to make the predictions of whether a given patch is a security patch or not.
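The three steps above can be sketched end to end. This is a minimal illustrative sketch, not the actual implementation: all function names (`explain_fn`, `embed_fn`, `fuse_fn`, `clf_fn`) and the toy stubs are assumptions standing in for the LLM, PT-Former, and the SBCL-trained classifier.

```python
def classify_patch(patch_diff, explain_fn, embed_fn, fuse_fn, clf_fn):
    """Hypothetical LLMDA-style pipeline: augment, embed, fuse, classify."""
    explanation = explain_fn(patch_diff)   # (1) LLM generates a code change explanation
    code_emb = embed_fn(patch_diff)        # (1) multi-modal input representations
    text_emb = embed_fn(explanation)
    fused = fuse_fn(code_emb, text_emb)    # (2) PT-Former-style alignment and fusion
    return clf_fn(fused)                   # (3) security / non-security prediction

# Toy stubs so the sketch runs; a real system would plug in an LLM here.
out = classify_patch(
    "diff --git a/x.c b/x.c",
    explain_fn=lambda d: "fixes a buffer overflow",
    embed_fn=lambda s: [float(len(s))] * 4,   # fake 4-dim embedding
    fuse_fn=lambda a, b: a + b,               # concatenate the two embeddings
    clf_fn=lambda v: sum(v) > 0,              # trivial stand-in classifier
)
```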

Main components 

PT-Former addresses the challenge of fusing embeddings from different modalities—specifically, patches and texts—by aligning and then concatenating their embedding spaces. Self-attention first updates each modality's embedding so that the information it carries is rich and contextually relevant. A cross-attention module then aligns the embeddings of code changes with their textual explanations, capturing the interaction between the two modalities. Finally, feed-forward layers apply a non-linear transformation and the aligned embeddings are concatenated into a single comprehensive representation of the input. By bridging the gap between distinct feature spaces, PT-Former optimizes embedding fusion and enables more accurate classification.
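The sequence of operations (self-attention per modality, cross-attention for alignment, feed-forward transform, concatenation) can be sketched in PyTorch. This is an illustrative sketch only: the dimensions, head count, layer sharing, and pooling are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PTFormerSketch(nn.Module):
    """Sketch of a PT-Former-style fusion block (hyperparameters assumed)."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, code_emb, text_emb):
        # Update each modality's embedding via self-attention.
        code, _ = self.self_attn(code_emb, code_emb, code_emb)
        text, _ = self.self_attn(text_emb, text_emb, text_emb)
        # Align code changes with their textual explanation (cross-attention:
        # code tokens query the explanation tokens).
        aligned, _ = self.cross_attn(code, text, text)
        # Non-linear transform, then concatenate into one fused representation.
        fused = torch.cat([self.ffn(aligned), code], dim=-1)
        return fused.mean(dim=1)  # pool tokens into one vector per patch

x_code = torch.randn(2, 10, 64)   # batch of 2 patches, 10 code tokens each
x_text = torch.randn(2, 12, 64)   # their explanations, 12 text tokens each
out = PTFormerSketch()(x_code, x_text)   # shape: (2, 128)
```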


SBCL, standing for Stochastic Batch Contrastive Learning, refines the binary classifier that distinguishes security patches from non-security patches. It operates on the embeddings output by PT-Former, which encapsulate security patch characteristics together with their LLM-generated explanations, developer descriptions, and labelled instructions. SBCL uses batch sampling and triplet formation to learn from both closely related security examples (positive pairs) and clearly different non-security examples (negative pairs), thereby optimizing the model's embedding space for precise security relevance prediction. Through a stochastic batch contrastive loss, SBCL adjusts the embedding distances within each batch to ensure a clear demarcation between security-related and non-security-related examples. This yields an embedding space that is both robust and discriminative, significantly bolstering the model's performance in identifying security patches.
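A batch contrastive loss of this kind can be sketched as follows: within a sampled batch, same-label embeddings (positive pairs) are pulled together and different-label embeddings (negative pairs) are pushed apart. This is a generic supervised-contrastive sketch, not the paper's exact loss; the temperature value and masking details are assumptions.

```python
import torch
import torch.nn.functional as F

def batch_contrastive_loss(emb, labels, temperature=0.1):
    """Sketch of a supervised batch contrastive loss over PT-Former-style
    embeddings. Positives = same label (security/security or non/non)."""
    emb = F.normalize(emb, dim=1)              # cosine similarity via dot product
    sim = emb @ emb.t() / temperature          # pairwise similarity matrix
    n = emb.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Softmax over all non-self pairs; maximize probability of positive pairs.
    logits = sim.masked_fill(self_mask, float('-inf'))
    log_prob = F.log_softmax(logits, dim=1)
    return -(log_prob[pos]).mean()

emb = torch.randn(8, 32)                          # a sampled batch of embeddings
labels = torch.tensor([1, 1, 0, 0, 1, 0, 1, 0])   # 1 = security patch
loss = batch_contrastive_loss(emb, labels)        # scalar, strictly positive
```

Minimizing this loss shrinks distances within each class and grows them across classes, which is what gives the final classifier its clear security/non-security separation.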


Contribution

  • LLMDA is a novel framework for security patch detection. LLMDA can detect silent security patches as it does not require any explicit descriptive information from developers to operate. It leverages LLMs for both data augmentation (generation of explanations) and patch analysis (generation of representations). It further deploys a specialized PT-Former module to align various modalities within a single embedding space, enabling the approach to extract richer information from the joint context of code and descriptions. Leveraging contrastive learning on the resulting embeddings, LLMDA is able to precisely identify security patches.
  • We achieve new state-of-the-art performance in security patch detection. The experimental results show that our language-centric approach consistently outperforms the baseline methods (i.e., TwinRNN and GraphSPD) on two target datasets: LLMDA achieves up to ~42% and ~20% performance improvement over the incumbent state-of-the-art on PatchDB and SPI-DB, respectively.
  • We experimentally demonstrate through ablation studies that the different components and key design decisions of LLMDA contribute to its overall performance. Notably, we show that the representations have a high discriminative power and that the yielded classification model is relatively robust (compared to the incumbent state-of-the-art).

Experimental Design

Research Questions

  • RQ1. How effective is LLMDA in identifying security patches? We assess LLMDA against well-known literature benchmarks and compare the achieved performance against strong baselines.
  • RQ2. How do key design decisions in LLMDA contribute to its performance? We perform an ablation study where we investigate the added value of label-wise training, the generated explanations, PT-Former and contrastive learning.
  • RQ3. To what extent does the distribution of patch representations in LLMDA improve over the state of the art? We visualize the learned representations from LLMDA and GraphSPD to observe the differences in their potential discriminative power. Based on case studies, we also qualitatively assess how the LLMDA representation assigns scores to key tokens.
  • RQ4. Does the trained LLMDA model generalize beyond our study dataset? We evaluate the robustness of LLMDA by applying the model trained on a given dataset to samples from a different dataset.
Datasets

  • PatchDB is an extensive set of patches of C/C++ programs. It includes about 12K security-relevant and about 24K non-security-relevant patches. The dataset was constructed by considering patches referenced in the National Vulnerability Database (NVD) as well as patches extracted from GitHub commits of 311 open-source projects (e.g., Linux kernel, MySQL, OpenSSL, etc.).
  • SPI-DB is another large dataset for security patch identification. The public version includes patches from FFmpeg and QEMU, amounting to about 25K patches (10K security-relevant and 15K non-security-relevant).
Main Experimental Results