LLMDA

Just-in-Time Detection of Silent Security Patches 

Open-source code is pervasive. In this setting, embedded vulnerabilities are spreading to downstream software at an alarming rate. While such vulnerabilities are generally identified and addressed rapidly, inconsistent maintenance policies may lead security patches to go unnoticed. Indeed, security patches can be "silent", i.e., they do not always come with comprehensive advisories such as CVEs. This lack of transparency leaves users oblivious to available security updates, providing ample opportunity for attackers to exploit unpatched vulnerabilities. Consequently, identifying silent security patches just in time when they are released is essential for preventing n-day attacks, and for ensuring robust and secure maintenance practices. With LLMDA we propose to (1) leverage large language models (LLMs) to augment patch information with generated code change explanations, (2) design a representation learning approach that explores code-text alignment methodologies for feature combination, (3) implement a label-wise training with labeled instructions for guiding the embedding based on security relevance, and (4) rely on a probabilistic batch contrastive learning mechanism for building a high-precision identifier of security patches. We evaluate LLMDA on the PatchDB and SPI-DB literature datasets and show that our approach substantially improves over the state-of-the-art, notably GraphSPD by 20% in terms of F-Measure on the SPI-DB benchmark.


LLMDA Method

Figure 1 depicts the overview of the different steps of LLMDA. First, representations of multi-modal inputs (code and texts) are obtained using LLMs. Then, the obtained representations are aligned within a unique embedding space and fused into a single comprehensive representation by the PT-Former module. Finally, a stochastic batch contrastive learning (SBCL) mechanism is deployed to make the predictions of whether a given patch is a security patch or not.
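The three steps above can be sketched end to end. This is a minimal illustrative sketch, not the actual implementation: all function names (`explain_fn`, `embed_fn`, `fuse_fn`, `clf_fn`) and the toy stubs are assumptions standing in for the LLM, PT-Former, and the SBCL-trained classifier.

```python
def classify_patch(patch_diff, explain_fn, embed_fn, fuse_fn, clf_fn):
    """Hypothetical LLMDA-style pipeline: augment, embed, fuse, classify."""
    explanation = explain_fn(patch_diff)   # (1) LLM generates a code change explanation
    code_emb = embed_fn(patch_diff)        # (1) multi-modal input representations
    text_emb = embed_fn(explanation)
    fused = fuse_fn(code_emb, text_emb)    # (2) PT-Former-style alignment and fusion
    return clf_fn(fused)                   # (3) security / non-security prediction

# Toy stubs so the sketch runs; a real system would plug in an LLM here.
out = classify_patch(
    "diff --git a/x.c b/x.c",
    explain_fn=lambda d: "fixes a buffer overflow",
    embed_fn=lambda s: [float(len(s))] * 4,   # fake 4-dim embedding
    fuse_fn=lambda a, b: a + b,               # concatenate the two embeddings
    clf_fn=lambda v: sum(v) > 0,              # trivial stand-in classifier
)
```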

Main components 

PT-Former addresses the challenge of fusing embeddings from different modalities—specifically, patches and texts—by aligning and then concatenating their embedding spaces. Self-attention first updates each modality's embedding so that the information it carries is rich and contextually relevant. A cross-attention module then aligns the embeddings of code changes with their textual explanations, capturing the interaction between the two modalities. Finally, feed-forward layers apply a non-linear transformation and the aligned embeddings are concatenated into a single comprehensive representation of the input. By bridging the gap between distinct feature spaces, PT-Former optimizes embedding fusion and enables more accurate classification.
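The sequence of operations (self-attention per modality, cross-attention for alignment, feed-forward transform, concatenation) can be sketched in PyTorch. This is an illustrative sketch only: the dimensions, head count, layer sharing, and pooling are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PTFormerSketch(nn.Module):
    """Sketch of a PT-Former-style fusion block (hyperparameters assumed)."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, code_emb, text_emb):
        # Update each modality's embedding via self-attention.
        code, _ = self.self_attn(code_emb, code_emb, code_emb)
        text, _ = self.self_attn(text_emb, text_emb, text_emb)
        # Align code changes with their textual explanation (cross-attention:
        # code tokens query the explanation tokens).
        aligned, _ = self.cross_attn(code, text, text)
        # Non-linear transform, then concatenate into one fused representation.
        fused = torch.cat([self.ffn(aligned), code], dim=-1)
        return fused.mean(dim=1)  # pool tokens into one vector per patch

x_code = torch.randn(2, 10, 64)   # batch of 2 patches, 10 code tokens each
x_text = torch.randn(2, 12, 64)   # their explanations, 12 text tokens each
out = PTFormerSketch()(x_code, x_text)   # shape: (2, 128)
```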


SBCL, standing for Stochastic Batch Contrastive Learning, refines the binary classifier that distinguishes security patches from non-security patches. It operates on the embeddings output by PT-Former, which encapsulate security patch characteristics together with their LLM-generated explanations, developer descriptions, and labelled instructions. SBCL uses batch sampling and triplet formation to learn from both closely related security examples (positive pairs) and clearly different non-security examples (negative pairs), thereby optimizing the model's embedding space for precise security relevance prediction. Through a stochastic batch contrastive loss, SBCL adjusts the embedding distances within each batch to ensure a clear demarcation between security-related and non-security-related examples. This yields an embedding space that is both robust and discriminative, significantly bolstering the model's performance in identifying security patches.
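A batch contrastive loss of this kind can be sketched as follows: within a sampled batch, same-label embeddings (positive pairs) are pulled together and different-label embeddings (negative pairs) are pushed apart. This is a generic supervised-contrastive sketch, not the paper's exact loss; the temperature value and masking details are assumptions.

```python
import torch
import torch.nn.functional as F

def batch_contrastive_loss(emb, labels, temperature=0.1):
    """Sketch of a supervised batch contrastive loss over PT-Former-style
    embeddings. Positives = same label (security/security or non/non)."""
    emb = F.normalize(emb, dim=1)              # cosine similarity via dot product
    sim = emb @ emb.t() / temperature          # pairwise similarity matrix
    n = emb.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Softmax over all non-self pairs; maximize probability of positive pairs.
    logits = sim.masked_fill(self_mask, float('-inf'))
    log_prob = F.log_softmax(logits, dim=1)
    return -(log_prob[pos]).mean()

emb = torch.randn(8, 32)                          # a sampled batch of embeddings
labels = torch.tensor([1, 1, 0, 0, 1, 0, 1, 0])   # 1 = security patch
loss = batch_contrastive_loss(emb, labels)        # scalar, strictly positive
```

Minimizing this loss shrinks distances within each class and grows them across classes, which is what gives the final classifier its clear security/non-security separation.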


Contribution

  • LLMDA is a novel framework for security patch detection. LLMDA can detect silent security patches as it does not require any explicit descriptive information from developers to operate. It leverages LLMs for both data augmentation (generation of explanations) and patch analysis (generation of representations). It further deploys a specialized PT-Former module to align various modalities within a single embedding space, enabling the approach to extract richer information from the joint context of code and descriptions. Leveraging contrastive learning on the resulting embeddings, LLMDA is able to precisely identify security patches.
  • We achieve new state-of-the-art performance in security patch detection. The experimental results show that our language-centric approach consistently outperforms the baseline methods (i.e., TwinRNN and GraphSPD) on two target datasets: LLMDA achieves up to ~42% and ~20% performance improvement over the incumbent state-of-the-art on PatchDB and SPI-DB, respectively.
  • We experimentally demonstrate through ablation studies that the different components and key design decisions of LLMDA contribute to its overall performance. Notably, we show that the representations have a high discriminative power and that the yielded classification model is relatively robust (compared to the incumbent state-of-the-art).

Experimental Design

Research Questions

  • RQ1. How effective is LLMDA in identifying security patches? We assess LLMDA against well-known literature benchmarks and compare the achieved performance against strong baselines.
  • RQ2. How do key design decisions in LLMDA contribute to its performance? We perform an ablation study where we investigate the added value of label-wise training, the generated explanations, PT-Former and contrastive learning.
  • RQ3. To what extent does the distribution of patch representations in LLMDA improve over the state of the art? We visualize the learned representations from LLMDA and GraphSPD to observe the differences in their potential discriminative power. Based on case studies, we also qualitatively assess how the LLMDA representation assigns scores to key tokens.
  • RQ4. Does the trained LLMDA model generalize beyond our study dataset? We evaluate the robustness of LLMDA by applying the model trained on a given dataset to samples from a different dataset.
Datasets

  • PatchDB is an extensive set of patches of C/C++ programs. It includes about 12K security-relevant and about 24K non-security-relevant patches. The dataset was constructed by considering patches referenced in the National Vulnerability Database (NVD) as well as patches extracted from GitHub commits of 311 open-source projects (e.g., Linux kernel, MySQL, OpenSSL, etc.).
  • SPI-DB is another large dataset for security patch identification. The public version includes patches from FFmpeg and QEMU, amounting to about 25K patches (10K security-relevant and 15K non-security-relevant).
Main Experimental Results