Uncertainty-Aware Contrastive Learning with Hard Negative Sampling for Code Search Tasks
Authors: Han Liu, Jiaqing Zhan, Qin Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results indicate that our approach outperforms 10 baseline methods on a large code search dataset covering six programming languages. The results also show that our strategies of uncertainty learning and hard negative sampling effectively enhance the representations of queries and code, leading to improved code search performance. |
| Researcher Affiliation | Academia | Han Liu¹·², Jiaqing Zhan¹, Qin Zhang¹* — ¹College of Computer Science and Software Engineering, Shenzhen University; ²Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed approach and loss functions using mathematical equations and text, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We conducted experiments on the CodeSearchNet code corpus (Husain et al. 2019), to be consistent with Guo et al. CodeSearchNet contains six languages, namely, Ruby, JavaScript, Go, Python, Java, and PHP, and has been widely used in previous studies. |
| Dataset Splits | Yes | To make the experimental setup closer to real scenarios, Guo et al. expanded the candidate dataset and filtered out low-quality queries on each code corpus through rules, with data statistics shown in Table 1. Table 1 (Training / Dev / Test / Candidates): Python 251,820 / 13,914 / 14,918 / 43,827; PHP 241,241 / 12,982 / 14,014 / 52,660; Go 167,288 / 7,325 / 8,122 / 28,120; Java 164,923 / 5,183 / 10,955 / 40,347; JavaScript 58,025 / 3,885 / 3,291 / 13,981; Ruby 24,927 / 1,400 / 1,261 / 4,360. |
| Hardware Specification | Yes | All experiments were conducted on a machine equipped with four NVIDIA GeForce RTX 4090 GPUs, each with 24 GB of memory. |
| Software Dependencies | No | The paper mentions using a 'Transformer architecture', initializing with 'CoCoSoDa (Shi et al. 2023)' parameters, and using the 'AdamW optimizer'. However, it does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | For training, we set the batch size to 128, the temperature hyperparameter to 0.03, the number of epochs to 10, and the random seed to 123456. The maximum sequence lengths are set to 256 for code snippets and 128 for queries. We use the AdamW optimizer with a learning rate of 8e-6. |
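The reported hyperparameters can be collected into a configuration sketch. Below is a minimal, illustrative snippet: the numeric values are the ones quoted above, but the config structure and the plain InfoNCE-style contrastive loss are assumptions for illustration; the paper's actual objective additionally involves uncertainty learning and hard negative sampling, which are not reproduced here.

```python
import math

# Hyperparameter values as reported in the paper's experiment setup.
# The dict structure itself is illustrative, not from the paper.
CONFIG = {
    "batch_size": 128,
    "temperature": 0.03,
    "epochs": 10,
    "seed": 123456,
    "max_code_length": 256,   # max tokens per code snippet
    "max_query_length": 128,  # max tokens per query
    "learning_rate": 8e-6,    # AdamW learning rate
}

def infonce_loss(sim_row, positive_index, temperature=CONFIG["temperature"]):
    """Standard InfoNCE-style contrastive loss for one query.

    `sim_row` holds similarities between a query and each candidate code in
    the batch; the temperature (0.03 here) sharpens the softmax. This is a
    generic formulation, not the paper's full uncertainty-aware loss.
    """
    scaled = [s / temperature for s in sim_row]
    m = max(scaled)  # subtract the max for numerical stability
    denom = sum(math.exp(s - m) for s in scaled)
    return -(scaled[positive_index] - m - math.log(denom))

# Example: a query whose positive code scores 0.9 against negatives at
# 0.1 and -0.2 yields a near-zero loss at this low temperature.
loss = infonce_loss([0.9, 0.1, -0.2], positive_index=0)
```

A low temperature such as 0.03 makes the softmax very peaked, so even modest similarity gaps between the positive pair and the negatives drive the loss close to zero, which is consistent with the common practice in contrastive code search training.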