Data Brief. 2019 Oct 24;27:104712.
doi: 10.1016/j.dib.2019.104712. eCollection 2019 Dec.

Source code analysis dataset

Ben Gelman et al. Data Brief. 2019.

Abstract

The data in this article pair source code with three artifacts from 108,568 projects downloaded from GitHub that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extracted using Doxygen. The second set of pairs connects raw C and C++ source code repositories with the build artifacts of that code, which are obtained by running the make command. The last set of pairs connects raw C and C++ source code repositories with potential code vulnerabilities, which are determined by running the Infer static analyzer. The code and comment pairs can be used for tasks such as predicting comments or creating natural language descriptions of code. The code and build artifact pairs can be used for tasks such as reverse engineering or improving intermediate representations of code from decompiled binaries. The code and static analyzer pairs can be used for tasks such as machine learning approaches to vulnerability discovery.

Keywords: Bug detection; Code comments; Source code; Static analysis.
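
To make the idea of a code and comment pair concrete, the following is a minimal Python sketch that pairs each `/** ... */` documentation comment in a C-like source string with the line of code that follows it. It is only an illustration: the dataset itself was built with Doxygen, whereas this sketch uses a simple regular expression as a stand-in, and the names `DOC_COMMENT`, `clean_comment`, and `pair_comments_with_code`, as well as the example source, are hypothetical and do not come from the article.

```python
import re
from typing import List, Tuple

# Matches a /** ... */ block followed by the next non-blank line of code.
# Assumption: this regex is a simplified stand-in for Doxygen, not its
# actual parsing logic, and it only handles block-style comments.
DOC_COMMENT = re.compile(r"/\*\*(.*?)\*/\s*\n\s*([^\n]+)", re.DOTALL)


def clean_comment(raw: str) -> str:
    """Strip leading asterisks and collapse a comment body into one line."""
    lines = [line.strip().lstrip("*").strip() for line in raw.splitlines()]
    return " ".join(line for line in lines if line)


def pair_comments_with_code(source: str) -> List[Tuple[str, str]]:
    """Return (comment, code_line) pairs, one per documentation comment."""
    return [
        (clean_comment(match.group(1)), match.group(2).strip())
        for match in DOC_COMMENT.finditer(source)
    ]


# Hypothetical input, written only for this illustration.
example = """
/**
 * Add two integers.
 */
int add(int a, int b) { return a + b; }

/** Return the larger value. */
int max_of(int a, int b) { return a > b ? a : b; }
"""

pairs = pair_comments_with_code(example)
for comment, code in pairs:
    print(f"{comment!r} -> {code!r}")
```

Pairs of this shape (a natural language description alongside the code it documents) are the kind of input that comment prediction or code summarization models would consume.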
