The development of machine learning techniques for discovering software
vulnerabilities relies fundamentally on the availability of appropriate
datasets. The ideal dataset consists of a large and diverse collection of
real-world vulnerabilities, paired so as to contain both vulnerable and patched
versions of each program. Naturally, collecting such datasets is a laborious
and time-consuming task. Within the specific domain of vulnerability discovery
in binary code, previous datasets are publicly unavailable, lack semantic
diversity, involve artificially introduced vulnerabilities, or were collected
using static analyzers and therefore themselves contain incorrectly labeled
example programs.
In this paper, we describe a new publicly available dataset, which we call
Binpool, containing numerous samples of vulnerable versions of Debian
packages collected over multiple years. The dataset was
automatically curated, and contains both vulnerable and patched versions of
each program, compiled at four different optimization levels. Overall, the
dataset contains 6144 binaries and covers 603 distinct CVEs across 89 CWE
classes and 162 Debian packages. We argue that this dataset is suitable for
evaluating a range of security analysis tools, including tools for
vulnerability discovery, binary function similarity, and plagiarism detection.