Siamese: A Scalable Code Clone Search Engine
Siamese (Scalable, incremental, and multi-representation) is a code clone search engine.
It works with multiple representations of source code to capture code similarity at different structural levels. It mines token frequencies in the code corpus on the fly and automatically adjusts a query’s length to improve search speed and accuracy. The tool scales to a corpus of hundreds of millions of lines of code and returns results within seconds. It also allows incremental updates to its index to support changes in the software project being analysed.
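As a rough illustration of adjusting a query's length by token frequency, the sketch below counts how many corpus documents contain each token and drops query tokens that are too common to help discriminate. This is not Siamese's actual implementation: the function names, the toy corpus, and the `max_df_ratio` cut-off are all assumptions made for the example.

```python
from collections import Counter


def document_frequencies(corpus):
    """Count, for each token, how many corpus documents contain it."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each token once per document
    return df


def reduce_query(query_tokens, df, num_docs, max_df_ratio=0.5):
    """Keep only query tokens found in at most max_df_ratio of the documents.

    max_df_ratio is a hypothetical cut-off chosen for this sketch.
    """
    limit = max_df_ratio * num_docs
    return [t for t in query_tokens if df.get(t, 0) <= limit]


# Toy corpus: each document is a list of tokens (assumed data).
corpus = [
    ["int", "i", "=", "0", ";"],
    ["int", "sum", "=", "a", "+", "b", ";"],
    ["return", "sum", ";"],
    ["while", "(", "i", "<", "n", ")"],
]
df = document_frequencies(corpus)
query = ["int", "sum", "=", "i", "+", "n", ";"]
reduced = reduce_query(query, df, len(corpus))
print(reduced)
```

Here the token `;` occurs in three of the four documents, so it is removed from the query, while rarer tokens are kept. A shorter query of rarer tokens is cheaper to search and less likely to match unrelated code.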
Siamese: The Siamese executable can be downloaded here: Siamese v. 0.6. Please make sure you have Java 8 installed on your machine.
1. To execute Siamese, unzip the file and follow the steps below:
$cd siamese
$./elasticsearch-2.2.0/bin/elasticsearch -d
$java -jar siamese-0.0.6-SNAPSHOT.jar
Siamese then prints its usage message and examples:
usage: (v 0.5) $java -jar siamese.jar -cf <config file> [-i input] [-o output] [-c command] [-h help]
Example: java -jar siamese.jar -cf config.properties
Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o /my/output/dir -c index
-c,--command <arg> [optional] command to execute [index, search]. This will override the configuration file.
-cf,--configFile <arg> [* required *] a configuration file
-h,--help <optional> print help
-i,--inputFolder <arg> [optional] location of the input files (for index or query). This will override the configuration file.
-o,--outputFolder <arg> [optional] location of the search result file. This will override the configuration file.
2. To index a project “foo”, run:
java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
3. Then, tell Siamese to search for clones of “bar” in “foo”.
java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
4. After Siamese finishes its execution, the output file (clone classes) will be located at /my/output/dir. The file name follows the pattern data_qr_<timestamp>.xml.
5. To enforce a similarity threshold on the search results, modify the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity. Then choose a similarity threshold for each of the four code representations (r0, r1, r2, r3), in that order:
computeSimilarity : tokenratio
simThreshold : 50%,50%,50%,50%
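The steps above can be scripted. The following is a minimal Python sketch, not part of Siamese itself. It builds the index and search command lines using only the options documented in the usage message, finds the newest result file, and parses a simThreshold value. The helper names (`siamese_command`, `latest_result`, `parse_thresholds`) are hypothetical, and the sketch assumes that thresholds are whole-number percentages.

```python
from pathlib import Path

# Jar name as used in steps 2 and 3; adjust to the version you downloaded.
JAR = "siamese-0.0.6-SNAPSHOT.jar"


def siamese_command(command, input_dir, config="config.properties", output_dir=None):
    """Build the java command line for an 'index' or 'search' run."""
    if command not in ("index", "search"):
        raise ValueError("command must be 'index' or 'search'")
    args = ["java", "-jar", JAR, "-c", command, "-i", input_dir]
    if output_dir is not None:
        args += ["-o", output_dir]
    args += ["-cf", config]
    return args


def latest_result(output_dir):
    """Return the most recently modified data_qr_*.xml file, or None if there is none."""
    files = sorted(Path(output_dir).glob("data_qr_*.xml"),
                   key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def parse_thresholds(value):
    """Parse a value such as '50%,50%,50%,50%' into {'r0': 50, ..., 'r3': 50}."""
    parts = [p.strip().rstrip("%") for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four thresholds, one per representation")
    return {f"r{i}": int(p) for i, p in enumerate(parts)}


index_cmd = siamese_command("index", "/my/dir/foo")
search_cmd = siamese_command("search", "/my/dir/bar", output_dir="/my/output/dir")
thresholds = parse_thresholds("50%,50%,50%,50%")
```

To actually run a step, pass the list to `subprocess.run(index_cmd, check=True)` from the siamese directory, after Elasticsearch has been started as in step 1.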
BigCloneEval: BigCloneEval is a tool for automated recall evaluation based on the BigCloneBench data set. It can be downloaded from: BigCloneBench