Siamese (Scalable, incremental, and multi-representation) is a code clone search engine.
It works with multiple representations of source code to capture code similarity at different structural levels. It mines token frequencies in a code corpus on the fly and automatically adjusts a query’s length to improve search speed and accuracy. The tool scales to a corpus of hundreds of millions of lines of code and returns results within seconds. It also allows incremental updates to its index to support changes in the software project being analysed.
Siamese: The Siamese executable can be downloaded here: Siamese v. 0.6. Please make sure you have Java 8 installed on your machine.
1. To execute Siamese, unzip the file and follow the steps below:
$ cd siamese
$ ./elasticsearch-2.2.0/bin/elasticsearch -d
$ java -jar siamese-0.0.5-SNAPSHOT.jar
You will then see the usage message, with examples of how to use Siamese.
usage: (v 0.5) $java -jar siamese.jar -cf <config file> [-i input] [-o output] [-c command] [-h help]
Example: java -jar siamese.jar -cf config.properties
Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o /my/output/dir -c index
 -c,--command <arg>       [optional] command to execute [index, search]. This will override the configuration file.
 -cf,--configFile <arg>   [* requried *] a configuration file
 -h,--help                <optional> print help
 -i,--inputFolder <arg>   [optional] location of the input files (for index or query). This will override the configuration file.
 -o,--outputFolder <arg>  [optional] location of the search result file. This will override the configuration file.
2. An example of running Siamese to index a project “foo”.
java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
3. Then, tell Siamese to search for clones of “bar” in “foo”.
java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
4. After Siamese finishes its execution, the output file (clone classes) will be located at
The file will be using the pattern
5. If you want to enforce a similarity threshold on the search results, edit the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity.
Set one similarity threshold for each of the four code representations (r0, r1, r2, r3), in that order.
computeSimilarity : tokenratio
simThreshold : 50%,50%,50%,50%
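The index and search commands from steps 2 and 3 can be chained in a small wrapper script. The following is only a minimal sketch: the function name siamese_cmd and the variables SIAMESE_JAR, CONFIG and DRY_RUN are illustrative and are not part of Siamese. It assumes Elasticsearch has already been started as in step 1, and it uses only the -c, -i, -o and -cf options listed in the usage message. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: index project "foo", then search it for clones of "bar".
# Assumption: Elasticsearch is already running (see step 1).
# SIAMESE_JAR, CONFIG, DRY_RUN and siamese_cmd are hypothetical names
# chosen for this example; they are not defined by Siamese itself.
set -eu

SIAMESE_JAR="${SIAMESE_JAR:-siamese-0.0.6-SNAPSHOT.jar}"
CONFIG="${CONFIG:-config.properties}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = print and run it

# Build one Siamese invocation and optionally execute it.
# $1 = command (index or search), $2 = input folder, $3 = optional output folder
siamese_cmd() {
  command=$1
  input=$2
  output=${3:-}
  # Reuse the positional parameters as the argument list so that
  # paths containing spaces survive when the command is executed.
  set -- java -jar "$SIAMESE_JAR" -c "$command" -i "$input"
  if [ -n "$output" ]; then
    set -- "$@" -o "$output"
  fi
  set -- "$@" -cf "$CONFIG"
  echo "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

siamese_cmd index /my/dir/foo
siamese_cmd search /my/dir/bar /my/output/dir
```

Running the script with DRY_RUN=0 executes both commands in order; because of set -e, the search step is skipped if indexing fails.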
BigCloneEval: BigCloneEval is a tool for automated recall evaluation based on the BigCloneBench data set. It can be downloaded from: BigCloneBench