Duplicate code can be hard to find, especially in a large project. But PMD's Copy/Paste Detector (CPD) can find it for you! CPD has been through three major incarnations.
Each rewrite made it much faster, and it can now process the JDK 1.4 java.* packages in about 4 seconds (on my workstation, at least).
Here's a screenshot of CPD after running on the JDK java.lang package.
Note that CPD works with Java, C, C++, and PHP code.
CPD is included with PMD, which you can download here. Or, if you have Java Web Start, you can run CPD by clicking here.
Here are the duplicates CPD found in the JDK 1.4 source code.
Here are the duplicates CPD found in the APACHE_2_0_BRANCH branch of Apache (just the httpd-2.0/server/ directory).
Andy Glover wrote an Ant task for CPD; here's how to use it:
<target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask" />
    <cpd minimumTokenCount="100" outputFile="/home/tom/cpd.txt">
        <fileset dir="/home/tom/tmp/ant">
            <include name="**/*.java"/>
        </fileset>
    </cpd>
</target>
There's an ignoreLiterals="true" option which makes CPD ignore literal value differences when evaluating a duplicate block. This means that foo=42; and foo=43; will be seen as equivalent. You may want to run CPD with this option off to start with and then switch it on to see what it turns up. There's also an ignoreIdentifiers="true" option that does the same thing with identifiers, i.e., variable names, method names, and so forth. The same guidelines apply. Finally, there's an optional language="cpp|java|php|ruby" flag to select the appropriate language; the default language is "java".
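To make the effect of these two options concrete, here is a minimal sketch of the idea. It is not CPD's actual tokenizer: the LiteralNormalizer class, its regular expression, and the placeholder tokens are all made up for illustration, and this toy version treats keywords as identifiers and only recognizes integer literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: a toy tokenizer, not CPD's actual implementation. */
public class LiteralNormalizer {

    // Matches an identifier, an integer literal, or any single non-space character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|\\d+|\\S");

    /** Splits source text into tokens, optionally masking literals and identifiers. */
    public static List<String> tokenize(String source,
                                        boolean ignoreLiterals,
                                        boolean ignoreIdentifiers) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            if (ignoreLiterals && t.matches("\\d+")) {
                // Every integer literal collapses to the same placeholder.
                tokens.add("<LITERAL>");
            } else if (ignoreIdentifiers && t.matches("[A-Za-z_][A-Za-z_0-9]*")) {
                // Every name collapses to the same placeholder (keywords too, in this toy).
                tokens.add("<IDENTIFIER>");
            } else {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Literals kept: the two statements differ.
        System.out.println(tokenize("foo=42;", false, false)
                .equals(tokenize("foo=43;", false, false)));
        // Literals masked: the two statements look identical.
        System.out.println(tokenize("foo=42;", true, false)
                .equals(tokenize("foo=43;", true, false)));
        // Identifiers masked: foo and bar look identical.
        System.out.println(tokenize("foo=42;", false, true)
                .equals(tokenize("bar=42;", false, true)));
    }
}
```

Running main prints false, true, true: once literals or identifiers are masked, statements that differ only in those values produce the same token sequence, which is why switching these options on tends to turn up more duplicates.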
Also, you can get verbose output from this task by running Ant with the -v flag; i.e.:
ant -v -f mybuildfile.xml cpd
To run CPD from the command line, just give it the minimum duplicate size (in tokens) and the source directory:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /usr/local/java/src/java
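If it helps to picture what a minimum token count means, the sketch below reports every run of minimumTokens consecutive tokens that has already appeared earlier in the input. It is a naive illustration, not the algorithm CPD uses; the WindowDuplicates class and its find method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a naive window-matching sketch, not CPD's actual algorithm. */
public class WindowDuplicates {

    /**
     * Returns pairs {firstIndex, laterIndex} where the run of minimumTokens
     * tokens starting at laterIndex repeats a run first seen at firstIndex.
     */
    public static List<int[]> find(List<String> tokens, int minimumTokens) {
        Map<List<String>, Integer> firstSeen = new HashMap<>();
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i + minimumTokens <= tokens.size(); i++) {
            // Copy the window so the map key does not depend on the backing list.
            List<String> window = new ArrayList<>(tokens.subList(i, i + minimumTokens));
            Integer earlier = firstSeen.putIfAbsent(window, i);
            if (earlier != null) {
                matches.add(new int[] {earlier, i});
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "a", "=", "1", ";", "b", "=", "2", ";", "a", "=", "1", ";");
        for (int[] m : find(tokens, 4)) {
            System.out.println("tokens at " + m[1] + " repeat tokens at " + m[0]);
        }
    }
}
```

With a window of 4 tokens this sample yields one match (the run at index 8 repeats the run at index 0); with a window of 5 it yields none. Raising the threshold has the same kind of effect in CPD: only longer duplicated runs are reported.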
You can also specify the language:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /path/to/c/source --language cpp
And if you're checking a C source tree with duplicate files in different architecture directories, you can skip those using --skip-duplicate-files:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /path/to/c/source --language cpp --skip-duplicate-files
You can also specify a report format; here we're using the XML report:
$ java net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /usr/local/java/src/java --format net.sourceforge.pmd.cpd.XMLRenderer
The default format is a text report, and there's also a net.sourceforge.pmd.cpd.CSVRenderer report.
Note that CPD is pretty memory-hungry; you may need to give Java more memory to run it, like this:
$ java -Xmx512m net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /usr/local/java/src/java
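If you're not sure the larger heap setting is taking effect, a tiny check like the following can help. It is a standalone sketch, not part of CPD; HeapCheck is a hypothetical class name, and it only uses the standard Runtime.maxMemory() call.

```java
/** Prints the maximum heap the current JVM will use, to confirm an -Xmx setting. */
public class HeapCheck {

    /** Maximum heap size in whole megabytes, as reported by the running JVM. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same flag you plan to give CPD, for example java -Xmx512m HeapCheck. The reported figure may come out slightly below the value you asked for, depending on the JVM.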
Suggestions? Comments? Post them here. Thanks!