phpcpd
Detect copied and pasted PHP code
TLDR
Analyze duplicated code for a specific file or directory
Analyze using fuzzy matching for variable names
Specify a minimum number of identical lines (defaults to 5)
Specify a minimum number of identical tokens (defaults to 70)
Exclude a directory from analysis (must be relative to the source)
Output the results to a PMD-CPD XML file
SYNOPSIS
phpcpd [options] <directories_or_files...>
PARAMETERS
--min-lines
Specifies the minimum number of identical lines to consider a block as a copy. The default value is 5.
--min-tokens
Specifies the minimum number of identical tokens to consider a block as a copy. The default value is 70.
--suffix
Only process files whose names end with the specified suffix. The default suffix is .php. This option can be specified multiple times to include several suffixes (e.g., --suffix .php --suffix .inc).
--exclude
Excludes a directory from the scan. This option can be specified multiple times to exclude several directories (e.g., --exclude vendor --exclude cache).
--fuzzy
Enables fuzzy matching of variable names, so that blocks differing only in the names of their variables are still reported as duplicates.
--log-xml
Writes the copy-paste detection results to an XML file. Recent releases list only --log-pmd for XML output, so check phpcpd --help before relying on this option.
--log-pmd
Writes the copy-paste detection results to an XML file in PMD-CPD format (the report format of PMD's Copy/Paste Detector), which tools like Jenkins or SonarQube can read.
--progress
Displays a progress bar during the scan operation.
--verbose
Enables verbose output, providing more detailed information during the scan.
--help
Displays a help message with available options and usage.
--version
Displays the version information of phpcpd.
<directories_or_files...>
One or more directories or specific files to be scanned for duplicated code.
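The options above can be combined in a single invocation. The following is a minimal Python sketch that assembles such a command line; build_phpcpd_command is a hypothetical helper written for this page, not part of phpcpd, and it only uses the flags documented above with their stated defaults.

```python
from typing import List, Optional, Sequence


def build_phpcpd_command(
    paths: Sequence[str],
    min_lines: int = 5,
    min_tokens: int = 70,
    exclude: Sequence[str] = (),
    fuzzy: bool = False,
    log_pmd: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for phpcpd (hypothetical helper)."""
    if not paths:
        raise ValueError("at least one directory or file is required")
    cmd = ["phpcpd", "--min-lines", str(min_lines), "--min-tokens", str(min_tokens)]
    # --exclude is repeated once per directory.
    for directory in exclude:
        cmd += ["--exclude", directory]
    if fuzzy:
        cmd.append("--fuzzy")
    if log_pmd is not None:
        cmd += ["--log-pmd", log_pmd]
    # Directories or files to scan come last.
    cmd += list(paths)
    return cmd


print(" ".join(build_phpcpd_command(["src"], exclude=["vendor"], fuzzy=True)))
```

The returned list can be passed to subprocess.run on a machine where phpcpd is installed.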
DESCRIPTION
phpcpd (PHP Copy/Paste Detector) is a command-line tool that scans PHP source code for duplicated blocks of code. It compares the token sequences of the scanned files and reports exact duplicates or, with --fuzzy, duplicates that differ only in variable names, both across files and within a single file.
Duplicated code hurts maintainability: every change has to be applied to each copy, and a missed copy becomes a bug. phpcpd reports each duplicated block with its file locations and the number of lines involved, which gives refactoring a concrete starting point.
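To make the idea of a minimum block size concrete, here is a simplified Python sketch that finds runs of identical lines shared between files. It is an illustration only: it compares trimmed, non-blank lines, whereas phpcpd works on PHP tokens, and find_duplicate_windows is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_duplicate_windows(
    files: Dict[str, str], min_lines: int = 5
) -> List[List[Tuple[str, int]]]:
    """Return groups of (file name, 1-based start line) whose next
    min_lines non-blank lines are identical after trimming whitespace."""
    seen = defaultdict(list)
    for name, text in files.items():
        # Keep (line number, trimmed text) for non-blank lines only.
        lines = [
            (number, raw.strip())
            for number, raw in enumerate(text.splitlines(), 1)
            if raw.strip()
        ]
        # Slide a window of min_lines lines over the file.
        for start in range(len(lines) - min_lines + 1):
            window = tuple(content for _, content in lines[start:start + min_lines])
            seen[window].append((name, lines[start][0]))
    # A window seen at more than one location is a duplicate.
    return [locations for locations in seen.values() if len(locations) > 1]


sources = {
    "a.php": "<?php\n$a = 1;\n$b = 2;\necho $a + $b;\n",
    "b.php": "<?php\n\n$x = 0;\n$a = 1;\n$b = 2;\necho $a + $b;\n",
}
print(find_duplicate_windows(sources, min_lines=3))
```

Lowering min_lines makes more windows match, which is why very small thresholds tend to report trivial repetitions.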
CAVEATS
While highly effective for detecting exact and near-exact code duplication, phpcpd might produce false positives with very small --min-lines or --min-tokens values. It focuses on syntactic duplication, not semantic, meaning it won't detect if two different pieces of code achieve the same logical outcome.
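The difference between syntactic and semantic duplication can be illustrated with a small Python sketch. It assumes a regular expression as a stand-in for variable-name fuzzing; phpcpd itself operates on tokens, and fuzz_variables is a hypothetical name.

```python
import re

# Matches PHP variable names such as $total or $_count.
VARIABLE = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")


def fuzz_variables(code: str) -> str:
    """Replace every PHP variable name with the same placeholder."""
    return VARIABLE.sub("$v", code)


first = "$total = $price * $quantity;"
second = "$sum = $cost * $count;"
third = "$total = array_product([$price, $quantity]);"

# Same structure, different names: equal once names are fuzzed.
print(fuzz_variables(first) == fuzz_variables(second))
# Same result computed differently: still distinct text.
print(fuzz_variables(first) == fuzz_variables(third))
```

The first two statements become identical after fuzzing, while the third, which computes the same value through a different expression, does not match either of them.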
Scanning very large codebases can be resource-intensive in terms of CPU and memory. Ensure your PHP installation has the tokenizer extension enabled, as it's required for phpcpd to parse PHP code.
INTEGRATION WITH CI/CD
phpcpd is commonly run in Continuous Integration/Continuous Deployment (CI/CD) pipelines so that new code duplication is detected and flagged before it is merged into the main codebase. The report written by --log-pmd gives the pipeline a machine-readable result to archive or evaluate.
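A pipeline step could evaluate a --log-pmd report along the following lines. This is a sketch under assumptions: it expects the PMD-CPD layout of duplication elements carrying a lines attribute, the sample report is made up for illustration, and the threshold and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up report in the PMD-CPD layout, for illustration only.
SAMPLE_REPORT = """
<pmd-cpd>
  <duplication lines="12" tokens="85">
    <file path="src/A.php" line="10"/>
    <file path="src/B.php" line="40"/>
  </duplication>
</pmd-cpd>
"""


def count_duplicated_lines(report_xml: str) -> int:
    """Sum the lines attribute of every duplication element."""
    root = ET.fromstring(report_xml)
    return sum(int(dup.get("lines", "0")) for dup in root.iter("duplication"))


def gate(report_xml: str, max_duplicated_lines: int = 0) -> int:
    """Return a process exit status: 0 passes, 1 fails the build."""
    if count_duplicated_lines(report_xml) <= max_duplicated_lines:
        return 0
    return 1


print(gate(SAMPLE_REPORT))
```

In a real job the report text would be read from the file named in --log-pmd and the returned status passed to sys.exit.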
REFACTORING AID
The reports generated by phpcpd guide refactoring. Because each report entry names the duplicated block and every location where it occurs, developers can see where code should be extracted into a shared function or class, or simply removed, which leads to more modular, readable, and maintainable software.
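One way to decide where to start is to rank files by how many duplicated lines they take part in. The Python sketch below assumes a PMD-CPD report in which each duplication element lists its locations as file children with a path attribute; the sample data and the name rank_files are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List, Tuple

# Made-up report in the PMD-CPD layout, for illustration only.
SAMPLE_REPORT = """
<pmd-cpd>
  <duplication lines="12" tokens="85">
    <file path="src/A.php" line="10"/>
    <file path="src/B.php" line="40"/>
  </duplication>
  <duplication lines="8" tokens="72">
    <file path="src/A.php" line="90"/>
    <file path="src/C.php" line="5"/>
  </duplication>
</pmd-cpd>
"""


def rank_files(report_xml: str) -> List[Tuple[str, int]]:
    """Return (path, duplicated lines) pairs, most affected file first."""
    totals = Counter()
    for dup in ET.fromstring(report_xml).iter("duplication"):
        lines = int(dup.get("lines", "0"))
        # Every location of the block counts toward its file.
        for location in dup.findall("file"):
            totals[location.get("path")] += lines
    return totals.most_common()


for path, lines in rank_files(SAMPLE_REPORT):
    print(path, lines)
```

Files at the top of the list are involved in the most duplication and are likely candidates for extracting shared code.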
HISTORY
phpcpd was developed and maintained largely by Sebastian Bergmann, who is also the creator of PHPUnit. It emerged as a specialized tool within the broader set of PHP quality assurance utilities, providing a dedicated solution for identifying code duplication, and it has been used for many years to keep PHP codebases free of copy-and-paste code.