File Deduplicator

Find and remove duplicate files intelligently. Save storage space, keep your system clean. Perfect for digital hoarders and document management.

Install
$clawhub install file-deduplicator

File-Deduplicator - Find and Remove Duplicates

Vernox Utility Skill - Clean up your digital hoard.

Overview

File-Deduplicator is an intelligent duplicate-file finder and remover. It hashes file contents to identify identical files across directories, then offers options for removing the duplicates safely.

Features

✅ Duplicate Detection

  • Content-based hashing (MD5) for fast comparison (see the sketch after this list)

  • Size-based detection (exact match, near match)

  • Name-based detection (similar filenames)

  • Directory scanning (recursive)

  • Exclude patterns (.git, node_modules, etc.)
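
The content-hashing bullet above can be pictured with a few lines of Node. This is a minimal sketch of the general technique using the built-in crypto and fs modules, not the skill's internal implementation:

const crypto = require('crypto');
const fs = require('fs');

// Hash a file's contents with MD5 by streaming, so large files
// never need to fit in memory.
function hashFile(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(path)
      .on('data', chunk => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

// Group paths by digest; any group with two or more entries is a duplicate set.
async function groupByHash(paths) {
  const groups = new Map();
  for (const p of paths) {
    const digest = await hashFile(p);
    if (!groups.has(digest)) groups.set(digest, []);
    groups.get(digest).push(p);
  }
  return [...groups.values()].filter(g => g.length > 1);
}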

✅ Removal Options

  • Auto-delete duplicates (keep newest/oldest)

  • Interactive review before deletion

  • Move to archive instead of delete

  • Preserve permissions and metadata

  • Dry-run mode (preview changes)

✅ Analysis Tools

  • Duplicate count summary

  • Space savings estimation

  • Largest duplicate files

  • Most common duplicate patterns

  • Detailed report generation

✅ Safety Features

  • Confirmation prompts before deletion

  • Backup to archive folder

  • Size threshold (don't remove huge files by mistake)

  • Whitelist important directories

  • Undo functionality (log for recovery)

Installation

clawhub install file-deduplicator

Quick Start

Find Duplicates in Directory

const result = await findDuplicates({
  directories: ['./documents', './downloads', './projects'],
  options: {
    method: 'content',  // content-based comparison
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);

Remove Duplicates Automatically

const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',  // keep newest, delete oldest
    action: 'delete',  // or 'move' to archive
    autoConfirm: false  // show confirmation for each
  }
});

console.log(`Removed ${result.filesRemoved} duplicates`);
console.log(`Space saved: ${result.spaceSaved}`);

Dry-Run Preview

const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',
    action: 'delete',
    dryRun: true  // Preview without actual deletion
  }
});

console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
  console.log(`${i+1}. ${dup.file}`);
});

Tool Functions

findDuplicates

Find duplicate files across directories.

Parameters:

  • directories (array|string, required): Directory paths to scan

  • options (object, optional):

    • method (string): 'content' | 'size' | 'name' - comparison method
    • includeSubdirs (boolean): Scan recursively (default: true)
    • minSize (number): Minimum size in bytes (default: 0)
    • maxSize (number): Maximum size in bytes (default: 0, meaning no upper limit)
    • excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
    • whitelist (array): Directories to never scan (default: [])

Returns:

  • duplicates (array): Array of duplicate groups

  • duplicateCount (number): Number of duplicate groups found

  • totalFiles (number): Total files scanned

  • scanDuration (number): Time taken to scan (ms)

  • spaceWasted (number): Total bytes wasted by duplicates

  • spaceSaved (number): Potential savings if duplicates removed
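
For orientation, a result might look like the following. The values are hypothetical, and the shape of each group entry (files, totalSize) is inferred from the examples later in this README:

// Hypothetical findDuplicates result, based on the fields above.
const result = {
  duplicates: [
    { files: ['./a/report.pdf', './b/report.pdf'], totalSize: 2048000 }
  ],
  duplicateCount: 1,
  totalFiles: 5421,
  scanDuration: 830,     // ms
  spaceWasted: 1024000,  // bytes
  spaceSaved: 1024000    // bytes
};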

removeDuplicates

Remove duplicate files based on findings.

Parameters:

  • directories (array|string, required): Same as findDuplicates

  • options (object, optional):

    • keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
    • action (string): 'delete' | 'move' | 'archive'
    • archivePath (string): Where to move files when action='move'
    • dryRun (boolean): Preview without actual action
    • autoConfirm (boolean): Auto-confirm deletions
    • sizeThreshold (number): Don't remove files larger than this

Returns:

  • filesRemoved (number): Number of files removed/moved

  • spaceSaved (number): Bytes saved

  • groupsProcessed (number): Number of duplicate groups handled

  • logPath (string): Path to action log

  • errors (array): Any errors encountered

analyzeDirectory

Analyze a single directory for duplicates.

Parameters:

  • directory (string, required): Path to directory

  • options (object, optional): Same as findDuplicates options

Returns:

  • fileCount (number): Total files in directory

  • totalSize (number): Total bytes in directory

  • duplicateSize (number): Bytes in duplicate files

  • duplicateRatio (number): Percentage of files that are duplicates
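
A short usage sketch, assuming analyzeDirectory takes object-style arguments like findDuplicates above:

const stats = await analyzeDirectory({
  directory: '~/Downloads',
  options: { method: 'content' }
});

// duplicateRatio is a percentage, per the Returns list above.
console.log(`${stats.fileCount} files, ${stats.totalSize} bytes total`);
console.log(`${stats.duplicateSize} bytes duplicated (${stats.duplicateRatio}%)`);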

Use Cases

Digital Hoarder Cleanup

  • Find duplicate photos/videos

  • Identify wasted storage space

  • Remove old duplicates, keep newest

  • Clean up download folders

Document Management

  • Find duplicate PDFs, docs, reports

  • Keep latest version, archive old versions

  • Prevent version confusion

  • Reduce backup bloat

Project Cleanup

  • Find duplicate source files

  • Remove duplicate build artifacts

  • Clean up node_modules duplicates

  • Save storage on SSD/HDD

Backup Optimization

  • Find duplicate backup files

  • Remove redundant backups

  • Identify what's actually duplicated

  • Save space on backup drives

Configuration

Edit config.json (the inline comments below are for illustration; strict JSON does not allow comments, so strip them in the real file):

{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,  // exact match only
    "nameSimilarity": 0.7,  // 0-1, lower = more similar
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,  // 10MB threshold
    "autoConfirm": false,
    "dryRunDefault": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}

Methods

Content-Based

  • Fast MD5 hashing

  • Detects exact duplicates regardless of filename

  • Works across renamed files

  • Perfect for documents, code, archives

Size-Based

  • Compares file sizes

  • Faster than content hashing

  • Good for media files where content hashing is slow

  • Finds near-duplicates (similar but not exact; see the tolerance sketch after this list)
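
A minimal sketch of how a size tolerance (the sizeTolerancePercent setting in the Configuration section) could flag near matches; illustrative only, not the skill's internal code:

// Two files are size-near-duplicates if their sizes differ by at most
// tolerancePercent of the larger size. 0 means exact match only.
function sizesMatch(sizeA, sizeB, tolerancePercent = 0) {
  const larger = Math.max(sizeA, sizeB);
  if (larger === 0) return true;  // two empty files
  return (Math.abs(sizeA - sizeB) / larger) * 100 <= tolerancePercent;
}

sizesMatch(1000, 1000);     // true  (exact)
sizesMatch(1000, 1040, 5);  // true  (under 5% apart)
sizesMatch(1000, 1200, 5);  // false (about 17% apart)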

Name-Based

  • Compares filenames

  • Detects similarly named files (a similarity sketch follows this list)

  • Good for finding version duplicates (file_v1, file_v2)
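
One common way to score name similarity on the 0-1 scale used by the nameSimilarity config value is bigram overlap (the Sørensen-Dice coefficient). A sketch of that idea, not necessarily the skill's exact metric:

// Sørensen-Dice similarity over character bigrams: 1 for identical
// names, near 0 for unrelated ones.
function nameSimilarity(a, b) {
  if (a === b) return 1;
  const bigrams = s => {
    const out = new Map();
    for (let i = 0; i < s.length - 1; i++) {
      const bg = s.slice(i, i + 2).toLowerCase();
      out.set(bg, (out.get(bg) || 0) + 1);
    }
    return out;
  };
  const ba = bigrams(a), bb = bigrams(b);
  let shared = 0;
  for (const [bg, count] of ba) shared += Math.min(count, bb.get(bg) || 0);
  const total = Math.max(a.length - 1, 0) + Math.max(b.length - 1, 0);
  return total > 0 ? (2 * shared) / total : 0;
}

nameSimilarity('report_v1.pdf', 'report_v2.pdf');  // high; likely a version pair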

Examples

Find Duplicates in Documents

const result = await findDuplicates({
  directories: '~/Documents',
  options: {
    method: 'content',
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate sets`);
result.duplicates.slice(0, 5).forEach((set, i) => {
  console.log(`Set ${i+1}: ${set.files.length} files`);
  console.log(`  Total size: ${set.totalSize} bytes`);
});

Remove Duplicates, Keep Newest

const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    keep: 'newest',
    action: 'delete'
  }
});

console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);

Move to Archive Instead of Delete

const result = await removeDuplicates({
  directories: '~/Downloads',
  options: {
    keep: 'newest',
    action: 'move',
    archivePath: '~/Documents/Archive'
  }
});

console.log(`Archived ${result.filesRemoved} files`);
console.log(`Safe in: ~/Documents/Archive`);

Dry-Run Preview Changes

const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    dryRun: true  // Just show what would happen
  }
});

console.log('=== Dry Run Preview ===');
result.duplicates.forEach((set, i) => {
  console.log(`Would delete: ${set.toDelete.join(', ')}`);
});

Performance

Scanning Speed

  • Small directories (<1000 files): <1s

  • Medium directories (1000-10000 files): 1-5s

  • Large directories (10000+ files): 5-20s

Detection Accuracy

  • Content-based: finds exact duplicates reliably (accidental MD5 collisions are negligible in practice)

  • Size-based: Fast but may miss renamed files

  • Name-based: Detects naming patterns only

Memory Usage

  • Hash cache: ~1MB per 100,000 files

  • Batch processing: processes 1000 files at a time (sketched after this list)

  • Peak memory: ~200MB for 1M files
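
The batching noted above is essentially a chunking loop. A sketch of the pattern, reusing a hashFile helper like the one earlier in this README:

// Hash paths in fixed-size batches so in-flight I/O stays bounded
// no matter how large the directory is.
async function hashInBatches(paths, batchSize = 1000) {
  const hashes = new Map();
  for (let i = 0; i < paths.length; i += batchSize) {
    const batch = paths.slice(i, i + batchSize);
    const digests = await Promise.all(batch.map(p => hashFile(p)));
    batch.forEach((p, j) => hashes.set(p, digests[j]));
  }
  return hashes;
}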

Safety Features

Size Thresholding

Won't remove files larger than a configurable threshold (default: 10MB). This prevents accidental deletion of important large files.
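
For example, to archive duplicates while leaving anything over 100MB untouched:

const result = await removeDuplicates({
  directories: '~/Videos',
  options: {
    keep: 'newest',
    action: 'move',
    archivePath: '~/Videos/Archive',
    sizeThreshold: 104857600  // 100MB: larger files are skipped
  }
});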

Archive Mode

Move files to archive directory instead of deleting. No data loss, full recoverability.

Action Logging

All deletions/moves are logged to file for recovery and audit.

Undo Functionality

Log file can be used to restore accidentally deleted files (limited undo window).
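
A sketch of what recovery could look like, assuming a JSON Lines log with a from/to path per moved file; the actual log format may differ, so check the logPath returned by removeDuplicates:

const fs = require('fs');

// Hypothetical recovery: replay a log of moved files in reverse,
// putting each archived copy back where it came from.
// Assumes the archive lives on the same filesystem as the originals.
function undoMoves(logPath) {
  const entries = fs.readFileSync(logPath, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map(line => JSON.parse(line));  // assumed shape: {action, from, to}
  for (const e of entries.reverse()) {
    if (e.action === 'move' && fs.existsSync(e.to)) {
      fs.renameSync(e.to, e.from);
    }
  }
}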

Error Handling

Permission Errors

  • Clear error message

  • Suggest running with sudo

  • Skip files that can't be accessed

File Lock Errors

  • Detect locked files

  • Skip and report

  • Suggest closing applications using files

Space Errors

  • Check available disk space before deletion (see the statfs sketch after this list)

  • Warn if space is critically low

  • Prevent disk-full scenarios
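
On Node 18.15+, a pre-flight space check like the one described above can be done with fs.promises.statfs; a sketch, with an illustrative threshold:

const { statfs } = require('fs/promises');

// Warn when the filesystem containing `path` is running low;
// bavail * bsize is the space available to unprivileged users.
async function checkFreeSpace(path, minFreeBytes = 500 * 1024 * 1024) {
  const s = await statfs(path);
  const free = s.bavail * s.bsize;
  if (free < minFreeBytes) {
    console.warn(`Low disk space: only ${free} bytes free under ${path}`);
  }
  return free;
}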

Troubleshooting

Not Finding Expected Duplicates

  • Check detection method (content vs size vs name)

  • Verify exclude patterns aren't too broad

  • Check if files are in whitelisted directories

  • Confirm includeSubdirs is true so nested folders are scanned

Deletion Not Working

  • Check write permissions on directories

  • Verify dryRun isn't enabled and that confirmation prompts (autoConfirm: false) aren't going unanswered

  • Check size threshold isn't blocking all deletions

  • Check file locks (is another program using files?)

Slow Scanning

  • Reduce includeSubdirs scope

  • Use size-based detection (faster)

  • Exclude large directories (node_modules, .git)

  • Process directories individually instead of batch

Tips

Best Results

  • Use content-based detection for documents (100% accurate)

  • Run dry-run first to preview changes

  • Archive instead of delete for important files

  • Check the action log if anything unexpected was deleted

Performance Optimization

  • Process frequently used directories first

  • Use size threshold to skip large media files

  • Exclude hidden directories from scan

  • Process directories in parallel when possible (see the sketch below)
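
The last tip can be as simple as fanning scans out with Promise.all, using the findDuplicates API documented above:

// Scan several roots concurrently and total the results.
const dirs = ['~/Documents', '~/Downloads', '~/Pictures'];
const results = await Promise.all(
  dirs.map(d => findDuplicates({ directories: d, options: { method: 'content' } }))
);
const totalGroups = results.reduce((n, r) => n + r.duplicateCount, 0);
console.log(`${totalGroups} duplicate groups across ${dirs.length} roots`);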

Space Management

  • Regular duplicate cleanup prevents storage bloat

  • Delete temp directories regularly

  • Clear download folders of installers

  • Empty trash before large scans

Roadmap

  • [ ] Duplicate detection by image similarity

  • [ ] Near-duplicate detection (similar but not exact)

  • [ ] Duplicate detection across network drives

  • [ ] Cloud storage integration (S3, Google Drive)

  • [ ] Automatic scheduling of scans

  • [ ] Heuristic duplicate detection (ML-based)

  • [ ] Recover deleted files from backup

  • [ ] Duplicate detection by file content similarity (not just hash)

License

MIT


Find duplicates. Save space. Keep your system clean. 🔮