Get list of files, that exist in one folder but not another, as determined by content, not filename, on Windows
Solution 1:
The strings returned from PowerShell's Get-FileHash can be used as the keys to a hashtable to associate content with the fully-qualified path. This code creates a hashtable for each path with the following caveats.
- Empty files are ignored, as the content hash of all empty files will be identical. (You could also create a list of empty files found in each directory.)
- If duplicates are found within the directory we are indexing, only the first file found will be added to the $HashOut table. The $Dups table will have the list of all paths that share identical content.
PowerShell:
Function Get-DirHash ( [String]$PathIn , [PSObject]$HashOut )
{
    # Fills $HashOut with hash -> first path found; $Dups is read from the caller's scope.
    $HashOut.Clear()
    gci $PathIn *.txt -Recurse | ? Length -gt 0 | Get-FileHash | %{
        If ( $HashOut.Contains($_.Hash) )
        {
            # Content already seen: record every path that shares it.
            If ( $Dups.Contains($_.Hash) )
            {
                $Dups[$_.Hash] += $_.Path
            }
            Else
            {
                $Dups.Add( $_.Hash , @( $HashOut[$_.Hash] , $_.Path ))
            }
        }
        Else
        {
            $HashOut.Add( $_.Hash , $_.Path )
        }
    }
}
$DirA = 'c:\whatever'
$DirB = 'c:\whenever'
$TableA = @{}
$TableB = @{}
$Dups = @{}    # hash -> all paths sharing that content (filled by Get-DirHash)
$Unique2A = New-Object System.Collections.Generic.List[String]
Get-DirHash -PathIn $DirA -HashOut $TableA
Get-DirHash -PathIn $DirB -HashOut $TableB
# Keep the paths from $DirA whose content hash never appears in $DirB.
$TableA.Keys | %{
    If ( ! ( $TableB.Contains($_) ))
    {
        $Unique2A.Add( $TableA[$_] )
    }
}
$Unique2A | Out-GridView
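The first caveat above mentions that empty files could be listed separately. A minimal sketch of that idea follows; the function name Get-EmptyFile is hypothetical and not part of the original script, and it assumes the same *.txt filter and PowerShell 3.0 or later (for -File).

```powershell
# Hypothetical helper (not part of the original script): returns the full
# paths of zero-length *.txt files under a directory.
Function Get-EmptyFile ( [String]$PathIn )
{
    Get-ChildItem -Path $PathIn -Filter *.txt -File -Recurse |
        Where-Object Length -eq 0 |
        ForEach-Object FullName
}

# Example usage with the directories from the script above:
# $EmptyA = @( Get-EmptyFile -PathIn 'c:\whatever' )
# $EmptyB = @( Get-EmptyFile -PathIn 'c:\whenever' )
```

Wrapping the call in @( ) keeps the result an array even when zero or one file is found.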
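Not fully tested, but I believe this will do the trick while computing hashes only for files that share a size with at least one other file. Note that content duplicated within $DirA but absent from $DirB is not reported, because its hash group has more than one member.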
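$DirA = 'c:\whatever'
$DirB = 'c:\whenever'
# Anchor the pattern so that e.g. 'c:\whatever2' is not mistaken for $DirA.
$TestA = '^' + [Regex]::Escape($DirA.TrimEnd('\')) + '(\\|$)'
$MasterList = gci $DirA , $DirB -Filter *.txt -File -Recurse | Group Length
# A size that occurs only once cannot have a content match anywhere.
$Unique2A_BySize = @( $MasterList | ? Count -eq 1 |
    ? { $_.Group[0].DirectoryName -match $TestA } |
    %{ $_.Group.FullName } )
# Hash only the files that share a size with at least one other file.
$Unique2A_ByHash = @( $MasterList | ? Count -gt 1 | %{
    $_.Group | Get-FileHash | Group Hash |
        ? Count -eq 1 |
        ? { $_.Group[0].Path -match $TestA } |
        %{ $_.Group.Path }
} )
# Both sides are arrays, so + concatenates lists rather than strings.
( $Unique2A = $Unique2A_BySize + $Unique2A_ByHash ) | Out-GridView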
Which may be condensed into the harder-to-read:
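$DirA = 'c:\whatever'
$DirB = 'c:\whenever'
# Anchor the pattern so that e.g. 'c:\whatever2' is not mistaken for $DirA.
$TestA = '^' + [Regex]::Escape($DirA.TrimEnd('\')) + '(\\|$)'
# @( ) keeps both operands arrays, so + concatenates lists rather than strings.
$Unique2A = @( ( $MasterList = gci $DirA , $DirB -Filter *.txt -File -Recurse | Group Length ) |
        ? Count -eq 1 |
        ? { $_.Group[0].DirectoryName -match $TestA } |
        %{ $_.Group.FullName } ) +
    @( $MasterList | ? Count -gt 1 | %{
        $_.Group | Get-FileHash | Group Hash |
            ? Count -eq 1 |
            ? { $_.Group[0].Path -match $TestA } |
            %{ $_.Group.Path }
    } )
$Unique2A | Out-GridView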
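Both size-based versions report a file only when its hash group has exactly one member, so content that appears several times in $DirA but never in $DirB is left out. The following is a minimal sketch of one way to include it while keeping the size pre-filter. The function name Get-UniqueContent and its parameters Source and Reference are hypothetical, it assumes neither directory is nested inside the other, and it is a sketch rather than a tested replacement.

```powershell
# Hypothetical variant: returns every *.txt path under $Source whose content
# does not occur anywhere under $Reference, including duplicates within $Source.
Function Get-UniqueContent ( [String]$Source , [String]$Reference )
{
    # Anchored pattern that matches paths located under $Source.
    $SrcRoot = (Resolve-Path -LiteralPath $Source).ProviderPath.TrimEnd('\','/')
    $Test = '^' + [Regex]::Escape($SrcRoot) + '([\\/]|$)'

    Get-ChildItem -Path $Source , $Reference -Filter *.txt -File -Recurse |
        Group-Object Length |
        ForEach-Object {
            $Files = @( $_.Group )
            # Files of this size that are not under $Source.
            $Other = @( $Files | Where-Object { $_.FullName -notmatch $Test } )
            If ( $Other.Count -eq 0 )
            {
                # No file of this size outside $Source: no hashing needed.
                $Files.FullName
            }
            Else
            {
                # Keep hash groups whose members all live under $Source.
                $Files | Get-FileHash | Group-Object Hash |
                    Where-Object {
                        @( $_.Group | Where-Object { $_.Path -notmatch $Test } ).Count -eq 0
                    } |
                    ForEach-Object { $_.Group.Path }
            }
        }
}

# Example usage:
# Get-UniqueContent -Source 'c:\whatever' -Reference 'c:\whenever' | Out-GridView
```

Unlike the first script, this lists every copy of content that is duplicated within the source directory instead of only the first one found.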