Recursively process ZIP archives to extract files while skipping files of a specific format

UPDATE: Many people are viewing this thread, which suggests that this situation is not so rare after all. I also asked a related question on SO, which has decent solutions that might solve the problem in a better way.

On my Windows 7 machine, I have a directory full of downloaded dumps in ZIP archives. Each archive contains a few text files, PDFs and, rarely, XML files. I want to extract the contents of each ZIP archive into its own folder (which must be created during the process) while skipping the PDFs. After the required files have been extracted from an archive, the processed ZIP must not be deleted (or I would like to know how I can control this in different situations).

If it helps to know, the number of archives in the directory is in the range of 60k-70k. Also, I need separate output directories because files in one archive may have the same names as files in another.

For example,

I have all my archives like one.zip, two.zip,.. in, say, D:\data
I create a new folder for processed data, say, D:\extracted
Now the data from D:\data\one.zip should go to D:\extracted\one. Here, D:\extracted\one should be created automatically.
During the whole extraction process, any PDFs encountered should be skipped, not extracted. There's no point in extracting and then deleting them.
(Optional) A log file should be maintained at, say, D:\extracted. Idea is to use this file to resume processing from where it was left in case of an error.
(Optional) Script should let me decide whether I want to keep source archives or delete them after processing.

I already searched for a solution but couldn't find one. I came across a few questions like these:

Recursively unzip files where they reside, then delete the archives
7 zip extract recursively
Is it possible to recursively list zip file contents with 7 zip without extracting

but they were not of much help (I'm not a pro with Windows, by the way). I'm open to installing safe, ad-free, open-source third-party software like 7-Zip.

EDIT: Is there a tool readily available that does what I need? I already tried Multi Unpacker. It doesn't create new directories and it can't ignore *.pdf files. It's also slow to start; I think it first reads all the archives at the source before starting to process them.

Thanks in advance!

This PowerShell script, adapted from another answer, should do what you want. Save it as a file with the extension ".ps1" and call it as ./filename.ps1. It extracts each archive into a separate folder next to the archive, deletes the zip file, and then removes all files with the .pdf extension from the extracted folder. I have not tested whether it works properly with recursive paths, but it should; please test it.

Edit: If you don't want your zip files to be deleted, remove or comment out (#) the line rmdir -Path $_.FullName -Force

Requirements: PowerShell, 7-zip and for you to set the 7-zip path in the file.

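param([string]$folderPath="D:\Blah\files")

# Walk the folder tree and handle every item whose name ends in .zip
Get-ChildItem $folderPath -recurse | %{ 

    if($_.Name -match '\.zip$')
    {
        # Extract into a folder named after the archive, next to the archive
        $parent="$(Split-Path $_.FullName -Parent)";    
        write-host "Extracting $($_.FullName) to $parent"

        # "e" extracts without the archive's internal folder structure; use "x" to keep full paths
        $arguments=@("e", "`"$($_.FullName)`"", "-o`"$($parent)\$($_.BaseName)`"");
        $ex = start-process -FilePath "`"C:\Program Files\7-Zip\7z.exe`"" -ArgumentList $arguments -wait -PassThru;

        # 7-Zip returns exit code 0 when extraction completed without errors
        if( $ex.ExitCode -eq 0)
        {
            write-host "Extraction successful, deleting $($_.FullName)"
            rmdir -Path $_.FullName -Force
            # Remove the PDFs that were extracted along with everything else
            $arguments1="$($parent)\$($_.BaseName)\*.pdf"
            rmdir -Recurse -Path $arguments1
        }
    }
}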
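
The script above extracts the PDFs and only then deletes them, which the question wanted to avoid. As a rough sketch of an alternative, assuming Python 3 is installed (it is not mentioned anywhere in this thread), the standard zipfile module can skip PDFs during extraction. The function name extract_all, its parameters and the log file name processed.log are made up for illustration, and this has not been tried on 60k-70k archives.

```python
import zipfile
from pathlib import Path


def extract_all(src_dir, out_dir, skip_ext=".pdf", delete_source=False):
    """Extract every .zip under src_dir into its own folder under out_dir,
    skipping members whose name ends with skip_ext. Returns the number of
    archives extracted in this run."""
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hypothetical log file: one processed archive path per line,
    # read back on the next run so finished archives are not repeated.
    log_path = out / "processed.log"
    done = set()
    if log_path.exists():
        done = set(log_path.read_text(encoding="utf-8").splitlines())

    count = 0
    with log_path.open("a", encoding="utf-8") as log:
        for archive in sorted(src.rglob("*.zip")):
            if str(archive) in done:
                continue  # handled in an earlier run
            # D:\data\one.zip -> D:\extracted\one (subfolders are mirrored)
            target = out / archive.relative_to(src).with_suffix("")
            try:
                target.mkdir(parents=True, exist_ok=True)
                with zipfile.ZipFile(archive) as zf:
                    wanted = [name for name in zf.namelist()
                              if not name.lower().endswith(skip_ext)]
                    zf.extractall(target, members=wanted)
            except (zipfile.BadZipFile, OSError) as err:
                print(f"FAILED {archive}: {err}")
                continue  # not logged, so it is retried next time
            log.write(f"{archive}\n")
            log.flush()
            count += 1
            if delete_source:
                archive.unlink()  # only after a successful extraction
    return count
```

Called as extract_all(r"D:\data", r"D:\extracted"), this would put the contents of D:\data\one.zip into D:\extracted\one and leave the archives in place; passing delete_source=True removes each archive once it has been extracted. Archives already listed in the log are skipped on a rerun, and archives that failed are not logged, so they are tried again.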
