How to "unextract" a zip file?

I extracted a zip file into a non-empty folder. The zip file has lots of files and a deep hierarchy, which merged with the existing tree of the target directory. How can I remove the files and directories that were created by unzipping without destroying the files and directories that were already there? Of course, I still have the zip file that I merged in, so the information is there.


Solution 1:

jjlin's answer is the way to go. I just want to add a few choices for directories:

  • Delete all extracted files, no directories:

    unzip -lqq file.zip | gawk -F"  " '{print $NF;}' |
      while IFS= read -r n; do rm "$n"; done
    
  • Delete extracted files and empty directories only

    unzip -lqq file.zip | gawk -F"  " '{print $NF;}' |
      while IFS= read -r n; do rm "$n"; done; rmdir *
    

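    With no options, rmdir deletes only empty directories; it leaves files and non-empty folders alone, so you can safely run it on *. Note that * matches only top-level entries, so empty directories nested deeper in the tree are not removed.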

  • Delete everything extracted, but prompt for a confirmation before each deletion:

    unzip -lqq file.zip | gawk -F"  " '{print $NF;}' |
      while IFS= read -r n; do rm -ri "$n"; done; rmdir *
    

    The -i flag makes rm prompt before every removal; you can answer yes or no for each item.

  • Delete everything extracted, directories included:

    unzip -lqq file.zip | gawk -F"  " '{print $NF;}' |
      while IFS= read -r n; do rm -rf "$n"; done
    

Solution 2:

You can use unzip -lqq <filename.zip> to list the contents of the zip file; this will include some extraneous info that you'll need to filter out, though. Here's a command that works for me:

unzip -lqq file.zip | awk '{print $4;}' | xargs rm -rf

The awk command extracts just the names of the files and directories. Then the result gets passed to xargs to delete everything. I suggest doing a dry-run of the command (i.e., by omitting the xargs rm -rf part) first to make sure the results are correct.

The above command will have issues dealing with paths that have whitespace. This (more complicated) version should fix that:

unzip -lqq file.zip | awk '{$1=$2=$3=""; sub(/ */, "", $0); printf "%s%s", $0, "\0"}' | xargs -0 rm -rf
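
As a rough sketch of the dry run suggested above, the helper below reads one path per line and only reports what a removal pass would touch. The function name preview_removal and its output labels are made up for illustration, and paths containing newlines are not handled.

```shell
# Hypothetical helper: read one path per line on stdin and report what a
# removal pass would touch, without deleting anything.
preview_removal() {
  while IFS= read -r path; do
    if [ -d "$path" ]; then
      printf 'dir:  %s\n' "$path"      # exists and is a directory
    elif [ -e "$path" ] || [ -L "$path" ]; then
      printf 'file: %s\n' "$path"      # exists (file or symlink)
    else
      printf 'gone: %s\n' "$path"      # nothing at this path
    fi
  done
}

# Possible use with the listing command from this answer:
#   unzip -lqq file.zip | awk '{print $4;}' | preview_removal
```

Entries reported as "dir:" are the ones to look at closely, since rm -rf on them would also delete anything that was in those directories before the extraction.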

Solution 3:

With the switch -Z1, unzip will list exactly one file per line (and nothing else).

This way, you can use

unzip -Z1 file.zip | xargs -I {} rm '{}'

to delete all files extracted from the zip file.

The command

unzip -Z1 file.zip | xargs -I {} rm -rf '{}'

will delete directories as well, but you have to be careful. If the directories already existed before extracting the zip file, all pre-existing files in those directories will be deleted as well.
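
One way to avoid that risk, sketched here under a few assumptions, is to delete only the listed files and then try rmdir on the listed directories, deepest first. The helper name remove_listed is hypothetical. It assumes the listing names directories with a trailing "/" (as unzip -Z1 normally prints them), that the archive contains explicit directory entries, and that no name contains a newline.

```shell
# Hypothetical helper: stdin is one archive member per line, as printed by
# `unzip -Z1 file.zip`. Directory members are assumed to end in "/".
remove_listed() {
  dirs=$(mktemp) || return 1
  while IFS= read -r name; do
    case $name in
      */) printf '%s\n' "$name" >> "$dirs" ;;  # remember directories for later
      *)  rm -f -- "$name" ;;                   # plain files are removed now
    esac
  done
  # Reverse sort puts deeper paths before their parents. rmdir refuses to
  # remove a non-empty directory, so pre-existing content survives.
  sort -r "$dirs" | while IFS= read -r d; do
    rmdir -- "$d" 2>/dev/null || true
  done
  rm -f -- "$dirs"
}

# Possible use:
#   unzip -Z1 file.zip | remove_listed
```

A directory that was empty before the extraction and is listed in the archive would still be removed, since the two cases cannot be told apart afterwards.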


If you're going to re-extract the zip file anyway, there's another approach that is guaranteed to deal with strange file names.

First extract the zip file where you originally meant to extract it:

unzip file.zip -d elsewhere

Now, change into the directory where you extracted the files by mistake and execute the following command:

find elsewhere -type f -printf "%P\0" | xargs -0 -I {} rm '{}'
  • -type f only finds files (no directories).

  • %P\0 is the relative path (without elsewhere/), followed by a null character.

  • -0 makes xargs split its input on null characters instead of whitespace. This is more reliable since, in theory, file names can contain newline characters.
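
As a more cautious variation on the same idea (not part of the original answer), the sketch below removes a file only if its content is identical to the copy in the clean extraction. That protects pre-existing files that share a name with an archive member but were not overwritten. The helper name remove_matching and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: $1 is the clean reference extraction ("elsewhere"),
# $2 is the directory that received the files by mistake.
remove_matching() {
  # Resolve the target to an absolute path before changing directory.
  target=$(cd "$2" && pwd) || return 1
  ( cd "$1" && find . -type f -exec sh -c '
      target=$1; shift
      for rel in "$@"; do
        # cmp -s is silent; it succeeds only when both files exist and match.
        if cmp -s "$rel" "$target/$rel"; then
          rm -- "$target/$rel"
        fi
      done' sh "$target" {} + )
}

# Possible use, run from the directory that was polluted:
#   remove_matching /path/to/elsewhere .
```

Passing the names through find -exec ... {} + avoids parsing a text listing, so unusual file names are handled without extra quoting.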


To deal with leftover directories, you can execute the command:

find -type d -exec rmdir -p {} \; 2> /dev/null
  • -type d only finds directories.

  • -exec rmdir -p {} \; executes rmdir -p {} for every directory that has been found.

    {} is the directory that has been found, and the -p switch makes rmdir remove its empty parent directories as well.

  • 2> /dev/null suppresses the error messages that will arise from trying to delete non-empty or previously deleted directories.


Related man pages:

  • find
  • rmdir
  • xargs
  • zipinfo

Solution 4:

Here is an even easier and safer (I think) solution:

zip -m getmeoutofhere.zip `unzip -lqq myoriginalzipfile.zip`
rm getmeoutofhere.zip

What this is doing: The backquoted unzip command will produce a list of what was in your original file.

zip -m will then use that list to add each of those files to getmeoutofhere.zip and remove it from the original directory (so theoretically getmeoutofhere.zip should be identical to myoriginalzipfile.zip).

The downside is that unzip -lqq will also produce some extra text: dates, times, file sizes, etc. These will cause zip -m to print error messages, but this should have no effect (unless, in the unlikely case, you have a file whose name matches one of those extra tokens).

Please note that this will not remove any directories that were created during the original unzip.

Solution 5:

If you extracted the files such that the modification timestamp in the archive is not preserved in the extracted copies (but rather the extracted files have their usual modification time) then the right way to attack this is via modification time. All the extracted files have a newer modification timestamp than the most recently modified existing file in that directory.

Here is a simple situation.

Suppose that none of the existing files in the current directory were touched for at least 24 hours. Anything that was modified in the last 24 hours is therefore junk from the zipfile.

$ find . -mtime -1 -print0 | xargs -0 rm

This will find some directories too, but rm will leave them alone. They can be dealt with in a second pass:

$ find . -mtime -1 -type d -print0 | xargs -0 rmdir

Any directories which were recently modified were modified by the unzip job. If rmdir successfully removes them, that means they are empty. Empty directories that were touched by the unzip job were probably created by it, i.e. they came from the archive. We can't be 100% sure: it's possible that the unzip job put some files into an existing directory which was empty.
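
Before deleting anything by modification time, it can help to preview the candidates. The sketch below is a variation rather than the commands above: it uses find's -newer test against a reference file instead of -mtime, which gives a finer cutoff than a 24 hour window. The helper name list_newer_files is hypothetical, and the reference is assumed to be the most recently modified file that was already in the tree before the extraction.

```shell
# Hypothetical helper: print regular files under the current directory whose
# modification time is later than that of the reference file $1.
list_newer_files() {
  find . -type f -newer "$1" -print
}

# Possible use, where newest-old-file is a placeholder for that reference:
#   list_newer_files newest-old-file
```

This only helps in the same situation as the rest of this answer, namely when the extracted files received fresh modification times instead of the ones stored in the archive.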

If find's 24 hour granularity isn't good enough for the job, because files in the tree were modified too recently, then I'd next consider something simple: suppose that the unzip job did not put anything into existing subdirectories. That is to say, everything that was unzipped is either a file at the top level, or a new subdirectory which was not there before, which therefore contains nothing but material from the zip. Then:

$ ls -1t > filelist  # descending order of modification time

Now we open filelist in a text editor and determine the first entry in the list which did not come from the zip. We delete that entry and everything after it. What remains are the files and directories which came from the zip. First we visually inspect for issues like spaces in the names, and occurrences of quotes that need to be escaped. We can then add quotes around everything, if necessary. The following assumes you use Vim:

:%s/.*/"&"/

Then join it all into a big line:

:%j

Now insert rm -rf in front of it:

Irm -rf <ESC>

Run the line under the cursor as a shell command:

!!sh<Enter>

Definitely, I would not automate the steps of this task, due to the risk of erasing files which were already there, or screwing up due to file name issues.

If you're going to go the obvious route of obtaining a list of the paths in the zip, then capture it to a file, look over it very carefully, and turn it into a removal command only after doing any necessary editing.
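
A minimal sketch of that review-then-remove workflow follows. The helper name remove_reviewed and the convention of commenting out entries with "#" are assumptions for illustration. It expects one path per line and deliberately skips directories.

```shell
# Hypothetical helper: $1 is a hand-reviewed list with one path per line.
# Blank lines and lines starting with "#" are skipped, so an entry can be
# commented out during review instead of being deleted from the list.
remove_reviewed() {
  while IFS= read -r path; do
    case $path in
      ''|'#'*) continue ;;
    esac
    # Only remove regular files and symlinks; directories are left alone.
    if [ -f "$path" ] || [ -L "$path" ]; then
      rm -- "$path"
    fi
  done < "$1"
}

# Possible use:
#   unzip -Z1 file.zip > list.txt   # capture the listing
#   (edit list.txt by hand)
#   remove_reviewed list.txt
```

Because each line is read whole and quoted, names with spaces need no manual escaping in the list; a real file whose name begins with "#" would be skipped under this convention.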