Having a list of paths, how can I filter out subdirectories of previously mentioned paths?
Let's say I have a sorted list of absolute paths, like the one in my answer here (shortened and modified for this question):
/proc
/proc/sys/fs/binfmt_misc
/proc/sys/fs/binfmt_misc
/run
/run/cgmanager/fs
/run/hugepages/kvm
/run/lock
/run/user/1000
/run/user/1000/gvfs
/tmp
/home/bytecommander/ramdisk
What I want is to reduce this list by eliminating all paths which are subdirectories of previously mentioned paths. That means, for the given input I want this output:
/proc
/run
/tmp
/home/bytecommander/ramdisk
How can this be done easily in the command-line using e.g. Bash, sed, awk or any other common tools? Short solutions that fit in one line are appreciated but not required.
AWK
$ awk -F '/' 'oldstr && NR>1{ if($0!~oldstr"/"){print $0;oldstr=$0}};NR == 1{print $0;oldstr=$0}' paths.txt
/proc
/run
/tmp
/home/bytecommander/ramdisk
/var/zomg
/var/zomgkthx
/zomg
/zomgkthx
The way this works is simple enough, but the order of the commands is significant. We start by recording the first line and printing it. For each following line, we check whether it contains the previously recorded path followed by "/". If it does, we do nothing. If it doesn't, it is a different, new path, so we print it and record it as the path to compare against.
The original approach was flawed and failed when there were adjacent paths with the same leading substring, such as /var/zomg and /var/zomgkthx (thanks to Chai T.Rex for pointing that out). The trick is to append "/" to the old path to mark where it ends, so that a mere leading substring no longer matches. The same approach is used in the Python alternative below.
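To see why the trailing slash matters, here is a minimal Python sketch of the prefix check with and without it. The helper name is_under is made up for illustration and does not appear in the scripts of this answer.

```python
def is_under(path, parent):
    """Return True if `path` lies strictly inside directory `parent`.

    Hypothetical helper for illustration: appending "/" to the parent
    ensures that only whole path components are compared.
    """
    return path.startswith(parent.rstrip("/") + "/")


# A bare prefix test wrongly treats a sibling as a subdirectory ...
naive = "/var/zomgkthx".startswith("/var/zomg")

# ... while the slash-terminated test tells the two cases apart.
sibling = is_under("/var/zomgkthx", "/var/zomg")
child = is_under("/run/user/1000", "/run")

print(naive, sibling, child)  # True False True
```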
Python alternative
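#!/usr/bin/env python
import sys,os

oldline = None
with open(sys.argv[1]) as f:
    for index,line in enumerate(f):
        path = line.strip()
        # Print a path only if it is not inside the last printed directory
        if index == 0 or not path.startswith(oldline):
            print(path)
            # Trailing separator so that /var/zomg does not match /var/zomgkthx
            oldline = os.path.join(path,'')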
Sample run:
$ ./reduce_paths.py paths.txt
/proc
/run
/tmp
/home/bytecommander/ramdisk
/var/zomg
/var/zomgkthx
/zomg
/zomgkthx
This approach is similar to the awk one. The idea is the same: record the first line, then print a line and reset the tracking variable only when that line does not start with the tracking variable.
Alternatively, one could use the os.path.commonprefix() function as well.
#!/usr/bin/env python
import sys,os

oldline = None
with open(sys.argv[1]) as f:
    for index,line in enumerate(f):
        path = line.strip()
        if index == 0 or os.path.commonprefix([path,oldline]) != oldline:
            print(path)
            oldline = os.path.join(path,'')
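One thing worth knowing about os.path.commonprefix() is that it compares strings character by character, not path component by path component, which is why the trailing separator is still needed here. A small sketch, assuming a POSIX system where the separator is "/" (the variable names are only for illustration):

```python
import os.path

# commonprefix() works on characters, so the result can end in the
# middle of a path component.
whole = os.path.commonprefix(["/var/zomg", "/var/zomgkthx"])
partial = os.path.commonprefix(["/usr/lib", "/usr/local"])
print(whole)    # /var/zomg
print(partial)  # /usr/l

# With the separator appended by os.path.join(path, ''), a sibling
# directory no longer shares the full tracked prefix.
# Assumption: POSIX separator, so tracked == "/var/zomg/".
tracked = os.path.join("/var/zomg", "")
sibling_is_new = os.path.commonprefix(["/var/zomgkthx", tracked]) != tracked
child_is_new = os.path.commonprefix(["/var/zomg/sub", tracked]) != tracked
print(sibling_is_new, child_is_new)  # True False
```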
Another Python version, using the new pathlib library:
#! /usr/bin/env python3
import pathlib, sys

seen = set()
for l in sys.stdin:
    p = pathlib.Path(l.strip())
    if not any(x in seen for x in p.parents):
        seen.add(p)
        print(str(p))
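The pathlib version needs no trailing-slash trick because parents yields whole ancestor directories rather than string prefixes. A short sketch of what the membership test sees, using PurePosixPath so that it behaves the same on any platform (the sample paths are taken from the examples above):

```python
import pathlib

p = pathlib.PurePosixPath("/run/user/1000/gvfs")

# parents lists every ancestor directory, nearest first.
ancestors = [str(a) for a in p.parents]
print(ancestors)  # ['/run/user/1000', '/run/user', '/run', '/']

# A path is skipped as soon as one of its ancestors was printed before.
seen = {pathlib.PurePosixPath("/run")}
skipped = any(a in seen for a in p.parents)

# A sibling with a common leading substring is not an ancestor.
seen_sibling = {pathlib.PurePosixPath("/var/zomg")}
q = pathlib.PurePosixPath("/var/zomgkthx")
sibling_skipped = any(a in seen_sibling for a in q.parents)

print(skipped, sibling_skipped)  # True False
```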