Recursively search files with exclusions and inclusions

Solution 1:

tl;dr

Something similar to

find /local/data/ \
   ! -path '/local/data/database/session*' \
   -o -path '/local/data/database/session_*.db'

Preamble

There are no simple --include and --exclude directives in the implementations of find I know. You can, however, build a sequence of tests that works as you wish, because the mechanism of tests in find is deliberately designed to allow any (even a custom) test based on any criteria (i.e. not necessarily on the pathname). To do what you want, you need to translate your exclude/include patterns into a sequence of tests. To do this properly you need to know how find works. Its mechanism is more general than the concept of excluding/including.

Here I will rely mostly on the POSIX specification for find (all citations are from this document). Implementations that go beyond this specification expand the tool without changing its general philosophy.


Theory

To understand and effectively use find you need to know a few things:

  1. Terminology:

    • There are a few possible options (like -L) that may appear just after find. For the purpose of this answer they are not important.
    • Then there is one or more starting points. /local/data/ in your example is a starting point. Some implementations allow zero starting points (then . or ./ is the default starting point).
    • Everything that follows forms an expression. The expression consists of zero or more supported operands: primaries like -name, -exec; operators like -o, ( (which often should be escaped or quoted to protect it from the shell) or !. Some of them require custom additional operands (e.g. patterns) that also belong to the expression.

  2. Almost everything in the expression is a test. The manual for GNU find on my Ubuntu divides supported operands into categories: tests, actions etc. Still, most of them can be treated as tests; i.e. any primary returns either true or false, which affects what find does next. In this answer I use the word "test" in a very broad sense.

  3. find starts from the specified starting point and recursively descends the directory hierarchy in a certain sequence. Some operands can alter the sequence (-depth) or even reduce it (-prune).

  4. find evaluates the expression for each file separately.

  5. find evaluates the expression from left to right. Some implementations rearrange tests for performance, but only when this does not affect the overall outcome (not just the output to stdout; note that -exec can do anything). Even then the expression should work as if it were evaluated from left to right. Some operands work regardless of their position in the expression though (-depth, -xdev).

  6. For a given file some part(s) of the expression may not be evaluated at all. The operators -a, -o, ( ), ! define the logic of the expression.

    The primaries can be combined using the following operators (in order of decreasing precedence):

    ( expression )
    True if expression is true.

    ! expression
    Negation of a primary; the unary NOT operator.

    expression [-a] expression
    Conjunction of primaries; the AND operator is implied by the juxtaposition of two primaries or made explicit by the optional -a operator. The second expression shall not be evaluated if the first expression is false.

    expression -o expression
    Alternation of primaries; the OR operator. The second expression shall not be evaluated if the first expression is true.

    Imagine -test1, -test2 and -test3 are tests find understands. Let the expression be

    ! -test1 -test2 -o -test3
    

    which is equivalent to

    ( ( ! -test1 ) -a -test2 ) -o -test3
    

    In a shell the full commands would be respectively:

    find /starting/point ! -test1 -test2 -o -test3
    find /starting/point \( \( ! -test1 \) -a -test2 \) -o -test3
    

    Possible outcomes:

    • -test1 is evaluated for every file tested.
      • If -test1 is false, ( ! -test1 ) is true. Then -test2 is evaluated because this is how -a works.
        • If -test2 is false, the expression in the outer parentheses is false. Then -test3 is evaluated because this is how -o works.
          • If -test3 is false, the entire expression is false.
          • If -test3 is true, the entire expression is true.
        • If -test2 is true, the expression in the outer parentheses is true. Then -test3 is not evaluated because this is how -o works. The entire expression is true.
      • If -test1 is true, ( ! -test1 ) is false. Then -test2 is not evaluated because this is how -a works. The expression in the outer parentheses is false. Then -test3 is evaluated because this is how -o works.
        • If -test3 is false, the entire expression is false.
        • If -test3 is true, the entire expression is true.

    Note that logically ( ( NOT A ) AND B ) OR C is equivalent to C OR ( B AND ( NOT A ) ), but with find the following expressions are not equivalent; in general they are pairwise different:

    ! -test1 -test2 -o -test3
    -test2 ! -test1 -o -test3
    -test3 -o ! -test1 -test2
    -test3 -o -test2 ! -test1
    

    This is especially true if one or more tests are -exec. Often -exec is used to conditionally do something, so it comes after other tests (conditions) and we would rather call it an action, not a test. But you can also write a custom test with -exec, and this is very powerful; in such a case -exec may even be the first test, the one that is always evaluated. It is not only the logical outcome (true or false) of -exec that makes find perform or skip later tests for the file: what -exec does (e.g. imagine it removes some accompanying files) can also affect later tests (for the same file or even for other files), possibly in a non-obvious way.

  7. Parentheses are important. Problems where -o seems to misbehave are often solved by using parentheses.

  8. In some circumstances -print is implicitly added:

    If no expression is present, -print shall be used as the expression. Otherwise, if the given expression does not contain any of the primaries -exec, -ok, or -print, the given expression shall be effectively replaced by:

    ( given_expression ) -print
    

    Notes

    • In this case -print will be evaluated (performed) iff the given expression evaluates to true. Above, where I wrote "the entire expression is false" or "the entire expression is true", I meant what matters for the implicit -print (if applicable).
    • Implementations may expand the set "-exec, -ok, -print" with other (non-POSIX) primaries.
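
To see point 6 in action, here is a minimal sketch (my own demo, not something from the specification). It builds the expression ! -test1 -test2 -o -test3 out of three -exec primaries used as custom tests; each one prints its name and exits with a status I choose (0 means true). The helper name run and the scratch file are made up for the demo.

```shell
#!/bin/sh
# Trace which tests find evaluates for:  ! T1 T2 -o T3
# "run A B C" makes T1, T2, T3 exit with status A, B, C (0 = true, 1 = false).
# The helper "run" and the file name "file" are hypothetical.
tmp=$(mktemp -d)
: > "$tmp/file"

run() {
    find "$tmp/file" \
        ! -exec sh -c 'printf "test1 "; exit "$1"' sh "$1" \; \
          -exec sh -c 'printf "test2 "; exit "$1"' sh "$2" \; \
       -o -exec sh -c 'printf "test3 "; exit "$1"' sh "$3" \;
}

t1_false_t2_true=$(run 1 0 0)    # -a continues, then -o skips test3
t1_false_t2_false=$(run 1 1 0)   # all three tests are evaluated
t1_true=$(run 0 0 0)             # -a skips test2, -o continues with test3

printf '%s\n' "$t1_false_t2_true" "$t1_false_t2_false" "$t1_true"
rm -rf "$tmp"
```

Each output line lists the tests that were actually evaluated for the file. Because the expression contains -exec, no implicit -print is added, so the only output comes from the tests themselves.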

Solution

The question is about exclusions/inclusions based on pathnames. The following primaries are useful:

  • -name pattern
    The primary shall evaluate as true if the basename of the current pathname matches pattern using the pattern matching notation […]

  • -path pattern
    The primary shall evaluate as true if the current pathname matches pattern using the pattern matching notation […]

  • -prune
    The primary shall always evaluate as true; it shall cause find not to descend the current pathname if it is a directory. If the -depth primary is specified, the -prune primary shall have no effect.

(Terms like "basename" or "pathname" are defined in the POSIX Base Definitions.)

Implementations may add other useful primaries (e.g. -regex, -iname).

Often -prune is the right way to exclude the content of the given directory (with or without the directory itself). But it totally prevents find from entering the directory; so if you want to find (include) some files in the directory anyway, then you cannot use -prune.
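
A minimal sketch of both variants ("with or without the directory itself"), assuming a made-up layout with keep/f and skip/g in a scratch directory:

```shell
#!/bin/sh
# Hypothetical layout: keep/f and skip/g inside a temporary directory.
tmp=$(mktemp -d)
mkdir "$tmp/keep" "$tmp/skip"
: > "$tmp/keep/f"
: > "$tmp/skip/g"

# Exclude the directory and its content: skip is pruned and not printed.
without_dir=$(cd "$tmp" && find . -name skip -prune -o -print | LC_ALL=C sort)

# Exclude only the content: skip is pruned but still printed.
with_dir=$(cd "$tmp" && find . -name skip -prune -print -o -print | LC_ALL=C sort)

printf 'without the directory:\n%s\n' "$without_dir"
printf 'with the directory:\n%s\n' "$with_dir"
rm -rf "$tmp"
```

In both commands find never descends into skip, so skip/g is not even tested.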

I think you want this:

  • Print pathname of each file in the directory hierarchy starting from /local/data/,
  • but don't if it matches /local/data/database/session*,
  • but do if it matches /local/data/database/session_*.db.

The following find command should do it:

find /local/data/ \
   ! -path '/local/data/database/session*' \
   -o -path '/local/data/database/session_*.db'

where \ before a newline tells the shell the command continues on the next line. Quoting is important (you probably know this, since you quoted in the question).

It works like this:

  • For each file under (and including) the starting point but not matching the exclusion pattern, ! -path … is true; the second test is not performed and the entire expression is true.
  • For each file under (and including) the starting point and matching the exclusion pattern, ! -path … is false; only then is the second test performed.
    • If the second test is true, the entire expression is true.
    • If the second test is false, the entire expression is false.

Notes:

  • This is a case where the implicit -print is added.
  • These tests in the reverse order would work as well.
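
If you want to try this without touching real data, here is a sketch on a scratch copy of the layout. The file names (readme, other.txt, session_1.db, session_1.tmp) are made up, and the starting point is the relative path data instead of /local/data/, so the patterns are adjusted accordingly (see pitfall 4 below).

```shell
#!/bin/sh
# Hypothetical scratch tree that mimics /local/data/database/.
tmp=$(mktemp -d)
mkdir -p "$tmp/data/database"
: > "$tmp/data/readme"
: > "$tmp/data/database/other.txt"
: > "$tmp/data/database/session_1.db"
: > "$tmp/data/database/session_1.tmp"

# Exclude session*, but include session_*.db (implicit -print applies).
found=$(cd "$tmp" && find data \
    ! -path 'data/database/session*' \
    -o -path 'data/database/session_*.db' | LC_ALL=C sort)

# The same two tests in the reverse order.
reversed=$(cd "$tmp" && find data \
    -path 'data/database/session_*.db' \
    -o ! -path 'data/database/session*' | LC_ALL=C sort)

printf '%s\n' "$found"
rm -rf "$tmp"
```

Both commands should list everything except data/database/session_1.tmp.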

General case

With parentheses, -a, -o and ! you can create quite complex exclude+include schemes. In particular:

  • nested (e.g. exclude ./foo/*, but include ./foo/bar/*, but exclude ./foo/bar/baz/*, but …);
  • based on criteria other than pathnames (e.g. totally exclude directories owned by root).

It may not be easy, though, to create expressions that implement complex schemes flawlessly.
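
As an illustration, the nested scheme from the first bullet could be sketched like this. The tree (x, foo/a, foo/bar/b, foo/bar/baz/c) is made up, and this is only one possible way to write the expression.

```shell
#!/bin/sh
# Hypothetical tree: x, foo/a, foo/bar/b, foo/bar/baz/c.
tmp=$(mktemp -d)
mkdir -p "$tmp/foo/bar/baz"
: > "$tmp/x"
: > "$tmp/foo/a"
: > "$tmp/foo/bar/b"
: > "$tmp/foo/bar/baz/c"

# Exclude ./foo/*, but include ./foo/bar/*, but exclude ./foo/bar/baz/*.
# -a binds tighter than -o, so this reads as:
#   ( ! foo/* ) -o ( foo/bar/* -a ! foo/bar/baz/* )
nested=$(cd "$tmp" && find . \
    ! -path './foo/*' \
    -o -path './foo/bar/*' ! -path './foo/bar/baz/*' | LC_ALL=C sort)

printf '%s\n' "$nested"
rm -rf "$tmp"
```

Note that ./foo/bar itself is not printed: it matches ./foo/* (excluded) and does not match ./foo/bar/* (which needs something after the slash). The directory ./foo/bar/baz is printed for the opposite reason. Details like these are exactly where complex schemes tend to go wrong.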


Pitfalls

  1. Metacharacters (e.g. *) in patterns do not treat / or . specially. The fragment session_*.db matches session_5.db, it also matches session_foo/bar/baz.db.

  2. In cases when you can use -prune, remember -prune evaluates as true. With implicit -print this may surprise you. That's why I wrote "-prune is the right way to exclude the content of the given directory (with or without the directory itself)".

  3. In cases when you can use -prune, make sure it gets evaluated when you need it.

    Example:

    mkdir -p test/ab/a; cd test
    
    find .    -name 'a*' -print        -o -name '*b' -prune             #1
    find .    -name '*b' -prune        -o -name 'a*' -print             #2
    find .    -name '*b' -prune -print -o -name 'a*' -print             #3
    find . \( -name '*b' -prune        -o -name 'a*'        \) -print   #4
    find .    -name '*b' -prune        -o -name 'a*'                    #5
    

    In the first case the directory named ab will be printed and not pruned. In the second case it will be pruned and not printed. In the third case it will be pruned and printed once. The fourth case is equivalent to the third; -print has been placed after the parentheses (like a common factor in math). The fifth case is equivalent to the fourth; -print is implicit.

    The first case is an example of a more general problem (bug), where some file (here ab directory) never reaches the test designed for it and the right action, because it accidentally matches an earlier test designed with other files in mind, and triggers an unwanted action.

  4. Pathnames used by -path are what find "thinks" they are, not what realpath would print. Patterns must take this into account.

    Example:

    cd /bin && find .    -path '/bin*'   # will find nothing
    cd /bin && find .    -path '.*'      # will find "everything"
    cd /bin && find /bin -path '/bin*'   # will find "everything"
    cd /bin && find /bin -path '.*'      # will find nothing
    

    Similarly for a starting point the basename used by -name depends on the exact representation of the starting point. Edge cases, but still:

    • / for /, ///, //// etc.
    • . for ., ./, /., /bin/., /bin/../. etc.
    • .. for .., /.., /../../, ///bin/.. etc.

  5. Each starting point defines a separate hierarchy. The tool doesn't care if the hierarchies overlap.

    Example: if /bin/bash and /bin/dash exist, the following command will find bash four times (with three different pathnames) and dash three times (with two different pathnames):

    cd /bin && find . /bin /bin ../bin/bash -name '[bd]ash'
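
Coming back to pitfall 1, here is a small sketch showing * matching across slashes, using the two names from that point in a scratch directory. The extra ! -path test is just one possible way to restrict the match to a single path component; it is my assumption about what you might want, not the only option.

```shell
#!/bin/sh
# Scratch tree with session_5.db and session_foo/bar/baz.db.
tmp=$(mktemp -d)
mkdir -p "$tmp/session_foo/bar"
: > "$tmp/session_5.db"
: > "$tmp/session_foo/bar/baz.db"

# * is free to match slashes, so both files match.
loose=$(cd "$tmp" && find . -path './session_*.db' | LC_ALL=C sort)

# Additionally reject pathnames with a slash anywhere after "session_".
strict=$(cd "$tmp" && find . -path './session_*.db' ! -path './session_*/*' | LC_ALL=C sort)

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
rm -rf "$tmp"
```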