How to use regex to include/exclude some input files in sc.textFile?

Solution 1:

Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps, SparkContext also uses those APIs to set the path.

The syntax of the glob includes:

  • * (match 0 or more character)
  • ? (match single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • \c (escape character)

Following the example in the accepted answer, it is possible to write your path as:

sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")

It's not clear how alternation syntax can be used here, since comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:

sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")