How to use regex to include/exclude some input files in sc.textFile?
Solution 1:
Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.
The syntax of the glob includes:
- * (matches 0 or more characters)
- ? (matches a single character)
- [ab] (character class)
- [^ab] (negated character class)
- [a-b] (character range)
- {a,b} (alternation)
- \c (escape character)
Following the example in the accepted answer, it is possible to write your path as:
sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
At first it was not clear to me how the alternation syntax could be used here, since a comma is also used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:
sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
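To check which files a pattern like this would select without starting a Spark job, here is a rough Python sketch that approximates the glob matching locally. It is not Hadoop's implementation: expand_braces and glob_matches are hypothetical helpers, the .csv file names are invented for illustration, and Python's fnmatch is only assumed to be close enough for the *, ? and [a-b] forms used here (it differs from Hadoop for [^ab], and its * also crosses "/").

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b} alternation groups into a list of brace-free patterns."""
    group = re.search(r"\{([^{}]*)\}", pattern)
    if group is None:
        return [pattern]
    head, tail = pattern[:group.start()], pattern[group.end():]
    expanded = []
    for alternative in group.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def glob_matches(pattern, path):
    """True if path matches any brace-expanded form of pattern."""
    return any(fnmatch.fnmatchcase(path, p) for p in expand_braces(pattern))


# Hypothetical file names, one per day around the end of July 2015.
candidates = ["/user/Orders/201507%02d.csv" % day for day in range(25, 32)]
candidates.append("/user/Orders/20150801.csv")

brace_form = "/user/Orders/201507{2[7-9],3[0-1]}*"
comma_form = "/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*"

matched_by_braces = [p for p in candidates if glob_matches(brace_form, p)]
matched_by_commas = [
    p for p in candidates
    if any(glob_matches(part, p) for part in comma_form.split(","))
]

print(matched_by_braces)
print(matched_by_braces == matched_by_commas)
```

Under these assumptions, both forms select the same files, the ones dated 27 through 31 July, which is consistent with the alternation form being a drop-in replacement for the comma-separated list.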