How do I search for lines in a file that only contain ASCII characters and then act on them?

You can use sed for this job, even though it doesn't support the [[:ascii:]] character class. Instead, we can specify all ASCII characters as a range of decimal escape sequences, [\d0-\d127] (the \d escapes are a GNU sed extension), as long as we use the C or POSIX locale.

Here's a command that should be reliable:

LC_ALL=C sed -r ':a;N;s|^([\d0-\d127]+)\n([\d0-\d127]+)$|\1 / \2|;ta' file

Notes

  • LC_ALL=C Use C locale settings only for this command (otherwise you get an error)
  • -r Use extended regex to make the command more readable (we need fewer backslashes). GNU sed also recognises -E with the same meaning.
  • :a Label - loop starts here
  • ; Separates commands, like in the shell
  • N Read the next line into the pattern space, so we can replace \n
  • s|old|new| Replace old with new
  • ^([\d0-\d127]+)\n([\d0-\d127]+)$ - match two lines with only ASCII and capture the first line in \1 and the second line in \2. ^ is start of line, \n is a newline, and $ is end of line, so ^line 1\nline 2$ tests the whole of line 1 and line 2.
  • \1 / \2 The first and second lines, separated by  /  instead of a newline.
  • ta - If the last search-and-replace command succeeded, execute the loop again. This allows us to process all the lines of the file, handling any instances where there are more than two all-ASCII lines together.
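
One thing to watch for with this loop: when the substitution fails, sed prints both lines in the pattern space and starts the next cycle with a fresh pair, so as far as I can tell two all-ASCII lines that straddle a pair boundary (lines 2 and 3, say) are not joined. A possible sliding-window variant is sketched below, assuming GNU sed. It prints and deletes only the first line with P and D, so the second line gets another chance on the next pass. The sample file and its contents are made up for illustration.

```shell
# Hypothetical sample file: the two all-ASCII lines sit on lines 2 and 3.
sample=$(mktemp)
printf '%s\n' '日本語のみ' 'alpha' 'beta' 'English and 日本語' > "$sample"

# $!N  append the next line, unless we are already on the last line
# P    print the pattern space up to the first newline
# D    delete up to the first newline and restart without reading new input
joined=$(LC_ALL=C sed -r ':a;$!N;s|^([\d0-\d127]+)\n([\d0-\d127]+)$|\1 / \2|;ta;P;D' "$sample")

printf '%s\n' "$joined"
rm -f "$sample"
```

With this sample, alpha and beta should come out joined as "alpha / beta", while the other two lines pass through unchanged.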

Many thanks to Eliah Kagan for showing me how to use escape sequences to match ASCII characters.


If you want whole lines consisting only of ASCII characters, you need to anchor your pattern to the start and end of the line, e.g. with grep:

$ grep -P '^[[:ascii:]]*$' file
English words only
English words only
English words only
Also English words only
English words only

Some tools provide a whole-line flag such as grep's -x or --line-regexp:

   -x, --line-regexp
          Select  only  those  matches  that exactly match the whole line.
          For a regular expression pattern, this  is  like  parenthesizing
          the pattern and then surrounding it with ^ and $.

allowing you to use:

$ grep -Px '[[:ascii:]]*' file
English words only
English words only
English words only
Also English words only
English words only
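
To act on the matching lines rather than just print them, one option is to pipe grep's output into a shell loop. This is only a sketch: the sample file is hypothetical and the "action" (upper-casing each line) is a placeholder for whatever you actually need to do.

```shell
# Hypothetical sample file mixing ASCII-only and non-ASCII lines.
sample=$(mktemp)
printf '%s\n' 'English words only' 'English and 日本語' '日本語のみ' 'Also English words only' > "$sample"

# -n prefixes each selected line with its line number in the file.
ascii_lines=$(grep -Pxn '[[:ascii:]]*' "$sample")

# -v inverts the match and -c counts: lines with at least one non-ASCII character.
other_count=$(grep -Pxvc '[[:ascii:]]*' "$sample")

# Act on each all-ASCII line; here we simply upper-case it.
shouted=$(grep -Px '[[:ascii:]]*' "$sample" | while IFS= read -r line; do
    printf '%s\n' "$line" | tr '[:lower:]' '[:upper:]'
done)

printf '%s\n' "$ascii_lines" "$other_count" "$shouted"
rm -f "$sample"
```

IFS= and read -r keep leading whitespace and backslashes in each line intact.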

Multiline matching adds a whole other layer of complexity, since many of the common command line text processing utilities are line based. You can force grep to slurp a whole file using the -z flag (provided the file contains no NUL bytes); however, tools such as pcregrep or perl itself are probably more appropriate at that point.

The next issue you need to solve is how to interpret the concepts "start of line" and "end of line" in the context of a multiline match. Some tools provide flags for that, as described in Regex Tutorial: Anchors. perl is one of these: it provides the /m modifier. You still need to slurp the file by unsetting the default record separator (done here using -0777), for example:

$ perl -0777 -pe 's{^([[:ascii:]]+)\n([[:ascii:]]+)$}{$1 / $2}mg' file
English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
English words only / Also English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ