Is there something akin to regex in AppleScript, and if not, what's the alternative?
I need to parse the first 10 chars of a file name to see if they are all digits. The obvious way to do this is fileName =~ m/^\d{10}/ but I'm not seeing anything regex-like in the AppleScript reference, so I'm curious what other options I have for this validation.
Don't despair: since OS X you can also access sed and grep through "do shell script". So:
set thecommandstring to "echo \"" & filename & "\"|sed \"s/[0-9]\\{10\\}/*good*(&)/\"" as string
set sedResult to do shell script thecommandstring
set isgood to sedResult starts with "*good*"
My sed skills aren't too crash hot, so there might be a more elegant way than prefixing *good* to any name that matches [0-9]{10} and then looking for *good* at the start of the result. But basically, if filename is "1234567890foo.mov" this will run the command:
echo "1234567890foo.mov"|sed "s/[0-9]\{10\}/*good*(&)/"
Note the escaped quotes \" and escaped backslash \\ in the AppleScript. If you're escaping things in the shell you have to escape the escapes. So to run a shell script that has a backslash in it you have to escape it for the shell like \\ and then escape each backslash in AppleScript like \\\\. This can get pretty hard to read.
So anything you can do on the command line you can do by calling it from AppleScript (woohoo!). Anything written to stdout gets returned to the script as the result.
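On the "more elegant way" point: one possibility, sketched here at the shell level and not the only way to do it, is to let sed print the name only when it matches, so an empty result means no match. The function name check_prefix is made up for illustration; the pattern is anchored with ^ so the ten digits have to sit at the start of the name.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer above): prints the name back
# only when its first 10 characters are all digits; prints nothing otherwise.
check_prefix() {
    # -n suppresses sed's default output; /regex/p prints matching lines only.
    printf '%s\n' "$1" | sed -n '/^[0-9]\{10\}/p'
}

check_prefix "1234567890foo.mov"   # prints 1234567890foo.mov
check_prefix "abc1234567890.mov"   # prints nothing
```

From AppleScript, the same sed command could presumably be handed to do shell script using quoted form of filename instead of hand-escaped double quotes, with the result then compared against the empty string.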
There is an easier way to make use of the shell (works on bash 3.2+) for regex matching:
set isMatch to "0" = (do shell script ¬
"[[ " & quoted form of fileName & " =~ ^[[:digit:]]{10} ]]; printf $?")
Note:
- Makes use of a modern bash test expression, [[ ... ]], with the regex-matching operator, =~; not quoting the right operand (or at least the special regex chars.) is a must on bash 3.2+, unless you prepend shopt -s compat31;
- The do shell script statement executes the test and returns its exit code via an additional command (thanks, @LauriRanta); "0" indicates success.
- Note that the =~ operator does not support shortcut character classes such as \d and assertions such as \b (true as of OS X 10.9.4 - this is unlikely to change anytime soon).
- For case-INsensitive matching, prepend the command string with shopt -s nocasematch;
- For locale-awareness, prepend the command string with export LANG='" & user locale of (system info) & ".UTF-8';
- If the regex contains capture groups, you can access the captured strings via the built-in ${BASH_REMATCH[@]} array variable.
- As in the accepted answer, you'll have to \-escape double quotes and backslashes.
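To illustrate the capture-group point, here is a sketch of what the bash side might look like when run directly in a shell. The function name split_name and the sample regex are assumptions for illustration only; keeping the regex in a variable and expanding it unquoted is one common way to satisfy the "don't quote the right operand" rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the overall match followed by each capture
# group, one per line; prints nothing when the name does not match.
split_name() {
    # Regex kept in a variable and expanded UNQUOTED on the right of =~.
    local re='^([[:digit:]]{10})(.*)\.([[:alpha:]]+)$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[@]}"
    fi
}

split_name "1234567890foo.mov"
# prints:
#   1234567890foo.mov
#   1234567890
#   foo
#   mov
```

Element 0 of BASH_REMATCH is the whole match and elements 1 onward are the groups, which is why the first output line is the full name.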
Here's an alternative using egrep:
set isMatch to "0" = (do shell script ¬
"egrep -q '^\\d{10}' <<<" & quoted form of filename & "; printf $?")
Though this presumably performs worse, it has two advantages:
- You can use shortcut character classes such as \d and assertions such as \b
- You can more easily make matching case-INsensitive by calling egrep with -i.
You canNOT, however, gain access to sub-matches via capture-groups; use the [[ ... =~ ... ]] approach if that is needed.
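As an illustration of the -i option, a small sketch at the shell level. The helper name is_movie and the sample pattern are made up; it uses grep -E, which behaves like egrep, and sticks to constructs that do not rely on \d, since support for that shortcut varies between grep implementations.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the name ends in
# ".mov", ignoring case.
is_movie() {
    # -E: extended regex, -q: no output (status only), -i: ignore case.
    printf '%s\n' "$1" | grep -Eqi '\.mov$'
}

if is_movie "1234567890foo.MOV"; then
    echo "match"
else
    echo "no match"
fi
# prints: match
```

Appending "; printf $?" to such a command, as in the answer above, is what turns the exit status into text that do shell script can return.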
Finally, here are utility functions that package both approaches (the syntax highlighting is off, but they do work):
# SYNOPSIS
# doesMatch(text, regexString) -> Boolean
# DESCRIPTION
# Matches string s against regular expression (string) regex using egrep's extended regular expression language *including*
# support for shortcut classes such as `\d`, and assertions such as `\b`, and *returns a Boolean* to indicate if
# there is a match or not.
# - AppleScript's case sensitivity setting is respected; i.e., matching is case-INsensitive by default, unless inside
# a 'considering case' block.
# - The current user's locale is respected.
# EXAMPLE
# my doesMatch("127.0.0.1", "^(\\d{1,3}\\.){3}\\d{1,3}$") # -> true
on doesMatch(s, regex)
    local ignoreCase, extraGrepOption
    set ignoreCase to "a" is "A"
    if ignoreCase then
        set extraGrepOption to "i"
    else
        set extraGrepOption to ""
    end if
    # Note: So that classes such as \w work with different locales, we need to set the shell's locale explicitly to the current user's.
    # Rather than let the shell command fail we return the exit code and test for "0" to avoid having to deal with exception handling in AppleScript.
    tell me to return "0" = (do shell script "export LANG='" & user locale of (system info) & ".UTF-8'; egrep -q" & extraGrepOption & " " & quoted form of regex & " <<< " & quoted form of s & "; printf $?")
end doesMatch
# SYNOPSIS
# getMatch(text, regexString) -> { overallMatch[, captureGroup1Match ...] } or {}
# DESCRIPTION
# Matches string s against regular expression (string) regex using bash's extended regular expression language and
# *returns the matching string and substrings matching capture groups, if any.*
#
# - AppleScript's case sensitivity setting is respected; i.e., matching is case-INsensitive by default, unless this subroutine is called inside
# a 'considering case' block.
# - The current user's locale is respected.
#
# IMPORTANT:
#
# Unlike doesMatch(), this subroutine does NOT support shortcut character classes such as \d.
# Instead, use one of the following POSIX classes (see `man re_format`):
# [[:alpha:]] [[:word:]] [[:lower:]] [[:upper:]] [[:ascii:]]
# [[:alnum:]] [[:digit:]] [[:xdigit:]]
# [[:blank:]] [[:space:]] [[:punct:]] [[:cntrl:]]
# [[:graph:]] [[:print:]]
#
# Also, `\b`, '\B', '\<', and '\>' are not supported; you can use `[[:<:]]` for '\<' and `[[:>:]]` for `\>`
#
# Always returns a *list*:
# - an empty list, if no match is found
# - otherwise, the first list element contains the matching string
# - if regex contains capture groups, additional elements return the strings captured by the capture groups; note that *named* capture groups are NOT supported.
# EXAMPLE
# my getMatch("127.0.0.1", "^([[:digit:]]{1,3})\\.([[:digit:]]{1,3})\\.([[:digit:]]{1,3})\\.([[:digit:]]{1,3})$") # -> { "127.0.0.1", "127", "0", "0", "1" }
on getMatch(s, regex)
    local ignoreCase, extraCommand
    set ignoreCase to "a" is "A"
    if ignoreCase then
        set extraCommand to "shopt -s nocasematch; "
    else
        set extraCommand to ""
    end if
    # Note:
    # So that classes such as [[:alpha:]] work with different locales, we need to set the shell's locale explicitly to the current user's.
    # Since `quoted form of` encloses its argument in single quotes, we must set compatibility option `shopt -s compat31` for the =~ operator to work.
    # Rather than let the shell command fail we return '' in case of non-match to avoid having to deal with exception handling in AppleScript.
    tell me to do shell script "export LANG='" & user locale of (system info) & ".UTF-8'; shopt -s compat31; " & extraCommand & "[[ " & quoted form of s & " =~ " & quoted form of regex & " ]] && printf '%s\\n' \"${BASH_REMATCH[@]}\" || printf ''"
    return paragraphs of result
end getMatch
I recently had need of regular expressions in a script, and wanted to find a scripting addition to handle it, so it would be easier to read what was going on. I found Satimage.osax, which lets you use syntax like below:
find text "n(.*)" in "to be or not to be" with regexp
The only downside is that (as of 11/08/2010) it's a 32-bit addition, so it throws errors when it's called from a 64-bit process. This bit me in a Mail rule for Snow Leopard, as I had to run Mail in 32-bit mode. Called from a standalone script, though, I have no reservations - it's really great, and lets you pick whatever regex syntax you want, and use back-references.
Update 5/28/2011
Thanks to Mitchell Model's comment below for pointing out they have updated it to be 64-bit, so no more reservations - it does everything I need.
I'm sure there is an AppleScript addition or a shell script that can be called to bring regex into the fold, but I avoid dependencies for the simple stuff. I use this style of pattern all the time...
set filename to "1234567890abcdefghijkl"
return isPrefixGood(filename)

on isPrefixGood(filename) -- returns boolean
    set legalCharacters to {"1", "2", "3", "4", "5", "6", "7", "8", "9", "0"}
    -- A name shorter than 10 characters cannot have a 10-digit prefix.
    if (count of characters in filename) < 10 then return false
    set thePrefix to text 1 thru 10 of filename
    repeat with thisChr from 1 to (count of characters in thePrefix)
        set theChr to character thisChr of thePrefix
        -- Stop at the first character that is not a digit.
        if theChr is not in legalCharacters then return false
    end repeat
    return true
end isPrefixGood