How to strip a Hebrew text of vowels and punctuation in AppleScript?

Solution 1:

ASCII number is deprecated and doesn't work correctly with unicode text, use id of someCharacter:

set charNum to id of "בְּ" -- this return id of 3 characters because "בְּ" is a composed character
log charNum
set charNum to id of "ב"
log charNum
-->result: 
(*1489, 1456, 1468*)
(*1489*)

So, I do not know how to do this in pure AppleScript.


But, you can use a perl command in a do shell script:

-- The text look not good in this code block, but it will be correct after the compilation of the script
set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"


return do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' <<< " & quoted form of theString

Here is a brief explanation of the perl script

  • the -CSD option : the output and the error will be in UTF-8, the input is assumed to be in UTF-8
  • s~\\p{NonspacingMark}~~og : Remove non spacing marks
  • s~־|׀~ ~g : Replace all ־ and ׀ by a space
  • s~ +~ ~g : Replace multiple spaces in a row by one space

If your AppleScript read the text from a file, you can use perl to read the file:

do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' < " & quoted form of posix path of pathOfTheTextFile

The encoding of the file must be utf8.


Another solution is to use a Cocoa-AppleScript:

        use framework "Foundation"
        use scripting additions
        -- The text look not good in this code block, but it will be correct after the compilation of the script
        set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"

        return stripString(theString)

        on stripString(t)
            set sourceString to current application's NSMutableString's stringWithString:t
            set myOpt to current application's NSRegularExpressionSearch
            set theSuccess to sourceString's applyTransform:(current application's NSStringTransformStripCombiningMarks) |reverse|:false range:(current application's NSMakeRange(0, (sourceString's |length|))) updatedRange:(missing value)
            if theSuccess then
                -- *** Replace all "־" and "׀" by a space, each character must be separated by a vertical bar character, e.g. "a|d|z"
                sourceString's replaceOccurrencesOfString:"־|׀" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))

                -- **** Replace multiple spaces in a row by one space
                sourceString's replaceOccurrencesOfString:" +" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))
                return sourceString as string -- convert the NSString object to an AppleScript's string
            end if
            return "" -- else, the transform was not applied
        end stripString

According to the commentary:

For a droplet, the script need an on open handler, like this:

on open theseFiles
    repeat with f in theseFiles
        set cleanText to do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f
        -- do something with that cleanText
    end repeat
end open

If you want to do an in-place editing (the perl script need the -i option + '.some name extension'):

This will create backup of each file (it add ".bak" after the name)

on open theseFiles
    repeat with f in theseFiles -- ***  create a backup and edit the file in-place ***
        do shell script "perl -i'.bak' -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f
    end repeat
end open

If you don't want a backup of each file (the perl script need the -i option + ''), like this:

-- ***  edit the file in-place without backup***
do shell script "perl -i'' -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f