How to strip a Hebrew text of vowels and punctuation in AppleScript?
Solution 1:
ASCII number is deprecated and doesn't work correctly with Unicode text; use id of someCharacter instead:
set charNum to id of "בְּ" -- this returns 3 ids because "בְּ" is a composed character (a base letter plus two combining marks)
log charNum
set charNum to id of "ב"
log charNum
-->result:
(*1489, 1456, 1468*)
(*1489*)
I do not know how to strip these marks in pure AppleScript, but you can use a perl command in a do shell script:
-- The text may not display correctly in this code block, but it will be correct once the script is compiled
set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃
וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃
וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃
וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"
return do shell script "perl -CSD -pe 'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g; s~ +~ ~g;' <<< " & quoted form of theString
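Here is a brief explanation of the perl script:
- the -CSD option: the output and the error stream will be in UTF-8, and the input is assumed to be in UTF-8
- s~\\p{NonspacingMark}~~og: removes the nonspacing marks, which include the Hebrew vowel points and cantillation marks (the \\ is how AppleScript writes a single backslash inside a string)
- s~־|׀~ ~g: replaces every ־ and ׀ with a space
- s~ +~ ~g: replaces multiple spaces in a row with one space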
If your AppleScript reads the text from a file, you can use perl to read the file:
do shell script "perl -CSD -pe 'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g; s~ +~ ~g;' < " & quoted form of posix path of pathOfTheTextFile
The file must be encoded in UTF-8.
Another solution is to use Cocoa-AppleScript (AppleScriptObjC):
use framework "Foundation"
use scripting additions
-- The text may not display correctly in this code block, but it will be correct once the script is compiled
set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃
וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃
וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃
וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"
return stripString(theString)
on stripString(t)
    set sourceString to current application's NSMutableString's stringWithString:t
    set myOpt to current application's NSRegularExpressionSearch
    -- Strip the combining marks (vowel points and cantillation marks)
    set theSuccess to sourceString's applyTransform:(current application's NSStringTransformStripCombiningMarks) |reverse|:false range:(current application's NSMakeRange(0, (sourceString's |length|))) updatedRange:(missing value)
    if theSuccess then
        -- Replace all "־" and "׀" by a space; each character must be separated by a vertical bar character, e.g. "a|d|z"
        sourceString's replaceOccurrencesOfString:"־|׀" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))
        -- Replace multiple spaces in a row by one space
        sourceString's replaceOccurrencesOfString:" +" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))
        return sourceString as string -- convert the NSString object to an AppleScript string
    end if
    return "" -- else, the transform was not applied
end stripString
In reply to the comment: for a droplet, the script needs an on open handler, like this:
on open theseFiles
    repeat with f in theseFiles
        set cleanText to do shell script "perl -CSD -pe 'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g; s~ +~ ~g;' " & quoted form of POSIX path of f
        -- do something with that cleanText
    end repeat
end open
If you want to edit the files in place, the perl script needs the -i option followed by a backup extension (e.g. '.bak'). This will create a backup of each file (it adds ".bak" after the name):
on open theseFiles
    repeat with f in theseFiles -- *** create a backup and edit the file in-place ***
        do shell script "perl -i'.bak' -CSD -pe 'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g; s~ +~ ~g;' " & quoted form of POSIX path of f
    end repeat
end open
If you don't want a backup of each file, the perl script needs the -i option with an empty extension (''), like this:
-- *** edit the file in-place without backup ***
do shell script "perl -i'' -CSD -pe 'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g; s~ +~ ~g;' " & quoted form of POSIX path of f