How to grep umlauts and other accented text characters via AppleScript
Problem description
I have a problem executing shell scripts from AppleScript. I run a "grep", but as soon as the pattern contains special characters it doesn't work as intended. (The script reads a list of subfolders in a directory and checks whether any of the subfolder names appear in a file.)
This is my script:
set searchFile to "/tmp/output.txt"
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & space & searchFile
do shell script theCommand
tell application "Finder"
    set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
    set theCommand to "grep -c " & quoted form of company & space & quoted form of searchFile
    try
        do shell script theCommand
        set CompanyName to company as string
        return CompanyName
    on error
    end try
end repeat
return false
The problem occurs, for example, with strings that contain umlauts. "theCommand" is somehow encoded differently than when I type the command on the CLI directly.
$ grep -c 'Württemberg' '/tmp/output.txt' --> typed on command line
3
$ grep -c 'Württemberg' '/tmp/output.txt' --> copy & pasted from AppleScript
0
$ grep -c 'rttemberg' '/tmp/output.txt' --> no umlauts, no problems
3
The "ü" characters in the first and second lines are different; running echo 'Württemberg' | openssl base64 on each one shows this.
I tried several encoding tricks at different places, basically everything I could find or think of.
Does anyone have any idea? How can I check which encoding a string has?
Thanks in advance! Sebastian
Recommended answer
Overview
This can be made to work by escaping each accented character in each company name before it is used in the grep command.
So, you'll need to escape each of those characters (i.e. those which have an accent) with a backslash, written as \\ inside an AppleScript string literal, where the first backslash escapes the second. For example:
- The ü in Württemberg will need to become \\ü
- The ö in Königsberg will need to become \\ö
- The ß in Einbahnstraße will need to become \\ß
These accented characters, such as a u with diaeresis, are evidently being encoded differently. Which encoding they receive is difficult to ascertain. My assumption is that the encoding pattern used begins with a backslash, which would explain why escaping those characters with backslashes fixes the issue. For comparison, in C/C++ source code the ü is written as \u00FC.
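To check how a given string is actually encoded, its bytes can be dumped and compared. The sketch below is only an illustration under an assumption that is not confirmed in the original discussion: that one "ü" is the precomposed character (U+00FC) and the other is a plain "u" followed by a combining diaeresis (U+0308). Both render identically, but grep typically does not normalize Unicode, so a pattern in one form will not match text in the other. The variable names and the temporary file are hypothetical.

```shell
#!/bin/sh
# Two byte sequences that both render as "Württemberg".
# Assumption: precomposed U+00FC versus "u" + combining U+0308.
composed=$(printf 'W\303\274rttemberg')
decomposed=$(printf 'Wu\314\210rttemberg')

# Hex-dump a string so that differing bytes become visible.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

composed_hex=$(hex "$composed")
decomposed_hex=$(hex "$decomposed")
echo "composed:   $composed_hex"
echo "decomposed: $decomposed_hex"

# Write the composed form to a sample file and search it with both forms.
sample=$(mktemp)
printf '%s\n' "$composed" > "$sample"
hits_same=$(grep -c "$composed" "$sample")
hits_other=$(grep -c "$decomposed" "$sample" || true)  # grep exits 1 on no match
echo "same form: $hits_same, other form: $hits_other"
rm -f "$sample"
```

Running the same od pipeline on the string copied from AppleScript and on the string typed in the terminal should reveal whether their bytes differ in this way.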
In the complete script below you'll notice the following:
- set accentedChars to {"ü", "ö", "ß", "á", "ė"} has been added to hold a list of all characters that need to be escaped. You'll need to list each one explicitly, as there doesn't seem to be a way to infer whether a character has an accent.
- Before assigning the grep command to the theCommand variable, we first escape the necessary characters via the line reading:
set company to escapeChars(company, accentedChars)
As you can see here, we are passing two arguments to the escapeChars sub-routine (i.e. the non-escaped company variable and the list of accented characters).
In the escapeChars sub-routine we iterate over each char in the accentedChars list and invoke the findAndReplace sub-routine. This prefixes every instance of those characters found in the company variable with a backslash.
Complete script:
set searchFile to "/tmp/output.txt"
set accentedChars to {"ü", "ö", "ß", "á", "ė"}
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & ¬
    space & searchFile
do shell script theCommand
tell application "Finder"
    set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
    set company to escapeChars(company, accentedChars)
    set theCommand to "grep -c " & quoted form of company & ¬
        space & quoted form of searchFile
    try
        do shell script theCommand
        set CompanyName to company as string
        return CompanyName
    on error
    end try
end repeat
return false

(**
 * Checks each character of a given word. If any characters of the word
 * match a character in the given list of characters they will be escaped.
 *
 * @param {text} searchWord - The word to check the characters of.
 * @param {list} charactersList - List of characters to be escaped.
 * @returns {text} The new text with the item(s) escaped.
 *)
on escapeChars(searchWord, charactersList)
    repeat with char in charactersList
        set searchWord to findAndReplace(char, ("\\" & char), searchWord)
    end repeat
    return searchWord
end escapeChars

(**
 * Replaces all occurrences of findString with replaceString.
 *
 * @param {text} findString - The text string to find.
 * @param {text} replaceString - The replacement text string.
 * @param {text} searchInString - Text string to search.
 * @returns {text} The new text with the item(s) replaced.
 *)
on findAndReplace(findString, replaceString, searchInString)
    set oldTIDs to text item delimiters of AppleScript
    set text item delimiters of AppleScript to findString
    set searchInString to text items of searchInString
    set text item delimiters of AppleScript to replaceString
    set searchInString to "" & searchInString
    set text item delimiters of AppleScript to oldTIDs
    return searchInString
end findAndReplace
Note about current counts:
Currently your grep command (with the -c option) only reports the number of lines on which the word was found, not how many instances of the word were found.
If you want the actual number of instances of the word, use the -o option with grep to output each occurrence on its own line, then pipe that to wc with the -l option to count the lines. For example:
grep -o 'Württemberg' /tmp/output.txt | wc -l
and in your AppleScript that would be:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l"
Tip: If you want to remove the leading spaces in the count that gets logged, pipe it to sed to strip the spaces. For example, via your script:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l | sed -e 's/ //g'"
and the equivalent via the command line:
grep -o 'Württemberg' /tmp/output.txt | wc -l | sed -e 's/ //g'
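The difference between the two counts can be checked with a small self-contained sketch. The sample file content and the variable names below are made up for illustration; tr is used here instead of sed to strip the padding that some wc implementations add.

```shell
#!/bin/sh
# Build a small sample file (hypothetical content, for illustration only).
sample=$(mktemp)
printf '%s\n' \
  'Württemberg and Baden-Württemberg' \
  'no match on this line' \
  'Württemberg' > "$sample"

# -c counts the lines that contain at least one match.
line_count=$(grep -c 'Württemberg' "$sample")

# -o prints each match on its own line; wc -l then counts those lines.
match_count=$(grep -o 'Württemberg' "$sample" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"   # lines: 2, matches: 3
rm -f "$sample"
```

One caveat when using the piped form from AppleScript: the exit status of a pipeline is that of its last command, so the command succeeds even when grep finds nothing. The try / on error check in the script would then no longer detect "no match", and the returned count would need to be compared against zero instead.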