How to grep umlauts and other accented text characters via AppleScript


Problem description

I have a problem trying to execute shell scripts from AppleScript. I do a "grep", but as soon as the pattern contains special characters it doesn't work as intended. (The script reads a list of subfolders in a directory and checks if any of the subfolder names appear in a file.)

Here is my script:

set searchFile to "/tmp/output.txt"

set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & space & searchFile
do shell script theCommand

tell application "Finder"
    set companies to get name of folders of folder ("/path/" as POSIX file)
end tell

repeat with company in companies
    set theCommand to "grep -c " & quoted form of company & space & quoted form of searchFile

    try
        do shell script theCommand
        set CompanyName to company as string
        return CompanyName
    on error

    end try
end repeat

return false

The problem is e.g. with strings with umlauts. "theCommand" is somehow encoded differently than when I type it on the CLI directly.

$ grep -c 'Württemberg' '/tmp/output.txt' --> typed on command line
3
$ grep -c 'Württemberg' '/tmp/output.txt' --> copy & pasted from AppleScript
0
$ grep -c 'rttemberg' '/tmp/output.txt'   --> no umlauts, no problems
3

The "ü" in the first line and the "ü" in the second line are different; running echo 'Württemberg' | openssl base64 on each shows this.

I tried several encoding tricks at different places, basically everything I could find or think of.

Does anyone have any idea? How can I check which encoding a string has?

Thanks in advance! Sebastian

Answer

Overview

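This can be made to work by escaping each accented character in each company name before the name is used in the grep command.

So, you'll need to escape each of those characters (i.e. those which have an accent) with a backslash. In AppleScript source the backslash is written doubled (i.e. \\), because \\ inside a string literal produces a single backslash in the resulting text. For example: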

  • The ü in Württemberg will need to become \\ü
  • The ö in Königsberg will need to become \\ö
  • The ß in Einbahnstraße will need to become \\ß

Why is this necessary:

These accented characters, such as a u with diaeresis, are evidently being encoded differently. Exactly which encoding they receive is difficult to ascertain. My assumption is that the encoded form begins with a backslash, which would explain why escaping those characters with backslashes fixes the issue. For comparison, in C/C++ source code the u with diaeresis (ü) is written as \u00FC.
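To see how two visually identical strings differ, it can help to dump their bytes instead of guessing at the encoding. The shell sketch below rests on an assumption that the original post does not confirm: that one "ü" is the single code point U+00FC and the other is a plain u followed by the combining diaeresis U+0308. hex_bytes is a hypothetical helper name.

```shell
# Print the bytes of a string as space-separated hex values.
hex_bytes() {
  # Word splitting collapses od's padding; "$*" rejoins with single spaces.
  set -- $(printf '%s' "$1" | od -An -tx1)
  echo "$*"
}

# Octal escapes keep the example independent of the editor's own encoding.
# Assumption: these are the two forms involved; the post does not say so.
precomposed=$(printf 'W\303\274rttemberg')   # ü as one code point, U+00FC
decomposed=$(printf 'Wu\314\210rttemberg')   # u followed by U+0308

hex_bytes "$precomposed"
hex_bytes "$decomposed"
```

If the two dumps differ like this, grep is being handed two different byte sequences even though the terminal renders them identically.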


Solution

In the complete script below you'll notice the following:

  1. set accentedChars to {"ü", "ö", "ß", "á", "ė"} has been added to hold a list of all characters that will need to be escaped. You'll need to explicitly state each one as there doesn't seem to be a way to infer whether the character has an accent.
  2. Before assigning the grep command to the theCommand variable, we first escape the necessary characters via the line reading:

    set company to escapeChars(company, accentedChars)
    

    As you can see here we are passing two arguments to the escapeChars sub-routine, (i.e. the non-escaped company variable and the list of accented characters).

  3. In the escapeChars sub-routine we iterate over each char in the accentedChars list and invoke the findAndReplace sub-routine. This prefixes every instance of those characters found in the company variable with a backslash.
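For comparison, the same escaping step can be sketched in shell. This is an illustration only and not part of the original script: escape_chars is a hypothetical function name, and the character list passed to it simply mirrors accentedChars.

```shell
# Prefix every listed character in a word with a single backslash.
# Usage: escape_chars WORD CHAR...
escape_chars() {
  word=$1
  shift
  for c in "$@"; do
    # "\\\\" inside double quotes reaches sed as \\, i.e. one literal backslash.
    word=$(printf '%s' "$word" | sed "s/$c/\\\\$c/g")
  done
  printf '%s\n' "$word"
}

escape_chars 'Württemberg' ü ö ß á ė
escape_chars 'Einbahnstraße' ü ö ß á ė
```

As in the AppleScript version, only characters that appear in the list are touched, so a name without any of them comes back unchanged.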

Complete script:

set searchFile to "/tmp/output.txt"
set accentedChars to {"ü", "ö", "ß", "á", "ė"}

set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & ¬
  space & searchFile
do shell script theCommand

tell application "Finder"
  set companies to get name of folders of folder ("/path/" as POSIX file)
end tell

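repeat with company in companies
  -- Remember the unescaped name so the returned value has no backslashes.
  set CompanyName to company as string
  set company to escapeChars(company, accentedChars)

  set theCommand to "grep -c " & quoted form of company & ¬
    space & quoted form of searchFile

  try
    do shell script theCommand
    return CompanyName
  on error
    -- grep exits with a non-zero status when nothing matches; try the next folder.
  end try
end repeat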

return false

(**
 * Checks each character of a given word. If any characters of the word
 * match a character in the given list of characters they will be escaped.
 *
 * @param {text} searchWord - The word to check the characters of.
 * @param {list} charactersList - List of characters to be escaped.
 * @returns {text} The new text with the item(s) replaced.
 *)
on escapeChars(searchWord, charactersList)
  repeat with char in charactersList
    set searchWord to findAndReplace(char, ("\\" & char), searchWord)
  end repeat
  return searchWord
end escapeChars

(**
 * Replaces all occurrences of findString with replaceString
 *
 * @param {text} findString - The text string to find.
 * @param {text} replaceString - The replacement text string.
 * @param {text} searchInString - Text string to search.
 * @returns {text} The new text with the item(s) replaced.
 *)
on findAndReplace(findString, replaceString, searchInString)
  set oldTIDs to text item delimiters of AppleScript
  set text item delimiters of AppleScript to findString
  set searchInString to text items of searchInString
  set text item delimiters of AppleScript to replaceString
  set searchInString to "" & searchInString
  set text item delimiters of AppleScript to oldTIDs
  return searchInString
end findAndReplace


Note about current counts:

Currently your grep command (with -c) only reports the number of lines on which the word was found, not how many instances of the word were found.

If you want the actual number of instances of the word then use the -o option with grep to output each occurrence. Then pipe that to wc with the -l option to count the number of lines. For example:

grep -o 'Württemberg' /tmp/output.txt | wc -l

and in your AppleScript that would be:

set theCommand to "grep -o " & quoted form of company & space & ¬
  quoted form of searchFile & "| wc -l"

Tip: If you want to remove the leading spaces in the count that gets logged, pipe it to sed to strip the spaces. For example, via your script:

set theCommand to "grep -o " & quoted form of company & space & ¬
  quoted form of searchFile & "| wc -l | sed -e 's/ //g'"

and the equivalent via the command line:

grep -o 'Württemberg' /tmp/output.txt | wc -l | sed -e 's/ //g'
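The difference between the two counts is easiest to see on a small sample. The sketch below uses a hypothetical temporary file and an unaccented word, so it shows only the counting behaviour and not the encoding issue.

```shell
# Hypothetical sample: four lines, three of them matching, four occurrences.
sample=$(mktemp)
printf '%s\n' 'Baden Baden' 'Baden' 'Berlin' 'Baden again' > "$sample"

# -c counts matching lines; -o prints each match on its own line for wc -l.
lines=$(grep -c 'Baden' "$sample")
hits=$(grep -o 'Baden' "$sample" | wc -l | sed -e 's/ //g')

echo "matching lines: $lines"
echo "occurrences: $hits"
rm -f "$sample"
```

The first line of the sample contains the word twice, which is why the two numbers differ.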
