Remove non-ASCII characters in a file

Question

How do I remove non-ASCII characters from a file?

Recommended answer

If you want to use Perl, do it like this:

perl -pi -e 's/[^[:ascii:]]//g' filename

Detailed explanation

The following explanation covers every part of the above command assuming the reader is unfamiliar with anything in the solution...

  • perl

Runs the perl interpreter. Perl is a programming language that is typically available on all Unix-like systems. This command needs to be run at a shell prompt.

-p

The -p flag tells perl to iterate over every line in the input file, run the specified commands (described later) on each line, and then print the result. It is equivalent to wrapping your perl program in while (<>) { ...program... } continue { print; }. There's a similar -n flag that does the same but omits the continue { print; } block, so you'd use that if you wanted to do your own printing.

-i

The -i flag tells perl that the input file is to be edited in place and output should go back into that file. This is important to actually modify the file. Omitting this flag will write the output to STDOUT, which you can then redirect to a new file.

Note that you cannot omit -i and redirect STDOUT to the input file as this will clobber the input file before it has been read. This is just how the shell works and has nothing to do with perl. The -i flag works around this intelligently.

Perl allows you to combine multiple single-character flags into one argument, which is why we can use -pi instead of -p -i. (The shell passes the argument through unchanged; it is perl that splits it.)

The -i flag takes an optional argument: a file extension to use if you want to keep a backup of the original file. If you used -i.bak, perl would preserve the original contents as filename.bak before making changes. In this example I've omitted creating a backup because I expect you'll be using version control anyway :)

-e

The -e flag tells perl that the next argument is a complete perl program encapsulated in a string. This is not always a good idea if you have a very long program as that can get unreadable, but with a single command program as we have here, its terseness can improve legibility.

Note that we cannot combine the -e flag with the -i flag, as both of them can take an argument and perl would treat the second flag letter as that argument. For example, if we used -ie <program> <filename>, perl would take e as the backup extension for -i. With no -e left, it would then treat <program> as the name of a script file to run, which fails because <program> is not really a file. The other way around (-ei) would also not work, as perl would take i as the program text instead of our substitution.

s/.../.../

This is perl's regex based substitution operator. It takes in four arguments. The first comes before the operator, and if not specified, uses the default of $_. The second and third are between the / symbols. The fourth is after the final / and is g in this case.

  • $_ In our code, the first argument is $_, which is the default variable in perl. As mentioned above, the -p flag wraps our program in while (<>), which creates a while loop that reads one line at a time (<>) from the input. It implicitly assigns this line to $_, and many built-in functions and operators fall back to $_ when no argument is given (e.g. just calling print; is the same as print $_;). So, in our code, the s/.../.../ operator operates once on each line of the input file.

[^[:ascii:]] The second argument is the pattern to search for in the input string. This pattern is a regular expression, so anything enclosed within [] is a bracket expression. This section is probably the most complex part of this example, so we will discuss it in detail at the end.

<empty string> The third argument is the replacement string, which in our case is the empty string since we want to remove all non-ascii characters.

g The fourth argument is a modifier flag for the substitution operator. The g flag specifies that the substitution should be global across all matches in the input. Without this flag, only the first match on each line will be replaced. Other possible flags are i for case-insensitive matches, s and m, which are only relevant for multi-line strings (we have single-line strings here), o, which tells perl to compile the pattern only once (this matters only when the pattern interpolates variables, so it would have no effect here), and x, which allows the pattern to include whitespace and comments to make it more readable (but we should not write our program on a single line if that is the case).

filename

This is the input file that contains non-ascii characters that we'd like to strip out.

[^[:ascii:]]

So now let's discuss [^[:ascii:]] in more detail.

As mentioned above, [] in a regular expression specifies a bracket expression, which tells the regex engine to match a single character in the input that matches any one of the characters in the set of characters inside the expression. So, for example, [abc] will match either an a, or a b or a c, and it will match only a single character. Using ^ as the first character inverts the match, so [^abc] will match any one character that is not an a, b, or c.

But what about [:ascii:] inside the bracket expression?

If you have a Unix-based system available, run man 7 re_format at the command line to read the man page. If not, read the online version. Perl's own description of these classes is in perldoc perlrecharclass.

[:ascii:] is a character class that represents the entire set of ascii characters, but this kind of a character class may only be used inside a bracket expression. The correct way to use this is [[:ascii:]] and it may be negated as with the abc case above or combined within a bracket expression with other characters, so, for example, [éç[:ascii:]] will match all ascii characters and also é and ç which are not ascii, and [^éç[:ascii:]] will match all characters that are not ascii and also not é or ç.
