How to remove non-ASCII characters and append a space in the field where the non-ASCII characters were, using a Perl one-liner?
Hi Stack Overflow community,
I have the following problem.
I have a file called bad with the following contents:
SPAM EATER  PO BOX 5555     FAKE STREET
FOO BAR     ìPO BOX 1234    LOLLERCOASTER VILLAGE
LOL MAN     PO BOX 9876     NEXT DOOR
I want to remove the non-ASCII character from it (at the start of the second column of the second record) so that the file is free of strange characters and all its columns are aligned. There is also a requirement to do this with a Perl one-liner, so awk, sed, and similar commands cannot be used. I tried the following, but the third column ended up one space short:
$ perl -plne 's/[^[:ascii:]]//g' bad > bad.clean
$ cat bad.clean
SPAM EATER  PO BOX 5555     FAKE STREET
FOO BAR     PO BOX 1234    LOLLERCOASTER VILLAGE
LOL MAN     PO BOX 9876     NEXT DOOR
I also tried using the same one-liner, but this time replacing the non-ascii character by a space. In this case, the record ended up with two extra spaces in the second column, and one extra space in the third:
$ perl -plne 's/[^[:ascii:]]/ /g' bad > bad.clean.space
$ cat bad.clean.space
SPAM EATER  PO BOX 5555     FAKE STREET
FOO BAR       PO BOX 1234    LOLLERCOASTER VILLAGE
LOL MAN     PO BOX 9876     NEXT DOOR
Somehow, the non-ascii character seems to be taking 2 bytes instead of one - Is this correct, or am I missing something?
The expected output is this:
SPAM EATER  PO BOX 5555     FAKE STREET
FOO BAR     PO BOX 1234     LOLLERCOASTER VILLAGE
LOL MAN     PO BOX 9876     NEXT DOOR
Is there a way, using a Perl one-liner, to get the expected result? I was thinking of adding one space after removing the non-ASCII character, in the field where the change was made, but I can't find a way to do it. In addition, the non-ASCII character can appear in any field, not only the second one.
By the way, some info that might be useful: this is an AIX machine running Perl v5.8.8.
Thank you!
Edit:
As @ThisSuitIsBlackNot mentions, there are two non-ascii characters. Therefore, I guess I just want to add one space to the end of that field, if at least one non-ascii character gets removed by the command. Is there a way to get this extra space included in the same sentence, so it can be done as a one-liner as well?
Edit:
After reviewing a large set of data, I can tell that the non-ASCII characters always appear in pairs, and that the next field in the original file (before running the one-liner) is always one space to the right compared to the other columns. So I'm changing the title of this question to match the requirement: Perl one-liner to remove non-ASCII characters and append a space in the field where the non-ASCII characters were
Take out the two non-ASCII characters and add one space after the field.
This uses the non-ASCII pair and a run of three spaces as the delimiters.
# s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g
[^[:ascii:]]{2}
( .*? [ ]{3} )
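Perl test case
use strict;
use warnings;
# Slurp all of DATA at once. The ì below must be stored as two bytes (e.g. UTF-8).
local $/ = undef;
my $str = <DATA>;
# Remove each non-ASCII pair, then add one space after the field's trailing three spaces.
$str =~ s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g;
print $str;
__DATA__
SPAM EATER  PO BOX 5555     FAKE STREET
FOO BAR     ìPO BOX 1234    LOLLERCOASTER VILLAGE
LOL MAN     PO BOX 9876     NEXT DOOR
Output >>
SPAM EATER  PO BOX 5555     FAKE STREET
FOO BAR     PO BOX 1234     LOLLERCOASTER VILLAGE
LOL MAN     PO BOX 9876     NEXT DOOR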