How to remove non-ASCII characters and append a space in the field where the non-ASCII characters were, using a Perl one-liner?


Problem description

Hi Stack Overflow community,

I have the following problem.

I got this file called bad, with the following contents:

SPAM EATER       PO BOX 5555          FAKE STREET
FOO BAR          ìPO BOX 1234         LOLLERCOASTER VILLAGE
LOL MAN          PO BOX 9876          NEXT DOOR

I want to remove the non-ascii character from it (at the start of the second column of the second record), in order to get a file free of strange characters and with all its columns aligned. Plus, there's a requirement to achieve this using a Perl one-liner, so no awk, sed, or similar commands can be used. I tried the following, but the third column ended up one space short:

$ perl -plne 's/[^[:ascii:]]//g' bad > bad.clean

$ cat bad.clean
SPAM EATER       PO BOX 5555          FAKE STREET
FOO BAR          PO BOX 1234         LOLLERCOASTER VILLAGE
LOL MAN          PO BOX 9876          NEXT DOOR

I also tried the same one-liner, but this time replacing the non-ascii character with a space. In this case, the record ended up with two extra spaces in the second column, and one extra space in the third:

$ perl -plne 's/[^[:ascii:]]/ /g' bad > bad.clean.space

$ cat bad.clean.space
SPAM EATER       PO BOX 5555          FAKE STREET
FOO BAR            PO BOX 1234         LOLLERCOASTER VILLAGE
LOL MAN          PO BOX 9876          NEXT DOOR

Somehow, the non-ascii character seems to be taking 2 bytes instead of one - Is this correct, or am I missing something?

The expected output is this:

SPAM EATER       PO BOX 5555          FAKE STREET
FOO BAR          PO BOX 1234          LOLLERCOASTER VILLAGE
LOL MAN          PO BOX 9876          NEXT DOOR

Is there a way, using a Perl one-liner, to get the results as expected? I was thinking of a way to add one space after removing the non-ascii character, in the field in which the change has been made, but I can't find the way to do it. In addition, the non-ascii character can appear on any field, not only in the second one.

By the way, some info that might be useful: This is an AIX machine, running Perl v5.8.8.

Thank you!


Edit:

As @ThisSuitIsBlackNot mentions, there are two non-ascii characters. Therefore, I guess I just want to add one space to the end of that field, if at least one non-ascii character gets removed by the command. Is there a way to get this extra space included in the same sentence, so it can be done as a one-liner as well?


Edit:

After reviewing a large set of data, I can tell that the non-ascii characters always appear in pairs, and the next field in the original file (before running the one-liner) is always one space to the right compared to the other columns. So, I'm changing the title of this question to match the requirement: Perl one-liner to remove non-ascii characters and append a space in the field where the non-ascii characters were

Solution

Take out the 2 non-ascii characters and add one space after the field.
The regex uses the non-ascii pair as the opening delimiter and a run of 3 spaces as the closing delimiter.

 #  s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g

 [^[:ascii:]]{2} 
 ( .*? [ ]{3} )

Perl test case

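use strict;
use warnings;
local $/;          # slurp mode: read all of DATA in one go
my $str = <DATA>;
# Drop each non-ASCII pair, then pad the field at its first 3-space run.
$str =~ s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g;
print $str;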

__DATA__
SPAM EATER       PO BOX 5555          FAKE STREET
FOO BAR          ìPO BOX 1234         LOLLERCOASTER VILLAGE
LOL MAN          PO BOX 9876          NEXT DOOR

Output >>

SPAM EATER       PO BOX 5555          FAKE STREET
FOO BAR          PO BOX 1234          LOLLERCOASTER VILLAGE
LOL MAN          PO BOX 9876          NEXT DOOR
