SED - Regular Expression over multiple lines

Question

I've been stuck on this for several hours now and have cycled through a wealth of different tools to get the job done, without success. It would be fantastic if someone could help me out with this.

Here is the problem:

I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this:

This is a long abstract describing something. What follows is the tile for this sentence."   
,Title1  
This is another sentence that is running on one line. On the next line you can find the title.   
,Title2

As you can probably see, the titles ",Title1" and ",Title2" should actually be on the same line as the preceding sentence. The result would then look something like this:

This is a long abstract describing something. What follows is the tile for this sentence.",Title1  
This is another sentence that is running on one line. On the next line you can find the title.,Title2

Please note that the sentence may or may not end with a quote. In the end, the quotes should be replaced too.

Here is what I've come up with so far:

sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv

This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)

The expression looks for the dot at the end of the sentence, followed by the optional quote and a newline character, which I'm trying to match with .*.

Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).

Thanks, Chris

Recommended answer

Multiline matching in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects, such as separating the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
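
To see that side effect directly, here is a small sketch (not part of the original answer) that appends one line to the pattern space with 'N' and then prints the pattern space unambiguously with 'l'. The two input lines and the variable name joined are made up for illustration, and the exact display format assumes GNU sed.

```shell
# Feed two illustrative lines to sed, join them with N, and show the
# pattern space with l, which displays the embedded newline as \n
# and marks the end of the pattern space with $.
joined=$(printf 'first\nsecond\n' | sed -n 'N;l')

printf '%s\n' "$joined"
```

The embedded '\n' shown by 'l' is what a multiline regular expression has to match.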

Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:

sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
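
For readers unfamiliar with N, P and D, the same one-liner can be spelled out one command per line with comments. This is only an annotated sketch: the temporary sample file, its contents and the variable names sample and result are hypothetical, and the '\?' syntax assumes GNU sed.

```shell
# Hypothetical sample file; the contents are illustrative only.
sample=$(mktemp)
printf '%s\n' \
  'keep this line' \
  'An abstract ending in a quote."   ' \
  ',Title1' \
  'An abstract without a quote.' \
  ',Title2' \
  'keep this line too' > "$sample"

result=$(sed '
# N: append the next input line to the pattern space, joined by \n
N
# Only when the appended line starts with a comma: remove an optional
# quote, any trailing spaces, and the newline itself
/\n,/s/"\? *\n//
# P: print the pattern space up to the first newline (or all of it)
P
# D: delete through the first newline; if text remains, restart the
# cycle with that text instead of reading a new line first
D
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because P and D only ever release the first line of the pattern space, sed holds at most two lines at a time, which matters for a file of several hundred megabytes.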

Input

$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line

Output

$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
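
Since the question says any tool would do, the same idea can also be sketched with awk. This is an alternative, not the answer's method: it holds one line back and glues a following line onto it when that line starts with a comma. The sample input and the names prev, have and result are illustrative assumptions.

```shell
# Illustrative input piped in with printf; a real run would pass the
# CSV file name to awk instead.
result=$(printf '%s\n' \
  'keep this line' \
  'An abstract ending in a quote."   ' \
  ',Title1' \
  'An abstract without a quote.' \
  ',Title2' \
  'keep this line too' |
awk '
  # A line starting with a comma belongs to the held line: strip an
  # optional trailing quote and spaces from it, then append this line
  /^,/ && have { sub(/"? *$/, "", prev); prev = prev $0; next }
  # Any other line: flush the held line first
  have { print prev }
  # Hold the current line until we have seen what follows it
  { prev = $0; have = 1 }
  # Flush the last held line at end of input
  END { if (have) print prev }
')

printf '%s\n' "$result"
```

Like the sed version, this streams the file and keeps only one pending line in memory.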
