Regular Expression over multiple lines

Question

I've been stuck on this for several hours now and have cycled through a wealth of different tools to get the job done, without success. It would be fantastic if someone could help me out with this.

Here is the problem:

I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this:

This is a long abstract describing something. What follows is the tile for this sentence."   
,Title1  
This is another sentence that is running on one line. On the next line you can find the title.   
,Title2

As you can probably see, the titles ",Title1" and ",Title2" should actually be on the same line as the preceding sentence. It would then look something like this:

This is a long abstract describing something. What follows is the tile for this sentence.",Title1  
This is another sentence that is running on one line. On the next line you can find the title.,Title2

Please note that the end of a sentence may or may not contain a quote. In the end, the quotes should be replaced too.

Here is what I have come up with so far:

sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv

This should get the job done by matching the expression over multiple lines. Unfortunately, it doesn't :)

The expression looks for the dot at the end of the sentence, the optional quote, and a newline character, which I'm trying to match with .*.

Any help is much appreciated. It doesn't really matter which tool gets the job done (awk, perl, sed, tr, etc.).

Recommended answer

Multiline processing in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects. For example, when you use 'N' to append the next line to the pattern space, the current line is separated from the next line by a '\n'.
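
To see that side effect directly, here is a small sketch that is not part of the original answer. It feeds two made-up lines to sed, appends the second with 'N', and dumps the pattern space with the 'l' command, which shows the embedded newline as a literal \n and marks the end of the pattern space with $. The variable name pattern_space is only for illustration.

```shell
# Load two lines into the pattern space with N, then dump it with l.
# -n suppresses the automatic print so only the l output appears.
pattern_space=$(printf 'first\nsecond\n' | sed -n 'N;l')
printf '%s\n' "$pattern_space"   # prints: first\nsecond$
```

The \n between the two lines is what a regex such as /\n,/ can match once 'N' has run.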

Anyway, it's much easier to match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:

sed 'N;/\n,/s/"\? *\n//;P;D' title_csv

Input

$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line

Output

$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
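
For readers unfamiliar with the individual commands, the one-liner can be unpacked roughly as follows. This is a sketch rather than part of the original answer: the function name join_title_lines and the sample lines are made up for illustration, and it assumes GNU sed, since \? is a GNU extension to basic regular expressions and the script relies on GNU sed printing the pattern space when 'N' finds no further input.

```shell
# join_title_lines is a hypothetical helper name wrapping the same sed script.
# It reads the files given as arguments, or standard input if there are none.
join_title_lines() {
  sed '
    # Append the next input line to the pattern space, separated by "\n".
    N
    # If the appended line starts with a comma, delete the optional quote,
    # any trailing spaces, and the newline in front of the comma.
    /\n,/ s/"\? *\n//
    # Print the pattern space up to the first newline
    # (the whole pattern space if the lines were just joined).
    P
    # Delete what was printed; if text remains, rerun the script on it
    # without reading a new line first.
    D
  ' "$@"
}

# Made-up sample input covering both the quoted and the unquoted case.
joined=$(printf '%s\n' 'keep this line' 'First abstract."' ',Title1' \
  'Second abstract.' ',Title2' | join_title_lines)
printf '%s\n' "$joined"
```

With this sample, the first line passes through unchanged and the two title lines are pulled up onto the abstracts before them. Because the script only ever holds two lines in the pattern space, it should also cope with a large file without loading it into memory at once.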
