How do I find and remove duplicate lines from a file using Regular Expressions?


Question

This question is meant to be language-agnostic. Using only regular expressions, can I find and replace duplicate lines in a file?

Please consider the following example input and the output that I want:

Input >>

11
22
22  <-duplicate
33
44
44  <-duplicate
55

Output >>

11
22
33
44
55

Recommended answer

Regular-expressions.info has a page on Deleting Duplicate Lines From a File.

This basically boils down to searching for this one-liner:

^(.*)(\r?\n\1)+$

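...and replacing it with \1.
Note: the dot must not match line breaks, and ^ and $ must be allowed to match at the start and end of every line (often called multiline mode), not only at the start and end of the file.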
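
As a minimal sketch of the search-and-replace above, here is how it might look in Python, assuming the standard re module. The function name remove_adjacent_duplicates is hypothetical and chosen only for illustration. The re.MULTILINE flag is what lets ^ and $ match at each line, and Python's dot does not match \n unless re.DOTALL is set.

```python
import re

# The pattern from the answer. MULTILINE makes ^ and $ match at every line,
# and without DOTALL the dot stops at "\n".
DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def remove_adjacent_duplicates(text: str) -> str:
    """Collapse each run of identical consecutive lines into a single line."""
    return DUPLICATE_LINES.sub(r"\1", text)


# The example input from the question, without the "<-duplicate" annotations.
sample = "11\n22\n22\n33\n44\n44\n55"
result = remove_adjacent_duplicates(sample)
print(result)
```

Only duplicates that directly follow each other are removed this way; a repeated line that appears later in the file, with other lines in between, is left alone.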

Explanation:

The caret matches only at the start of a line, so the regex engine attempts to match the remainder of the regex only there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses store the matched line into the first backreference.

Next we match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we have matched a line and the line break that follows it.
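
To illustrate the \r?\n part, the following sketch runs the same pattern over a UNIX-style and a Windows-style string, again assuming Python's re module. One Python-specific detail worth noting as an assumption about this engine only: its dot also matches \r, so on Windows text the captured line may include the trailing \r. The replacement still yields the expected text.

```python
import re

# Same pattern as in the answer; \r? makes the carriage return optional.
DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)

unix_text = "22\n22\n33"          # lines separated by \n
windows_text = "22\r\n22\r\n33"   # lines separated by \r\n

unix_result = DUPLICATE_LINES.sub(r"\1", unix_text)
windows_result = DUPLICATE_LINES.sub(r"\1", windows_text)

# repr() makes the line-break characters visible.
print(repr(unix_result))
print(repr(windows_result))
```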

Now we need to check whether this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference, which holds the line we matched. The backreference matches that very same text.

If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression tries to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check whether the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now use the dollar sign to check whether it is also followed by a line break or sits at the end of the file.
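
The role of the dollar sign can be seen by comparing the pattern with and without it. This is a small sketch in Python, assuming the re module; the variable names are illustrative only. The input contains the line 22 followed by 223, which is not a duplicate but starts with the same text.

```python
import re

WITH_DOLLAR = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)
WITHOUT_DOLLAR = re.compile(r"^(.*)(\r?\n\1)+", re.MULTILINE)

# "223" merely begins with "22"; it is a different line.
text = "22\n223\n33"

# With $, the backreference must be followed by a line break or the end of
# the text, so nothing matches and the text is returned unchanged.
kept = WITH_DOLLAR.sub(r"\1", text)

# Without $, the backreference matches the first two characters of "223",
# and the replacement wrongly removes the line "22".
damaged = WITHOUT_DOLLAR.sub(r"\1", text)

print(repr(kept))
print(repr(damaged))
```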

The entire match becomes line\nline (or line\nline\nline, etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks between them are all deleted from the file. Since we want to keep the original line but not the duplicates, we use \1 as the replacement text to put the original line back in.
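
To make the difference between the entire match and the first backreference concrete, this sketch prints both for a run of three identical lines. It assumes Python's re module, where group(0) is the entire match and group(1) is the first capturing group.

```python
import re

DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)

# Three consecutive copies of "a", then "b", then a non-adjacent "a".
text = "a\na\na\nb\na"

matches = list(DUPLICATE_LINES.finditer(text))
for m in matches:
    # group(0): the line, its duplicates, and the line breaks between them.
    # group(1): just the original line, which is used as the replacement.
    print(repr(m.group(0)), "->", repr(m.group(1)))

cleaned = DUPLICATE_LINES.sub(r"\1", text)
print(repr(cleaned))
```

The final "a" stays in place because it does not directly follow another "a".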
