Extract All Unique Lines

Question

I have text files with exact lines of text repeated, but I only want one copy of each. Imagine this text file:

AAAAA
AAAAA
AAAAA
BB
BBBBB
BBBBB
CCC
CCC
CCC

I only need the following four lines:

AAAAA
BB
BBBBB
CCC

I'm using a text editor that supports regex (EmEditor or Notepad++), not a programming language, so I need a purely regular-expression solution.

Any help?

I checked the other thread that hsz mentioned, and I'd like to make it clear that this one is not the same. Although both need to remove duplicate lines, the way to achieve it is different. I need pure regex, but the best answer in the other thread relies on a specific Notepad++ plug-in (which no longer even ships with it), so it is not a regex solution at all. The second answer there is a regex and does work in Notepad++, but it does not work at all in EmEditor, which I also need. So I don't think my question is a duplicate of that one, although the link is useful, and I thank hsz for it.

Recommended answer

Two nearly identical options:

Match all lines that are not repeated

(?sm)(^[^\r\n]+$)(?!.*^\1$)

This matches every line that has no identical line after it, i.e. the last copy of each line. Matching alone does not extract them, though; to do that, you really want to replace the other lines.
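
To see which lines the first pattern selects, here is a minimal sketch that runs it on the sample from the question. It uses Python's `re` module purely as a stand-in engine; Notepad++ and EmEditor use their own regex engines, so their behaviour may differ. The names `LAST_OCCURRENCE`, `SAMPLE` and `unique_lines` are made up for this illustration.

```python
import re

# Pattern copied from the answer: a non-empty line that has
# no identical line anywhere after it.
LAST_OCCURRENCE = re.compile(r"(?sm)(^[^\r\n]+$)(?!.*^\1$)")

# Sample text from the question (LF line endings assumed).
SAMPLE = "\n".join([
    "AAAAA", "AAAAA", "AAAAA",
    "BB",
    "BBBBB", "BBBBB",
    "CCC", "CCC", "CCC",
])


def unique_lines(text):
    """Return the last copy of each distinct non-empty line."""
    return [m.group(1) for m in LAST_OCCURRENCE.finditer(text)]


print(unique_lines(SAMPLE))
# ['AAAAA', 'BB', 'BBBBB', 'CCC']
```

Only the final copy of each line is matched, because every earlier copy fails the negative lookahead.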

Replace all repeated lines

This will work better in Notepad++:

Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)

Replace: empty string

  • (?s) activates DOTALL mode, allowing the dot to match across lines
  • (?m) turns on multi-line mode, allowing ^ and $ to match at the start and end of each line
  • (^[^\r\n]*) captures a line to Group 1, i.e.
  • The ^ anchor asserts that we are at the beginning of a line (multi-line mode is on)
  • [^\r\n]* matches any number of characters that are not newline characters
  • [\r\n] matches one newline character
  • The positive lookahead (?=.*^\1) asserts that we can match any number of characters .*, then...
  • ^\1 the text of Group 1 at the start of a later line. Without a trailing $, a line that is only a prefix of a later line (BB before BBBBB in the sample) also counts as a repeat; writing ^\1$ requires the whole line to match
  • In the first option, the negative lookahead (?!.*^\1$) asserts the opposite: no later line is identical to Group 1
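
The search-and-replace can be checked outside the editor with a short sketch. Python's `re` is again only a stand-in for the editors' own engines, and LF line endings are assumed. `AS_WRITTEN` is the search pattern from the answer; `WITH_DOLLAR` is a hypothetical variant, not part of the original answer, that ends the lookahead with `$` so that only whole-line repeats are removed.

```python
import re

# Search pattern exactly as given in the answer.
AS_WRITTEN = r"(?sm)(^[^\r\n]*)[\r\n](?=.*^\1)"

# Assumed variant: the lookahead ends with $ so Group 1 must equal
# a complete later line, not just the beginning of one.
WITH_DOLLAR = r"(?sm)(^[^\r\n]*)[\r\n](?=.*^\1$)"

# Sample text from the question (LF line endings assumed).
SAMPLE = "\n".join([
    "AAAAA", "AAAAA", "AAAAA",
    "BB",
    "BBBBB", "BBBBB",
    "CCC", "CCC", "CCC",
])


def remove_duplicates(text, exact=False):
    """Delete every line that is repeated further down in the text."""
    pattern = WITH_DOLLAR if exact else AS_WRITTEN
    return re.sub(pattern, "", text)


print(remove_duplicates(SAMPLE).split("\n"))
# ['AAAAA', 'BBBBB', 'CCC']
print(remove_duplicates(SAMPLE, exact=True).split("\n"))
# ['AAAAA', 'BB', 'BBBBB', 'CCC']
```

With the pattern as written, `BB` is dropped because the following `BBBBB` line starts with the same text; the `$` variant keeps all four lines the question asks for. Files with CRLF line endings may need `[\r\n]+` or `\r?\n` in place of `[\r\n]`, depending on the engine.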
