Regular Expression in gVim to Remove Duplicate Domains from a List

Views: 34 | Published: 2021/9/25 20:11:48 | Tags: windows, regex, vim

Problem description

I need a regular expression for use in gVim that removes duplicate domains from a list of URLs (gVim can be downloaded here: http://www.vim.org/download.php).

I have a list of over 6,000,000 URLs in a .txt file (opened in gVim for editing).

The URLs look like this:

http://www.example.com/some-url.php
http://example2.com/another_url.html
http://example3.com/
http://www.example4.com/anotherURL.htm
http://www.example.com/some-url2.htm
http://example.com/some-url3.html
http://www.example2.com/somethingelse.php
http://example5.com

In other words, the URLs follow no specific format: some include www, some do not, and the paths all differ.

I need a regular expression for gVim that removes every line whose DOMAIN has already appeared in the list (the whole URL is removed with it), leaving only the first instance it finds.

So, given the example list above, the end result should look like this:

http://www.example.com/some-url.php
http://example2.com/another_url.html
http://example3.com/
http://www.example4.com/anotherURL.htm
http://example5.com

Here are two sites that explain how to use regular expressions in gVim:

http://supportweb.cs.bham.ac.uk/documentation/tutorials/docsystem/build/tutorials/gvim/gvim.html#Vi-Regular-Expressions

http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml

Recommended answer

If you want to do it with a regular expression, you can try adjusting the following:

%s!\v%(^http://%(www\.)?(%([^./]+\.)+[^./]+)%(/.*)?$\_.{-})@<=^http://%(www\.)?\1%(/.*)?\n!!g

However, it will be very slow on 6 million URLs, and it does not work for a reason I have not tracked down. Here is a better approach:

:let g:gotDomains={}
:%g/^/let curDomain=matchstr(getline('.'), '\v^http://%(www\.)?\zs[^/]+') | if !has_key(g:gotDomains, curDomain) | let g:gotDomains[curDomain]=1 | else | delete _ | endif

It does the following:

let g:gotDomains={} creates an empty dictionary that will hold every domain seen so far
%g/^/{command} executes {command} on every line
let curDomain=matchstr(...) extracts the domain name:

    getline('.') reads the text of the current line
    \v switches the pattern to "very magic" mode, which lets me omit most of the backslashes in the regex
    ^ anchors the match at the start of the string
    %(www\.)? skips an optional leading www. so that both spellings of a domain compare equal
    \zs starts the match here (everything before \zs is left out of the result)
    [^/]+ takes everything up to the next slash, which is the domain

if !has_key(g:gotDomains, curDomain) is true if the domain has not occurred before. In that case the domain is recorded in the dictionary; otherwise delete _ removes the line, sending it to the black hole register so that no other register is overwritten.
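
To check the logic outside the editor, the same first-occurrence-wins idea can be sketched in Python. This is only an illustration, not part of the original answer: the names DOMAIN_RE, extract_domain and dedupe_by_domain are made up for this example, and the regex is a rough translation of the Vim pattern into Python's re syntax (a capture group stands in for \zs).

```python
import re

# Rough Python counterpart of the Vim pattern '\v^http://%(www\.)?\zs[^/]+':
# skip the scheme and an optional "www.", then capture up to the next "/".
DOMAIN_RE = re.compile(r"^http://(?:www\.)?([^/]+)")


def extract_domain(url):
    """Return the domain of url, or '' if the line does not match
    (matchstr() in Vim also returns an empty string in that case)."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else ""


def dedupe_by_domain(urls):
    """Keep only the first URL seen for each domain, preserving order."""
    seen = set()  # plays the role of g:gotDomains
    kept = []
    for url in urls:
        domain = extract_domain(url)
        if domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept


# The example list from the question.
sample = [
    "http://www.example.com/some-url.php",
    "http://example2.com/another_url.html",
    "http://example3.com/",
    "http://www.example4.com/anotherURL.htm",
    "http://www.example.com/some-url2.htm",
    "http://example.com/some-url3.html",
    "http://www.example2.com/somethingelse.php",
    "http://example5.com",
]

for url in dedupe_by_domain(sample):
    print(url)
```

When reading lines from a file, strip the trailing newline first (for instance with line.rstrip("\n")); otherwise a URL without a path would carry the newline into its domain and fail to match the same domain written with a path.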
