Regular Expression in gVim to Remove Duplicate Domains from a List
Question
I need a regular expression for use in gVim that will remove duplicate domains from a list of URLs (gVim can be downloaded here: http://www.vim.org/download.php).
I have a list of over 6,000,000 URLs in a .txt file (which opens in gVim for editing).
The URLs are in the following format:
http://www.example.com/some-url.php
http://example2.com/another_url.html
http://example3.com/
http://www.example4.com/anotherURL.htm
http://www.example.com/some-url2.htm
http://example.com/some-url3.html
http://www.example2.com/somethingelse.php
http://example5.com
In other words, the URLs follow no specific format. Some have the www prefix and some don't, and the paths all look different.
I need a regular expression for gVim that removes every line whose domain has already appeared earlier in the list, leaving only the first URL found for each domain.
So it would take the example list posted above, and the end result should look like this:
http://www.example.com/some-url.php
http://example2.com/another_url.html
http://example3.com/
http://www.example4.com/anotherURL.htm
http://example5.com
Here are two nice sites that explain how to use regular expressions within gVim:
http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml
Recommended answer
If you want to do it with a regular expression, you can try adapting the following:
%s!\v%(^http://%(www\.)?(%([^./]+\.)+[^./]+)%(/.*)?$\_.{-})@<=^http://%(www\.)?\1%(/.*)?\n!!g
However, it will be very slow on 6 million URLs and, for a reason I have not tracked down, it does not work. Here is a better approach:
:let g:gotDomains={}
:%g/^/let curDomain=matchstr(getline('.'), '\v^http://%(www\.)?\zs[^/]+') | if !has_key(g:gotDomains, curDomain) | let g:gotDomains[curDomain]=1 | else | delete _ | endif
Here is what it does:
let g:gotDomains={}
    creates an empty dictionary in which we will hold all domains seen so far.
%g/^/{command}
    executes {command} on every line.
let curDomain=matchstr(...)
    gets the domain name:
    getline('.')
        takes the text of the current line;
    \v
        switches the pattern to "very magic" mode, which lets me omit a lot of backslashes in the regex;
    ^
        anchors the pattern at the start of the string;
    \zs
        starts the match here (everything before \zs must still be present, but it is left out of the returned text).
if !has_key(g:gotDomains, curDomain)
    is true if the domain has not occurred before. In that case the domain is recorded in the dictionary; otherwise the line is deleted.
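For readability, the same steps can also be spelled out as two small Vim script functions. This is only a sketch under stated assumptions: the names DomainOf and DedupDomains are hypothetical (they do not appear in the answer above), and, unlike the one-liner, this version leaves untouched any line that does not start with http:// instead of treating it as a domain.

```vim
" Hypothetical helper: return the domain of a URL without a leading "www.".
" Returns an empty string when the line does not start with http://.
function! DomainOf(url) abort
  return matchstr(a:url, '\v^http://%(www\.)?\zs[^/]+')
endfunction

" Hypothetical helper: given a list of lines, return a new list that keeps
" only the first line seen for each domain.
function! DedupDomains(lines) abort
  let seen = {}
  let kept = []
  for line in a:lines
    let domain = DomainOf(line)
    if domain ==# ''
      " Assumption: lines without a recognizable domain are kept as they are.
      call add(kept, line)
    elseif !has_key(seen, domain)
      let seen[domain] = 1
      call add(kept, line)
    endif
  endfor
  return kept
endfunction
```

One possible way to apply it to the current buffer is to read all lines, empty the buffer, and write back the filtered list:

```vim
:let kept = DedupDomains(getline(1, '$'))
:%delete _
:call setline(1, kept)
```

This holds the whole file in a list at once, so on a list of several million URLs the :g one-liner above may be the more memory-friendly choice.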