Using regex to extract URLs from plain text with Perl
Problem description
How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:
my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
print $1."\n";
}
It fails horribly and gives me:
http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif
I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)
Recommended answer
URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.
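As a rough illustration of the find-then-filter approach, here is a minimal sketch. It assumes URI::Find is installed from CPAN; the @wanted array and the two filter patterns are illustrative choices based on the question (host homepage.com or any subdomain, path ending in .gif), not part of the module.

```perl
use strict;
use warnings;
use URI::Find;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

my @wanted;    # hypothetical name: collects the URLs that pass the filter

# The callback receives a URI object and the original matched text.
# Its return value replaces the match in the scanned string, so return
# the original text to leave the input unchanged.
my $finder = URI::Find->new(sub {
    my ($uri, $orig_text) = @_;

    # Not every scheme has a host (mailto:, for example), so check first.
    if (   $uri->can('host')
        && $uri->host =~ /(?:^|\.)homepage\.com\z/i    # domain or any subdomain
        && $uri->path =~ /\.gif\z/i)                   # required extension
    {
        push @wanted, $orig_text;
    }
    return $orig_text;
});

$finder->find(\$stuff);    # takes a reference to the text to scan

print "$_\n" for @wanted;
```

Filtering on the parsed host and path rather than on the raw string means that a look-alike such as shomepage.com is rejected by the anchored host pattern.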
UPDATE: Recently updated to handle Unicode.
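On the question of what goes wrong with the original pattern: a non-greedy quantifier does not produce the shortest possible match overall. The engine still takes the leftmost starting position that can succeed, here the first http://, and .*? then grows from that point as far as it has to, which with the /s flag includes spaces and other URLs. In addition, the unescaped dot and the leading .*? let shomepage.com pass as a match for homepage.com. The sketch below is one possible correction under the assumptions that URLs contain no whitespace and that subdomain labels consist of word characters and hyphens; it is not a general URL validator.

```perl
use strict;
use warnings;

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';

# (?:[\w-]+\.)*  allows zero or more complete subdomain labels, each
#                ending in a literal dot, so "shomepage.com" is rejected.
# homepage\.com  escapes the dot so it only matches a literal ".".
# \S*?           keeps the path from running across whitespace.
my @urls = $stuff =~ m{(http://(?:[\w-]+\.)*homepage\.com/\S*?\.gif)}gi;

print "$_\n" for @urls;
```

Restricting what each part may contain, instead of relying on .*? alone, is what stops a match from starting at one URL and ending inside another.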