How can I extract URLs from plain text with Perl?

Question

I need a Perl regex to parse plain text input and convert all links to valid HTML HREF links. I've tried 10 different versions I found on the web, but none of them seem to work correctly. I also tested other solutions posted on StackOverflow, none of which seem to work. The correct solution should be able to find any URL in the plain text input and convert it to:

<a href="$1">$1</a>

Some cases that the other regular expressions I tried did not handle correctly:

  1. URLs at the end of a line, followed by a carriage return
  2. URLs that contain question marks
  3. URLs that start with "https"
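
The three cases above can be turned into a small check. The following is only a minimal sketch in core Perl, assuming a deliberately naive pattern (scheme, "://", then any run of non-whitespace); the name naive_linkify and the sample strings are made up for illustration and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap every http(s) URL in an anchor tag.
# The pattern is deliberately naive: scheme, "://", then non-whitespace.
sub naive_linkify {
    my ($text) = @_;
    $text =~ s{(https?://\S+)}{<a href="$1">$1</a>}g;
    return $text;
}

# Illustrative inputs, one per case listed in the question.
my @cases = (
    "at end of line http://example.com/page\r\n",   # followed by a carriage return
    "with a query http://example.org/?test=one",    # contains a question mark
    "secure https://www.example.com here",          # starts with https
);

print naive_linkify($_), "\n" for @cases;
```

Because \S stops at "\r" as well as "\n", the carriage return stays outside the link. This sketch does no HTML escaping and will swallow trailing punctuation, so it is a starting point for testing rather than a complete answer.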

I'm hoping that another Perl guy out there will already have a regular expression they are using for this that they can share. Thanks in advance for your help!

Recommended answer

When I tried URI::Find::Schemeless with the following text:

Here is a URL  and one bare URL with 
https: https://www.example.com and another with a query
http://example.org/?test=one&another=2 and another with parentheses
http://example.org/(9.3)

Another one that appears in quotation marks "http://www.example.net/s=1;q=5"
etc. A link to an ftp site: ftp://user@example.org/test/me
How about one without a protocol www.example.com?

It messed up http://example.org/(9.3). So, with the help of Regexp::Common, I came up with the following:

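#!/usr/bin/perl

use strict; use warnings;
use CGI 'escapeHTML';
use Regexp::Common qw/URI/;
use URI::Find::Schemeless;

# Heuristic pattern for URLs written without a scheme (e.g. www.example.com)
my $heuristic = URI::Find::Schemeless->schemeless_uri_re;

# Match http/https URLs, ftp URLs, or schemeless candidates
my $pattern = qr{
    $RE{URI}{HTTP}{-scheme=>'https?'} |
    $RE{URI}{FTP} |
    $heuristic
}x;

# Paragraph mode: read blank-line-separated chunks
local $/ = '';

# The sample text is expected after the __DATA__ marker at the end of the script
while ( my $par = <DATA> ) {
    chomp $par;
    $par =~ s/</&lt;/g;                           # neutralize stray tags
    $par =~ s/( $pattern ) / linkify($1) /gex;    # replace each URL with a link
    print "<p>$par</p>\n";
}

# Build the anchor tag, adding http:// when the match has no scheme
sub linkify {
    my ($str) = @_;
    $str = "http://$str" unless $str =~ /^[fh]t(?:p|tp)/;
    $str = escapeHTML($str);
    sprintf q|<a href="%s">%s</a>|, ($str) x 2;
}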

This worked for the input shown. Of course, life is never that easy, as you can see by trying (http://example.org/(9.3)).
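
One possible way to approach a URL that is itself wrapped in parentheses is to match greedily and then move unbalanced closing parentheses and sentence punctuation back out of the match. The sketch below is an assumption of mine, not the answer's method: it uses only core Perl, the names trim_url and linkify_balanced are hypothetical, and the simplified pattern covers http and https only (no ftp or schemeless URLs, and no HTML escaping).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a raw match into (url, trailing text),
# moving sentence punctuation and any unbalanced ")" out of the URL.
sub trim_url {
    my ($raw) = @_;
    my $tail = '';
    while (length $raw) {
        my $last  = substr $raw, -1;
        my $strip = $last =~ /[.,;:!?"']/ ? 1 : 0;
        if ($last eq ')') {
            # Count parentheses; a surplus ")" is treated as surrounding text.
            my $open  = () = $raw =~ /\(/g;
            my $close = () = $raw =~ /\)/g;
            $strip = 1 if $close > $open;
        }
        last unless $strip;
        chop $raw;
        $tail = $last . $tail;
    }
    return ($raw, $tail);
}

# Hypothetical helper: link every http(s) URL, keeping trimmed text outside.
sub linkify_balanced {
    my ($text) = @_;
    $text =~ s{(https?://[^\s<>]+)}{
        my ($url, $tail) = trim_url($1);
        sprintf '<a href="%s">%s</a>%s', $url, $url, $tail;
    }ge;
    return $text;
}

print linkify_balanced("(http://example.org/(9.3))"), "\n";
```

Counting parentheses is a heuristic: it keeps the ")" in http://example.org/(9.3) while dropping the outer one, but it can still guess wrong for URLs whose parentheses are unbalanced on purpose.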
