Extracting URLs from a text document using Java + Regular Expressions

Views: 227 Published: 2018/12/5 10:22:32 Tags: java regex url

Question

I'm trying to create a regular expression to extract URLs from text documents using Java, but so far I've been unsuccessful. The two cases I'm looking to capture are listed below:

URLs that start with http://
URLs that start with www. (missing the protocol from the front)

Query string parameters should be matched as well.

Thanks! I really wish I knew regular expressions better.

Cheers,

Recommended answer

If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression mentioned by DVK earlier. I modified it slightly and wrote a small code snippet as a starting point for you:

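import java.util.*;
import java.util.regex.*;

class FindUrls
{
    // Compiled once and reused: Pattern instances are immutable and thread-safe.
    // CASE_INSENSITIVE lets uppercase hosts and escapes such as %2F match too.
    private static final Pattern URL_PATTERN = Pattern.compile(
        // scheme (http, https, ftp), a leading "~/" or "/", or a literal "www."
        "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" +
        // optional user:password@, then a host ending in a known TLD
        "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
        "|mil|biz|info|mobi|name|aero|jobs|museum" +
        // or a two-letter country code, then an optional port
        "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
        // optional path segments
        "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
        // optional query string: first parameter ...
        "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
        "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
        // ... followed by any number of &-separated parameters
        "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
        "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
        // optional fragment
        "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns every substring of input that matches URL_PATTERN, in order.
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<>();

        Matcher matcher = URL_PATTERN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }

        return result;
    }
}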
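
To see the find-and-collect loop in action, here is a minimal, self-contained sketch. It deliberately uses a much simpler stand-in pattern (my assumption, not the expression above): a scheme or a literal "www." followed by non-whitespace characters, with trailing punctuation trimmed. The class name SimpleUrlDemo, the method extract, and the sample text are hypothetical. Unlike the longer expression, this one does not check the top-level domain, so it will also accept words such as "www.foo".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleUrlDemo {
    // Simplified stand-in pattern (an assumption, not the answer's expression):
    // "http://", "https://", "ftp://" or a literal "www.", then any run of
    // characters that are not whitespace, angle brackets or double quotes,
    // ending on a character that is not common trailing punctuation.
    private static final Pattern SIMPLE_URL = Pattern.compile(
        "\\b(?:(?:https?|ftp)://|www\\.)[^\\s<>\"]*[^\\s<>\"'.,;:!?)\\]]",
        Pattern.CASE_INSENSITIVE);

    // Collects every match in the order it appears in the text.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = SIMPLE_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://example.com/path?x=1&y=2, or visit www.example.org.";
        for (String url : extract(text)) {
            System.out.println(url);
        }
    }
}
```

With the sample text, the loop yields http://example.com/path?x=1&y=2 (query string included, trailing comma dropped) and www.example.org (trailing period dropped). If the matches will be opened later, the www. ones still need a scheme prepended, since the pattern returns them exactly as written.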
