查找所有不是 HTML 属性或超链接标签内容的 URL [英] Find All URL That Is Not An HTML Attribute or Content of A Hyperlink Tag

查看:30
本文介绍了查找所有不是 HTML 属性或超链接标签内容的 URL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试找出一个匹配所有不是元素属性或超链接内容的 URL 的正则表达式.

I'm trying to figure out a regex that matches all URL that are not an attribute of an element or is a content of a hyperlink.

应该匹配:

 1. This is a url http://www.google.com

不应该匹配:

 1. <a href="http://www.google.com">Google</a>
 2. <a href="http://www.google.com">http://www.google.com</a> 
 3. <img src="http://www.google.com/image.jpg">
 4. <div data-url="http://www.google.com"></div>

我目前正在使用这个正则表达式来匹配所有 URL,我想我知道我必须检测什么,但我就是不知道使用正则表达式.

I'm currently using this regex to match all URL and I think I know what I have to detect, but I just can't figure out using regex.

\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

<小时>

已编辑

我正在努力实现的目标如下.我想转换这个字符串.

What I'm trying to achieve is the following. I want to convert this string.

This is a url http://www.google.com <a href="http://www.google.com" title="Go to Google">Google</a><a href="http://www.google.com">http://www.google.com</a><img src="http://www.google.com/image.jpg"><div data-url="http://www.google.com"></div>

This is a url <a href="http://www.google.com">http://www.google.com</a> <a href="http://www.google.com" title="Go to Google">Google</a><a href="http://www.google.com">http://www.google.com</a><img src="http://www.google.com/image.jpg"><div data-url="http://www.google.com"></div>

通过删除标签然后将它们放回去进行预处理并不能解决问题,因为实际上最终会删除现有超链接元素的所有数据属性.当在 href 之外的其他属性中使用其他 URL 时,它也不能解决问题.

Preprocessing by removing tags and then put them back doesn't solve the problem since actually ends up removing all data attributes of the existing hyperlink elements. It also doesn't solve the issue when there are other URL using in other attributes beside href.

到目前为止,我还没有找到任何人建议的解决方案,而且到目前为止我还没有找到使用 HTML 解析器执行此操作的方法.使用正则表达式实际上似乎更可行.

So far, I haven't found a solution suggested by anyone and so far I also haven't found a way to do this using HTML parser. It's actually seem more doable using regex.

编辑 2

根据 Dean 的建议进行尝试后,我准备排除 HTML 解析器无法实现这一点,因为它无法在不使其成为有效 HTML 文档的情况下处理字符串.这是基于建议示例的代码 + 处理排除情况 2 的修复程序.

After the attempt based on Dean's suggestion, I'm about ready to rule out HTML parser from being able to achieve this for it inability to process string without making it a valid HTML document. Here's the code based on the suggested example + the fix to handle exclusion case 2.

    Document doc = Jsoup.parseBodyFragment(htmlText);

    final List<TextNode> nodesToChange = new ArrayList<TextNode>();

    NodeTraversor nd  = new NodeTraversor(new NodeVisitor() {

        @Override
        public void tail(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                Node parent = node.parent();
                if(parent.nodeName().equals("a")){
                    return;
                }

                String text = textNode.getWholeText();

                List<String> allMatches = new ArrayList<String>();
                Matcher m = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")
                        .matcher(text);
                while (m.find()) {
                    allMatches.add(m.group());
                }

                if(allMatches.size() > 0){
                    nodesToChange.add(textNode);
                }
            }
        }

        @Override
        public void head(Node node, int depth) {        
        }
    });

    nd.traverse(doc.body());

此代码将 HTML、HEAD 和 BODY 标记添加到结果中.我能想到的唯一解决这个问题的方法是检查字符串中是否存在 HTML、HEAD 和 BODY 标签.如果没有,请在处理后将它们条带化.

This code adds HTML, HEAD and BODY tag to the result. The only hack I can think of around this issue is to check whether HTML, HEAD and BODY tags exist in the string. If not, stripe them out after processing.

我希望其他人有比这个黑客更好的建议.就处理时间而言,使用 JSOUP 已经非常昂贵,所以我真的不想增加更多的开销.

I hope someone else has a better suggestion than this hack. Using JSOUP is already very expensive in terms of processing time so I really don't want to add more overhead if I don't have to.

推荐答案

根据 Dean 的建议和上述示例,这里是问题的解决方案".请记住,它的成本非常高,主要是由于解析 HTML 字符串(在四核/16GB RAM MBPr 上约为 160 毫秒).此解决方案还处理有效和无效的 HTML.请记住,围绕 JSOUP 的限制有一些小技巧,以确保不包含额外的标签以使最终结果成为有效的 HTML.我真的希望有人能想出更好的解决方案,但现在就在这里.

Based on suggestion by Dean and the mentioned example, here's the "solution" to the problem. Keep in mind that it's a very expensive one due mainly to the parsing of HTML string (~160ms on quad-core/16GB RAM MBPr). This solution also handles both valid and invalid HTML. Keep in mind there is a little hack around the limitation of JSOUP to make sure extra tags are not included to make the end result a valid HTML. I really hope someone can come up with a better solution, but here it is for now.

public static String makeHTML(String htmlText){
    boolean isValidDoc = false;
    if((htmlText.contains("<html") || htmlText.contains("<HTML")) && 
            (htmlText.contains("<head") || htmlText.contains("<HEAD")) &&
            (htmlText.contains("<body") || htmlText.contains("<BODY"))){
        isValidDoc = true;
    }

    Document doc = Jsoup.parseBodyFragment(htmlText);
    final String urlRegex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

    final List<TextNode> nodesToChange = new ArrayList<>();
    final List<String> changedContent = new ArrayList<>();

    NodeTraversor nd  = new NodeTraversor(new NodeVisitor() {

        @Override
        public void tail(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                Node parent = node.parent();
                if(parent.nodeName().equals("a")){
                    return;
                }

                String text = textNode.getWholeText();

                List<String> allMatches = new ArrayList<String>();
                Matcher m = Pattern.compile(urlRegex)
                        .matcher(text);
                while (m.find()) {
                    allMatches.add(m.group());
                }

                if(allMatches.size() > 0){
                    String result = text;
                    for(String match : allMatches){
                        result = result.replace(match, "<a href=\"" + match + "\">" + match + "</a>");
                    }
                    changedContent.add(result);
                    nodesToChange.add(textNode);
                }
            }
        }

        @Override
        public void head(Node node, int depth) {        
        }
    });

    nd.traverse(doc.body());

    int count = 0;
    for (TextNode textNode : nodesToChange) {
        String result = changedContent.get(count++);
        Node newNode = new DataNode(result, textNode.baseUri());
        textNode.replaceWith(newNode);
    }

    String processed = doc.toString();
    if(!isValidDoc){
        int start = processed.indexOf("<body>") + 6;
        int end = processed.lastIndexOf("</body>");
        processed = processed.substring(start, end);
    }

    return processed;
}

这篇关于查找所有不是 HTML 属性或超链接标签内容的 URL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆