JavaScript: find URLs in a document

Question

How do I find URLs (i.e. www.domain.com) within a document and put them inside anchors, like <a href="www.domain.com">www.domain.com</a>?

html:

Hey dude, check out this link www.google.com and www.yahoo.com!

javascript:

(function(){var text = document.body.innerHTML;/*do replace regex => text*/})();

output:

Hey dude, check out this link <a href="www.google.com">www.google.com</a> and <a href="www.yahoo.com">www.yahoo.com</a>!

Solution

Firstly, www.domain.com isn't a URL, it's a hostname, and

<a href="www.domain.com">

won't work — it'll look for a .com file called www.domain relative to the current page.
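
To see why the bare hostname fails as an href, one can resolve it the way a browser resolves any relative reference. This is a minimal sketch using the standard URL constructor; the page address http://example.com/articles/page.html is a made-up placeholder, not something from the question.

```javascript
// Hypothetical address of the page containing the link (an assumption for illustration).
var pageUrl = 'http://example.com/articles/page.html';

// A bare hostname has no scheme, so it is resolved as a relative path.
var resolvedBare = new URL('www.domain.com', pageUrl).href;

// With an explicit scheme the reference is absolute and the base is ignored.
var resolvedAbsolute = new URL('http://www.domain.com', pageUrl).href;

console.log(resolvedBare);      // http://example.com/articles/www.domain.com
console.log(resolvedAbsolute);  // http://www.domain.com/
```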

It's not possible to highlight hostnames in the general case because almost anything can be a hostname. You could try to highlight ‘www.something.dot.separated.words’, but it's not really that reliable and there are many sites that don't use the www. hostname prefix. I'd try to avoid that.

/\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/;

This is a very liberal pattern you could use as a starting point for detecting HTTP URLs. Depending on what sort of input you've got, you may want to narrow down what it allows, and it may be worth detecting trailing characters like . or ! that would be valid parts of the URL but in practice generally aren't.
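
One possible way to handle those trailing characters is to match liberally and then trim afterwards. The sketch below assumes that a trailing run of . , ! ? ; or : is never meant as part of the link; that character set and the helper name findUrls are illustrative choices, not part of the answer.

```javascript
// The liberal pattern from the answer, with the global flag so exec() can iterate.
var urlpattern = /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;

// Hypothetical helper: return each URL found in `text` with trailing
// punctuation removed, together with the index where the match starts.
function findUrls(text) {
    var results = [];
    var match;
    urlpattern.lastIndex = 0; // start from the beginning on every call
    while ((match = urlpattern.exec(text)) !== null) {
        // Assumption: these characters at the very end belong to the sentence, not the URL.
        var url = match[0].replace(/[.,!?;:]+$/, '');
        results.push({ index: match.index, url: url });
    }
    return results;
}

var found = findUrls('See http://example.com/a.html, and https://example.org!');
console.log(found);
```

A stricter or looser trailing set may suit other input. For instance, closing parentheses are sometimes sentence punctuation and sometimes a real part of the path, so they are left alone here.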

(You could use a | to allow either the URL syntax or the www.hostname syntax, if you like.)
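
As a rough illustration of that alternation, the pattern below accepts either prefix. Because a www. match carries no scheme, the href needs one added; the helper name hrefFor and the choice of http:// as the default scheme are assumptions made for this sketch.

```javascript
// Sketch: accept either an http(s) URL or a bare www. hostname.
var linkpattern = /\b(?:https?:\/\/|www\.)[^\s<>"`{}|\^\[\]\\]+/g;

// Hypothetical helper: a www. match has no scheme, so add one for the href.
// Defaulting to http:// is a guess; a given site may only answer on https.
function hrefFor(matchText) {
    return /^https?:\/\//.test(matchText) ? matchText : 'http://' + matchText;
}

var sample = 'Hey dude, check out this link www.google.com and http://www.yahoo.com/news';
var hrefs = (sample.match(linkpattern) || []).map(hrefFor);
console.log(hrefs);
```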

Anyhow, once you've settled on your preferred pattern you'll need to find that pattern in text nodes on the page. Don't run the regexp over innerHTML markup. You'll end up completely ruining the page by trying to mark up every href="http://something" that's already inside markup. You'll also destroy any existing JavaScript references, events or form field values when you replace the innerHTML content.
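
The damage is easy to reproduce on a plain string. In this sketch the markup variable is a made-up stand-in for document.body.innerHTML, and the replacement is the naive approach being warned against:

```javascript
var urlpattern = /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;

// Made-up markup standing in for document.body.innerHTML.
var markup = '<p>Docs: <a href="http://example.com/docs">here</a></p>';

// Blind replacement over markup: the URL inside the href attribute is matched too.
var broken = markup.replace(urlpattern, function (url) {
    return '<a href="' + url + '">' + url + '</a>';
});

console.log(broken);
// <p>Docs: <a href="<a href="http://example.com/docs">http://example.com/docs</a>">here</a></p>
```

The existing link's attribute value now contains a second anchor tag, which is no longer valid markup.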

In general, regexps simply cannot process HTML in any reliable way. So take advantage of the fact that the browser has already parsed the HTML into elements and text nodes, and just look at the text nodes. You'll also want to avoid looking inside <a> elements, since marking up a URL as a link when it's already in a link is silly (and invalid).

// Mark up `http://...` text in an element and its descendants as links.
//
function addLinks(element) {
    var urlpattern= /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;
    findTextExceptInLinks(element, urlpattern, function(node, match) {
        node.splitText(match.index+match[0].length);
        var a= document.createElement('a');
        a.href= match[0];
        a.appendChild(node.splitText(match.index));
        node.parentNode.insertBefore(a, node.nextSibling);
    });
}

// Find text in descendants of an element, in reverse document order.
// pattern must be a regexp with the global flag.
//
function findTextExceptInLinks(element, pattern, callback) {
    for (var childi= element.childNodes.length; childi-->0;) {
        var child= element.childNodes[childi];
        if (child.nodeType===Node.ELEMENT_NODE) {
            if (child.tagName.toLowerCase()!=='a')
                findTextExceptInLinks(child, pattern, callback);
        } else if (child.nodeType===Node.TEXT_NODE) {
            var matches= [];
            var match;
            while (match= pattern.exec(child.data))
                matches.push(match);
            for (var i= matches.length; i-->0;)
                callback.call(window, child, matches[i]);
        }
    }
}
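
The reason for collecting all matches first and then handling them last to first is that each split only changes text after the split point, so the indexes of earlier matches stay valid. A string-only analogue may make this easier to follow. The function segmentText below is a hypothetical illustration that does not touch the DOM, with the first array element playing the role of the original text node:

```javascript
// Hypothetical string-only analogue of the text-node splitting done by addLinks.
function segmentText(text, pattern) {
    // Without the global flag exec() would return the same match forever.
    if (!pattern.global) throw new Error('pattern must have the global flag');

    var matches = [];
    var match;
    pattern.lastIndex = 0;
    while ((match = pattern.exec(text)) !== null)
        matches.push(match);

    var segments = [text]; // segments[0] stands in for the original text node
    for (var i = matches.length; i-- > 0;) {
        var m = matches[i];
        var head = segments[0];
        var end = m.index + m[0].length;
        // Replace the head with: text before the match, the match, text after it.
        // Only content at or after m.index moves, so earlier indexes still apply.
        segments.splice(0, 1,
            head.slice(0, m.index),
            { link: head.slice(m.index, end) },
            head.slice(end));
    }
    return segments;
}

var segments = segmentText('a http://x.com b http://y.com c',
                           /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g);
console.log(JSON.stringify(segments));
```

Processing the matches front to back would also work, but every index after the first split would then need adjusting for the text already removed.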
