JavaScript: find URLs in a document
How do I find URLs (i.e. www.domain.com) within a document and put them within anchors, like <a href="www.domain.com">www.domain.com</a>?
html:
Hey dude, check out this link www.google.com and www.yahoo.com!
javascript:
(function(){var text = document.body.innerHTML;/*do replace regex => text*/})();
output:
Hey dude, check out this link <a href="www.google.com">www.google.com</a> and <a href="www.yahoo.com">www.yahoo.com</a>!
Firstly, www.domain.com isn't a URL, it's a hostname, and <a href="www.domain.com"> won't work: it'll look for a .com file called www.domain relative to the current page.
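One way to see the relative-resolution problem is with the standard URL constructor, which resolves a reference against a base address much as a browser resolves an href. This is only a small sketch; the base address http://example.com/dir/page.html is an assumed stand-in for the current page.

```javascript
// Assumed address of the page that contains the link (illustrative only).
var base = 'http://example.com/dir/page.html';

// A bare hostname has no scheme, so it is resolved as a relative path.
var relative = new URL('www.domain.com', base).href;

// With an explicit scheme the reference is absolute and the base is ignored.
var absolute = new URL('http://www.domain.com', base).href;

console.log(relative); // http://example.com/dir/www.domain.com
console.log(absolute); // http://www.domain.com/
```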
It's not possible to highlight hostnames in the general case because almost anything can be a hostname. You could try to highlight ‘www.something.dot.separated.words’, but it's not really that reliable and there are many sites that don't use the www. hostname prefix. I'd try to avoid that.
/\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/;
This is a very liberal pattern you could use as a starting point for detecting HTTP URLs. Depending on what sort of input you've got, you may want to narrow down what it allows, and it may be worth detecting trailing characters like . or ! that would be valid parts of the URL but in practice generally aren't.
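A minimal sketch of the trailing-character idea, assuming it is acceptable to drop any run of sentence punctuation at the end of a match. The helper name trimTrailingPunctuation and the exact set of characters it strips are illustrative choices, not part of the pattern above.

```javascript
var urlpattern = /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;

// Hypothetical helper: drop sentence punctuation from the end of a match.
function trimTrailingPunctuation(url) {
    return url.replace(/[.,!?;:]+$/, '');
}

var text = 'Check out http://www.google.com/ and http://www.yahoo.com/news!';

// match() with a global pattern returns every match, or null if there are none.
var found = (text.match(urlpattern) || []).map(trimTrailingPunctuation);

console.log(found);
```

If something like this were combined with DOM splitting, the trimmed length rather than the raw match length would have to decide where the text node is split.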
(You could use a | to allow either the URL syntax or the www.hostname syntax, if you like.)
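As a rough sketch of that alternation, the pattern below accepts either an http(s):// prefix or a bare www. prefix. The helper toHref is hypothetical; it adds an assumed http:// scheme to bare www. matches so that the result is usable as an href.

```javascript
// Either an explicit scheme or a bare "www." prefix, followed by the same
// liberal character class as the original pattern.
var pattern = /\b(?:https?:\/\/|www\.)[^\s<>"`{}|\^\[\]\\]+/g;

// Hypothetical helper: bare www. matches need a scheme before use as an href.
function toHref(match) {
    return /^https?:\/\//.test(match) ? match : 'http://' + match;
}

var text = 'See www.google.com or https://www.yahoo.com/news today';
var hrefs = (text.match(pattern) || []).map(toHref);

console.log(hrefs);
```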
Anyhow, once you've settled on your preferred pattern you'll need to find that pattern in text nodes on the page. Don't run the regexp over innerHTML markup. You'll end up completely ruining the page by trying to mark up every href="http://something" that's already inside markup. You'll also destroy any existing JavaScript references, events or form field values when you replace the innerHTML content.
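The damage is easy to reproduce on a plain string, without a browser. In this sketch the html string is an assumed stand-in for what innerHTML might return for a page that already contains a link.

```javascript
var urlpattern = /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;

// Markup that already contains a link (illustrative input).
var html = '<p>Home: <a href="http://example.com/">example</a></p>';

// Naive approach: wrap every match found in the serialized HTML string.
var broken = html.replace(urlpattern, function (url) {
    return '<a href="' + url + '">' + url + '</a>';
});

// The existing href attribute now has a whole <a> element pasted into it.
console.log(broken);
```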
In general, regexp simply cannot process HTML in any reliable way. So take advantage of the fact that the browser has already parsed the HTML into elements and text nodes, and just look at the text nodes. You'll also want to avoid looking inside <a> elements, since marking up a URL as a link when it's already in a link is silly (and invalid).
// Mark up `http://...` text in an element and its descendants as links.
//
function addLinks(element) {
    var urlpattern= /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;
    findTextExceptInLinks(element, urlpattern, function(node, match) {
        // Split off the text after the match, then move the match into a new <a>
        node.splitText(match.index+match[0].length);
        var a= document.createElement('a');
        a.href= match[0];
        a.appendChild(node.splitText(match.index));
        node.parentNode.insertBefore(a, node.nextSibling);
    });
}
// Find text in descendants of an element, in reverse document order
// (so that splitting a node never shifts the positions still to be processed)
// pattern must be a regexp with global flag
//
function findTextExceptInLinks(element, pattern, callback) {
    for (var childi= element.childNodes.length; childi-->0;) {
        var child= element.childNodes[childi];
        if (child.nodeType===Node.ELEMENT_NODE) {
            // Skip existing links; recurse into every other element
            if (child.tagName.toLowerCase()!=='a')
                findTextExceptInLinks(child, pattern, callback);
        } else if (child.nodeType===Node.TEXT_NODE) {
            // Collect all matches first, then hand them over last-to-first
            var matches= [];
            var match;
            while (match= pattern.exec(child.data))
                matches.push(match);
            for (var i= matches.length; i-->0;)
                callback.call(window, child, matches[i]);
        }
    }
}
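To try out a pattern without a browser, the matching step can be prototyped on a plain string. The sketch below is not part of the code above: splitIntoSegments is a hypothetical helper that mirrors what the text-node splitting produces, returning alternating text and link segments.

```javascript
// Hypothetical helper: split a string into plain-text and URL segments.
// pattern must be a regexp with the global flag, as in the code above.
function splitIntoSegments(text, pattern) {
    var segments = [];
    var last = 0;
    var match;
    pattern.lastIndex = 0; // global regexps keep state between exec() calls
    while ((match = pattern.exec(text))) {
        if (match.index > last)
            segments.push({type: 'text', value: text.slice(last, match.index)});
        segments.push({type: 'link', value: match[0]});
        last = match.index + match[0].length;
    }
    if (last < text.length)
        segments.push({type: 'text', value: text.slice(last)});
    return segments;
}

var urlpattern = /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;
var segments = splitIntoSegments(
    'see http://a.example/x and http://b.example/y ok', urlpattern);

console.log(segments);
```

In the browser the real thing is simply a call such as addLinks(document.body), made once the document has loaded.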