Mark text in HTML

Problem description

I have some plain text and html. I need to create a PHP method that will return the same html, but with <span class="marked"> before any instances of the text and </span> after it.

Note that it should support tags in the HTML: for example, if the text is blabla, it should also be marked when it appears as bla<b>bla</b> or <a href="http://abc.com">bla</a>bla.

It should be case-insensitive and should also support long text (spanning multiple lines, etc.).

For example, if I call this function with the text "my name is josh" and the following html:

<html>
<head>
    <title>My Name Is Josh!!!</title>
</head>
<body>
    <h1>my name is <b>josh</b></h1>
    <div>
        <a href="http://www.names.com">my name</a> is josh
    </div>

    <u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span>.
</body>
</html>

... it should return:

<html>
<head>
    <title><span class="marked">My Name Is Josh</span>!!!</title>
</head>
<body>
    <h1><span class="marked">my name is <b>josh</b></span></h1>
    <div>
        <span class="marked"><a href="http://www.names.com">my name</a> is josh</span>
    </div>

    <span class="marked"><u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span></span>.
</body>
</html>

Thanks.

Solution

This is going to be tricky.

Whilst you could do it with simple regex hacking, ignoring anything inside a tag, something like the naïve:

preg_replace(
    '/My(<[^>]*>)*\s+(<[^>]*>)*name(<[^>]*>)*\s+(<[^>]*>)*is(<[^>]*>)*\s+(<[^>]*>)*Josh/i',
    '<span class="marked">$0</span>', $html
);

that's not at all reliable. Partly because HTML can't be parsed with regex: it's valid to put > in an attribute value, and other non-element constructs like comments will be mis-parsed. Even with a more rigorous expression to match tags, something horribly unwieldy like <[^>\s]*(\s+([^>\s]+(\s*=\s*([^"'\s>][^\s>]*|"[^"]*"|'[^']*')\s*))?)*\s*\/?>, you'd still have many of the same problems, especially if the input HTML is not guaranteed valid.
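To make the attribute-value problem concrete, here is a minimal JavaScript sketch. The sample markup and the name `naiveTag` are made up for illustration; the pattern is the simplistic "from `<` to the next `>`" idea discussed above.

```javascript
// Naive "tag" pattern: everything from '<' up to the first '>'.
const naiveTag = /<[^>]*>/g;

// Hypothetical input: a valid tag whose attribute value contains '>'.
const html = '<a title="a > b" href="#">link</a>';

// The pattern stops at the '>' inside the attribute value, so the rest of
// the opening tag is treated as if it were text content.
const tags = html.match(naiveTag);
const textOnly = html.replace(naiveTag, '');

console.log(tags);      // the first "tag" is cut off inside the attribute
console.log(textOnly);  // attribute debris leaks into the "text"
```

Anything that then searches `textOnly` for the target phrase is searching attribute debris as well as real text.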

This could even be a security issue: if the HTML you are processing is untrusted, it could fool your parser into turning text content into attributes, resulting in script injection.

But even ignoring that, you wouldn't be able to ensure proper element nesting. So you might turn:

<em>My name is <strong>Josh</strong>!!!</em>

into the misnested and invalid:

<span class="marked"><em>My name is <strong>Josh</strong></span>!!!</em>

or:

My
<table><tr><td>name is</td></tr></table>
Josh

where those elements can't be wrapped with a span. If you're unlucky, the browser fixups to ‘correct’ your invalid output could end up leaving half the page ‘marked’, or messing up the page layout.
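The misnesting is easy to reproduce. The sketch below is a JavaScript port of the naive pattern, assuming tags are matched with the simplistic `<[^>]*>`. It is run against a variation of the `<em>`/`<strong>` case, and `isWellNested` is a toy checker written only for this demonstration (it ignores void elements, comments and so on).

```javascript
// Naive pattern: the words of the phrase, with any number of "tags" allowed
// between them. Assumption: a tag is anything matching <[^>]*>.
const tag = '(?:<[^>]*>)*';
const naive = new RegExp(
    'My' + tag + '\\s+' + tag + 'name' + tag + '\\s+' + tag +
    'is' + tag + '\\s+' + tag + 'Josh',
    'gi'
);

// Toy nesting check: every closing tag must match the most recent open tag.
function isWellNested(html) {
    const stack = [];
    for (const m of html.matchAll(/<(\/?)([a-z][a-z0-9]*)[^>]*>/gi)) {
        const name = m[2].toLowerCase();
        if (m[1] === '') stack.push(name);
        else if (stack.pop() !== name) return false;
    }
    return stack.length === 0;
}

const input = '<em>My name is <strong>Josh</strong>!!!</em>';
const output = input.replace(naive, '<span class="marked">$&</span>');

console.log(output);
console.log(isWellNested(input), isWellNested(output));
```

Here the span opens inside `<em>` but closes inside `<strong>`, so the result is misnested even though the input was fine.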

So you would have to do this on a parsed-DOM level rather than with string hacking. You could parse the whole string using PHP, process it and re-serialise it, but if it's acceptable from an accessibility point of view, it would probably be easier to do it at the browser end in JavaScript, where the content is already parsed into DOM nodes.

It's still going to be pretty hard. A related question handles the case where the text is all inside a single text node, but that's a much simpler case.

What you would effectively have to do would be:

for each Element that may contain a <span>:
    for each child node in the element:
       generate the text content of this node and all following siblings
       match the target string/regex against the whole text
       if there is no match:
           break the outer loop - on to the next element.
       if the current node is an element node and the index of the match is not 0:
           break the inner loop - on to the next sibling node
       if the current node is a text node and the index of the match is > the length of the Text node data:
           break the inner loop - on to the next sibling node
       // now we have to find the position of the end of the match
       n is the length of the match string
       iterate through the remaining text node data and sibling text content:
           compare the length of the text content with n
           less?:
               subtract length from n and continue
           same?:
               we've got a match on a node boundary
               split the first text node if necessary
               insert a new span into the document
               move all the nodes from the first text node to this boundary inside the span
               break to outer loop, next element
           greater?:
               we've got a match ending inside the node.
               is the node a text node?:
                   then we can split the text node
                   also split the first text node if necessary
                   insert a new span into the document
                   move all contained nodes inside the span
                   break to outer loop, next element
               no, an element?:
                   oh dear! We can't insert a span here

Ouch.

Here's an alternative suggestion which is slightly less nasty, if it's acceptable to wrap every text node that is part of a match separately. So:

<p>Oh, my</p> name <div><div>is</div></div> Josh

would leave you with the output:

<p>Oh, <span class="marked">my</span></p>
<span class="marked"> name </span>
<div><div><span class="marked">is</span></div></div>
<span class="marked"> Josh</span>

which might look OK, depending on how you're styling the matches. It would also solve the misnesting problem of matches partially inside elements.
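The idea can be modelled without a DOM at all. In the sketch below each string stands in for the data of one text node, in document order. `markPieces` is a hypothetical helper written for illustration; it works on raw strings and does no HTML escaping. It finds matches in the joined text and wraps whichever part of each string falls inside a match.

```javascript
// Model of "wrap every text node that is part of a match separately".
// texts: array of strings, one per text node. regexp must have the g flag
// (matchAll throws a TypeError otherwise).
function markPieces(texts, regexp) {
    const joined = texts.join('');

    // Offset of each string within the joined text.
    const starts = [];
    let pos = 0;
    for (const t of texts) { starts.push(pos); pos += t.length; }

    // marked[i] collects [from, to) ranges, local to texts[i], to wrap.
    const marked = texts.map(() => []);
    for (const m of joined.matchAll(regexp)) {
        const mStart = m.index, mEnd = m.index + m[0].length;
        texts.forEach((t, i) => {
            const from = Math.max(mStart, starts[i]) - starts[i];
            const to = Math.min(mEnd, starts[i] + t.length) - starts[i];
            if (from < to) marked[i].push([from, to]);
        });
    }

    // Rebuild each string with its ranges wrapped in a span.
    return texts.map((t, i) => {
        let out = '', last = 0;
        for (const [from, to] of marked[i]) {
            out += t.slice(last, from) +
                '<span class="marked">' + t.slice(from, to) + '</span>';
            last = to;
        }
        return out + t.slice(last);
    });
}

const pieces = markPieces(
    ['Oh, my', ' name ', 'is', ' Josh'],
    /my\s+name\s+is\s+josh/gi
);
console.log(pieces);
```

Each string is wrapped on its own, so no span ever crosses an element boundary, which is why this variant cannot produce misnested markup.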

ETA: Oh sod the pseudocode, I've more-or-less written the code now anyway, might as well finish it. Here's a JavaScript version of the latter approach:

markTextInElement(document.body, /My\s+name\s+is\s+Josh/gi);


function markTextInElement(element, regexp) {
    var nodes= [];
    collectTextNodes(nodes, element);
    var datas= nodes.map(function(node) { return node.data; });
    var text= datas.join('');

    // Get list of [startnodei, startindex, endnodei, endindex] matches
    //
    var matches= [], match;
    while (match= regexp.exec(text)) {
        var p0= getPositionInStrings(datas, match.index, false);
        var p1= getPositionInStrings(datas, match.index+match[0].length, true);
        matches.push([p0[0], p0[1], p1[0], p1[1]]);
    }

    // Get list of nodes for each match, split at the edges of the
    // text. Reverse-iterate to avoid the splitting changing nodes we
    // have yet to process.
    //
    for (var i= matches.length; i-->0;) {
        var ni0= matches[i][0], ix0= matches[i][1], ni1= matches[i][2], ix1= matches[i][3];
        var mnodes= nodes.slice(ni0, ni1+1);
        if (ix1<nodes[ni1].length)
            nodes[ni1].splitText(ix1);
        if (ix0>0)
            mnodes[0]= nodes[ni0].splitText(ix0);

        // Replace each text node in the sublist with a wrapped version
        //
        mnodes.forEach(function(node) {
            var span= document.createElement('span');
            span.className= 'marked';
            node.parentNode.replaceChild(span, node);
            span.appendChild(node);
        });
    }
}

function collectTextNodes(texts, element) {
    // Elements whose direct text children must not be wrapped in a <span>
    var textok= [
        'applet', 'col', 'colgroup', 'dl', 'iframe', 'map', 'object', 'ol',
        'optgroup', 'option', 'script', 'select', 'style', 'table',
        'tbody', 'textarea', 'tfoot', 'thead', 'tr', 'ul'
    ].indexOf(element.tagName.toLowerCase())===-1;
    for (var i= 0; i<element.childNodes.length; i++) {
        var child= element.childNodes[i];
        // nodeType 3 is a text node, nodeType 1 is an element
        if (child.nodeType===3 && textok)
            texts.push(child);
        if (child.nodeType===1)
            collectTextNodes(texts, child);
    }
}

function getPositionInStrings(strs, index, toend) {
    var ix= 0;
    for (var i= 0; i<strs.length; i++) {
        var n= index-ix, l= strs[i].length;
        if (toend? l>=n : l>n)
            return [i, n];
        ix+= l;
    }
    return [i, 0];
}


// We've used a few ECMAScript Fifth Edition Array features.
// Make them work in browsers that don't support them natively.
//
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, i /*opt*/) {
        if (i===undefined) i= 0;
        if (i<0) i+= this.length;
        if (i<0) i= 0;
        for (var n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}
if (!('map' in Array.prototype)) {
    Array.prototype.map= function(mapper, that /*opt*/) {
        var other= new Array(this.length);
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                other[i]= mapper.call(that, this[i], i, this);
        return other;
    };
}
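
One caveat about the `while (match= regexp.exec(text))` loop in markTextInElement: it relies on the regexp having the `g` flag. Without it, `exec` restarts from index 0 on every call, so the loop never ends once there is a match. A pattern that can match the empty string causes the same problem. The sketch below shows one possible guard; `toGlobal` and `findAll` are hypothetical helper names, not part of the code above.

```javascript
// Return a copy of the pattern that has the g flag set.
function toGlobal(regexp) {
    const flags = regexp.flags.includes('g') ? regexp.flags : regexp.flags + 'g';
    return new RegExp(regexp.source, flags);
}

// Collect [start, end) offsets of every non-empty match in text.
function findAll(text, regexp) {
    const re = toGlobal(regexp);
    const found = [];
    let match;
    while ((match = re.exec(text)) !== null) {
        if (match[0].length === 0) {
            // An empty match does not advance lastIndex; step forward manually.
            re.lastIndex++;
            continue;
        }
        found.push([match.index, match.index + match[0].length]);
    }
    return found;
}

// Note the missing g flag on the pattern passed in.
const found = findAll('my name, My  Name', /my\s+name/i);
console.log(found);
```

Working on a copy also avoids mutating the caller's `lastIndex`.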
