Splitting HTML Content Into Sentences, But Keeping Subtags Intact


Question

I'm using the code below to split all the text within a paragraph tag into sentences. It works, with a few exceptions. However, tags inside paragraphs get chewed up and spit out. Example:

<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>

So, how can I ignore tags, so that I can parse the sentences, wrap them in span tags, and keep inline tags such as <a> in place? Or is it smarter to walk the DOM and do it that way?

// Split text on page into clickable sentences
$('p').each(function() {
    var sentences = $(this)
        .text()
        .replace(/(((?![.!?]['"]?\s).)*[.!?]['"]?)(\s|$)/g, 
                 '<span class="sentence">$1</span>$3');
    $(this).html(sentences);
});

I am using this in a Chrome extension content script, which means the JavaScript is injected into any page it comes into contact with and parses the <p> tags on the fly. Therefore, it needs to be JavaScript.

Recommended answer

Soapbox

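We could craft a regex to match your specific case, but given that this is HTML parsing and your use case hints that any number of tags could be in there, you'd be best off using the DOM or a product like HTML Agility (free).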
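One way to act on this advice while staying in JavaScript is to read each paragraph's innerHTML from the DOM instead of its text, and to treat every tag as an indivisible unit while looking for sentence ends. The following is only a minimal sketch under that assumption, not a full DOM walk: the names SENTENCE and wrapSentences are made up for illustration, and the sentence rule is the same naive "punctuation followed by whitespace" rule used in the question.

```javascript
// Hypothetical pattern: a sentence is a lazy run of "atoms" (a whole tag,
// or one non-"<" character) ending in . ! or ? with an optional closing
// quote, followed by whitespace or the end of the input.
// Because a tag is consumed as one atom, punctuation inside attributes
// such as href="page.html" is never inspected.
const SENTENCE = /((?:<[^>]*>|[^<])+?[.!?]['"]?)(\s+|$)/g;

// Wrap each detected sentence in a span, leaving inner tags untouched.
function wrapSentences(innerHtml) {
  return innerHtml.replace(SENTENCE, '<span class="sentence">$1</span>$2');
}

// In a content script (browser only), apply it to every paragraph.
if (typeof document !== 'undefined') {
  document.querySelectorAll('p').forEach((p) => {
    p.innerHTML = wrapSentences(p.innerHTML);
  });
}

console.log(wrapSentences('This is a sample of a <a href="#">link</a> getting chewed up.'));
```

This sketch has clear limits: if a sentence ends inside an inline tag, the span and that tag will overlap incorrectly, and abbreviations such as "e.g." are treated as sentence ends. Reassigning innerHTML also rebuilds the paragraph's child elements, so a walk over text nodes is the more robust choice when those cases matter.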

If you're just looking to pull out the inner text and aren't interested in retaining any of the tag data, you could use this regex and replace every match with an empty string:

(<[^>]*>)
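In JavaScript, that replacement could look like the sketch below. The helper name stripTags is an assumption made for illustration; the pattern is the one above, written without the capture group because the match itself is discarded.

```javascript
// Hypothetical helper: delete every tag and keep only the text between them.
function stripTags(html) {
  // The g flag replaces all matches, not just the first one.
  return html.replace(/<[^>]*>/g, '');
}

const sample = '<p>This is a sample of a <a href="#">link</a> getting chewed up.</p>';
console.log(stripTags(sample));
// This is a sample of a link getting chewed up.
```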

If you do want to keep the tags, these variations match whole paragraphs instead:

  • ((?:<p(?:\s[^>]*)?>).*?</p>) - retains the paragraph tags and everything inside them, but nothing outside the paragraph

  • (?:<p(?:\s[^>]*)?>)(.*?)(?:</p>) - retains only the paragraph's inner text, including all subtags, and stores it in group 1

  • (<p(?:\s[^>]*)?>)(.*?)(</p>) - captures the opening and closing paragraph tags, plus the inner text including any subtags
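As a rough illustration of the group-1 variant in JavaScript, the sketch below collects the inner HTML of every paragraph in a string. The function name paragraphInnerHtml is hypothetical, and the / in </p> is escaped only because JavaScript regex literals require it.

```javascript
// Hypothetical helper: return the inner HTML of each <p>...</p> in a string.
function paragraphInnerHtml(html) {
  // g is required by matchAll; i makes <P> match as well as <p>.
  const pattern = /(?:<p(?:\s[^>]*)?>)(.*?)(?:<\/p>)/gi;
  // Each match's group 1 holds the paragraph content, subtags included.
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

const page = '<p class=x>One <a href="#">link</a>.</p><div>skip</div><p>Two.</p>';
console.log(paragraphInnerHtml(page));
```

Note that . does not match line breaks here, so a paragraph whose markup spans several lines would need the s flag added to the pattern.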

Granted, these are PowerShell examples, but the regex and the replace call should be similar in JavaScript.

$string = '<img> not this stuff either</img><p class=SuperCoolStuff>This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>'

Write-Host "replace p tags with a new span tag"
$string -replace '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', '<span class=sentence>$1</span>'

Write-Host
Write-Host "insert p tag's inner text into a span new span tag and return the entire thing including the p tags"
$string -replace '(<p(?:\s[^>]*)?>)(.*?)(</p>)', '$1<span class=sentence>$2</span>$3'

Yields

replace p tags with a new span tag
<img> not this stuff either</img><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span><a> other stuff</a>

insert p tag's inner text into a span new span tag and return the entire thing including the p tags
<img> not this stuff either</img><p class=SuperCoolStuff><span class=sentence>This is a sample of a <a href="#">link</a> getting chewed up.</span></p><a> other stuff</a>
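Since the question calls for JavaScript, here is a rough port of the two PowerShell replacements. The function names are assumptions made for illustration. PowerShell's -replace is case-insensitive and replaces every match, so the g and i flags are used to approximate that behavior.

```javascript
// Hypothetical port: replace each <p>...</p> with a span around its content.
function paragraphToSpan(html) {
  return html.replace(/(?:<p(?:\s[^>]*)?>)(.*?)(?:<\/p>)/gi,
    '<span class="sentence">$1</span>');
}

// Hypothetical port: keep the <p> tags and wrap only the inner content.
function wrapParagraphInner(html) {
  return html.replace(/(<p(?:\s[^>]*)?>)(.*?)(<\/p>)/gi,
    '$1<span class="sentence">$2</span>$3');
}

const input = '<img> not this stuff either</img><p class=SuperCoolStuff>' +
  'This is a sample of a <a href="#">link</a> getting chewed up.</p><a> other stuff</a>';

console.log(paragraphToSpan(input));
console.log(wrapParagraphInner(input));
```

Both functions wrap the whole paragraph in a single span, as the PowerShell version does; neither splits the paragraph into individual sentences.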
