Regex: Extracting readable (non-code) text and URLs from HTML documents

Question


I am creating an application that takes a URL as input, retrieves the page's HTML from the web, and extracts everything that isn't contained in a tag: in other words, the textual content of the page as a visitor sees it. That includes masking out everything enclosed in <script></script>, <style></style> and <!-- -->, since those portions contain text that is not inside a tag but is best left alone.

I have constructed this regex:

(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)

It correctly selects all the content that I want to ignore and leaves only the page's text content. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
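
As a quick check of what that pattern consumes, here is a minimal Python sketch. Python is used only because the pattern as written uses the `(?P<name>...)` named-group syntax, which Python's `re` module accepts unchanged; it is not the VB.Net code from the question. The sample HTML string and the variable names are made up for illustration.

```python
import re

# The "ignore" pattern from the question, split across lines for readability.
IGNORE = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # whole script/style blocks
    r"|(?:<!--[\s\S]*?-->)"                           # HTML comments
    r"|(?:<[\s\S]*?>)",                               # any other tag
    re.IGNORECASE,
)

# Hypothetical sample page, for illustration only.
SAMPLE = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><!-- note --><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)

# Everything the pattern matches is markup, script, style or comment.
ignored = [m.group(0) for m in IGNORE.finditer(SAMPLE)]
for chunk in ignored:
    print(repr(chunk))

# Deleting the matches leaves only the visible text, but loses the markup.
leftover = IGNORE.sub("", SAMPLE)
print(repr(leftover))
```

The visible text ("Hello " and "world") never appears in `ignored`, which is exactly the problem described: the wanted text is what falls between the matches.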

Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?

So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
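
The workaround described above can be sketched in Python as follows. This assumes the extra alternative is written as `(?P<text>[^<>]*)` and appended after the ignore pattern; the helper name `extract_text_runs` and the sample string are hypothetical.

```python
import re

# Ignore pattern plus a final "text" alternative for anything without < or >.
PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)

def extract_text_runs(html):
    """Return the non-empty plain-text runs found outside markup."""
    runs = []
    for m in PATTERN.finditer(html):
        text = m.group("text")
        # group("text") is None for markup matches and "" for empty matches.
        if text:
            runs.append(text)
    return runs

# Hypothetical sample: the <i> inside the script must not be treated as text.
SAMPLE = "<p>Hello <b>world</b></p><script>var x = '<i>';</script>"
print(extract_text_runs(SAMPLE))
```

Because the markup alternatives come first in the alternation, the `text` group only gets a chance at positions where no tag, comment, script or style block starts.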

This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes, script variables and so on. That is why I can't just do a "replace with nothing" on all the matches I get: even though I am then left with the text I need, it's a hassle to reinsert it into the correct places of the fully functional document.

I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but would rather not use them).

Any suggestions?

Update: Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).

Solution

OK, so here's how I'm doing it:

Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):

(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)

Then in VB.Net (where named groups are written (?<name>...) and named backreferences \k<name>):

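' Requires Imports System.IO and Imports System.Text.RegularExpressions at the top of the file.
' Alternatives are tried left to right: script/style blocks, comments, any other tag, then plain text.
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)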

The actual replacing of text happens here:

Private Function MatchEvalFunction(ByVal match As Match) As String
    ' The "text" group succeeds only when the match is a plain-text run.
    Dim textGroup As Group = match.Groups("text")
    If textGroup.Success AndAlso textGroup.Length > 0 Then
        ' The whole match is the text run, so replace within it directly.
        Return textGroup.Value.Replace("Original word", "Replacement word")
    End If
    ' Tags, comments, scripts and styles are returned unchanged.
    Return match.Value
End Function
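
For readers who want to try the same idea outside .NET, here is a rough Python equivalent of the evaluator approach. It is a sketch, not the answer's code: `replace_in_text`, the sample string and the words being swapped are all made up, and Python's `re.sub` with a callback stands in for `Regex.Replace` with a `MatchEvaluator`.

```python
import re

PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)

def replace_in_text(html, old, new):
    """Replace `old` with `new` in visible text only, leaving markup intact."""
    def evaluator(m):
        text = m.group("text")
        if text:
            # Plain-text run: apply the replacement.
            return text.replace(old, new)
        # Markup, script, style, comment or empty match: return it unchanged.
        return m.group(0)
    return PATTERN.sub(evaluator, html)

# Hypothetical sample: "cat" appears in an attribute, in a script and in text.
SAMPLE = '<a href="/cat">cat</a><script>var cat = 1;</script> my cat'
RESULT = replace_in_text(SAMPLE, "cat", "dog")
print(RESULT)
```

Only the two visible occurrences change; the `href` value and the script variable keep their original spelling.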

Voilà. newHtml now contains an exact copy of the original, except that every occurrence of "Original word" in the page text (as presented in a browser) is replaced with "Replacement word", while all HTML and script code is left untouched. Of course, one would normally plug in a more elaborate replacement routine, but this shows the basic principle. It is 12 lines of code, including the function declaration and loading the HTML. I'd be very interested in seeing a parallel solution done with the DOM for comparison. Yes, I know this approach can be thrown off by certain nested-tag quirks (in SCRIPT rewriting, for example), but the damage from that will be very limited, if any (see some of the comments above), and in general this does the job pretty well.
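
One concrete way the pattern can be thrown off, sketched in Python under the same assumptions as the pattern above: a literal `>` inside a quoted attribute value ends the generic tag match early, so the rest of the tag is reported as text. The sample string and the helper name are hypothetical, and this is only one illustrative case, not a full list of limitations.

```python
import re

PATTERN = re.compile(
    r"(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>))"
    r"|(?P<text>[^<>]*)",
    re.IGNORECASE,
)

def extract_text_runs(html):
    """Return the non-empty runs captured by the "text" group."""
    return [m.group("text") for m in PATTERN.finditer(html) if m.group("text")]

# Hypothetical tag whose alt attribute contains a literal ">".
TRICKY = '<img alt="a > b" src="x.png">word'
runs = extract_text_runs(TRICKY)
print(runs)  # the tail of the tag shows up alongside the real text
```

A replacement routine would therefore also touch the tail of that tag, which is the kind of limited damage the answer accepts as a trade-off.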
