Select HTML text element with regex?
Problem description
I want to look for © in an HTML document, and basically get the entity the copyright is attributed to.
The copyright line shows up a couple of different ways:
<p class="bg-copy">© 2011 The New York Times Company</p>
or
<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">
© 2011</a>
<a href="http://www.nytco.com/">The New York Times Company</a>
or
<br>Published since 1996<br>Copyright © CounterPunch<br>
All rights reserved.<br>
I want to ignore the dates and intervening tags and just get "The New York Times Company" or "CounterPunch".
I haven't been able to find much on using regex with JavaScript or jQuery, though I get the impression that it can lead to major headaches. If there is a better approach to this, let me know.
Recommended answer
For a robust solution, you will probably need a combination of DOM navigation and some heuristics. Your examples are solvable with regex, but there are so many more scenarios possible...
©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)
works for your three samples. But ONLY for them and similar cases.
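To check that claim, here is a small sketch that runs the pattern against the three samples from the question. The helper name `extractHolder` and the `samples` array are made up for illustration; only the pattern itself comes from the answer.

```javascript
// Pattern taken verbatim from the answer.
const copyrightPattern = /©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/;

// The three samples from the question, as plain strings.
const samples = [
  '<p class="bg-copy">© 2011 The New York Times Company</p>',
  '<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">\n' +
    '© 2011</a>\n' +
    '<a href="http://www.nytco.com/">The New York Times Company</a>',
  '<br>Published since 1996<br>Copyright © CounterPunch<br>\n' +
    'All rights reserved.<br>',
];

// Hypothetical helper: returns the captured text (trimmed),
// or null when the pattern does not match at all.
function extractHolder(html) {
  const match = html.match(copyrightPattern);
  return match ? match[1].trim() : null;
}

const holders = samples.map(extractHolder);
console.log(holders);
// → ['The New York Times Company', 'The New York Times Company', 'CounterPunch']
```

The limits show up quickly: for input such as `© 2011 <b>Acme</b>`, where an opening tag follows the year directly, the capture group comes back empty.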
See it on Rubular.
Explanation:
©                  // copyright symbol
[\s\d]*            // followed by any whitespace or digits
(?:<\/.+?>[^>]*>)? // optionally followed by a closing tag and another opening one
([^<]*)            // then capture anything up to the next tag
See this answer on how to use regex in JavaScript with jQuery. Basically, you can use the string match(/regex/) function:
var result = string.match(/©[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/);
// result is null when nothing matches; otherwise result[1] holds the captured name
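For the more robust route mentioned at the start of this answer (DOM navigation plus heuristics), the following is only a rough sketch, not the answerer's method. It assumes the page's text nodes have already been collected in document order. The function name `holderFromTextNodes` and its heuristics (skip years after the symbol, otherwise fall through to the next non-blank text node) are illustrative assumptions.

```javascript
// Hypothetical helper, not part of the original answer.
// `texts` is assumed to be the page's text-node values in document order.
// In a browser they could be gathered like this (not run here):
//   const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
//   const texts = [];
//   while (walker.nextNode()) texts.push(walker.currentNode.nodeValue);
function holderFromTextNodes(texts) {
  for (let i = 0; i < texts.length; i++) {
    const at = texts[i].indexOf('©');
    if (at === -1) continue;

    // Heuristic 1: drop the symbol plus any years, commas, dashes
    // and whitespace that follow it in the same text node.
    const rest = texts[i].slice(at + 1).replace(/^[\s\d,-]+/, '').trim();
    if (rest) return rest;

    // Heuristic 2: nothing left in this node, so the name probably
    // lives in the next non-blank text node (e.g. a sibling link).
    for (let j = i + 1; j < texts.length; j++) {
      const next = texts[j].trim();
      if (next) return next;
    }
  }
  return null; // no usable copyright line found
}

// Text nodes corresponding to the three samples from the question.
console.log(holderFromTextNodes(['© 2011 The New York Times Company']));
console.log(holderFromTextNodes(['\n© 2011', '\n', 'The New York Times Company']));
console.log(holderFromTextNodes([
  'Published since 1996',
  'Copyright © CounterPunch',
  '\nAll rights reserved.',
]));
```

Working on text nodes rather than raw markup means tag names and attributes never reach the heuristics, though real pages will still need extra rules (for example "Copyright (c)" spellings).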