Select HTML text element with regex?


Question

I want to look for &copy; in an HTML document and get the entity the copyright is attributed to.

The copyright line shows up in a few different ways:

<p class="bg-copy">&copy; 2011  The New York Times Company</p>

<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">
&copy; 2011</a> 
<a href="http://www.nytco.com/">The New York Times Company</a>

<br>Published since 1996<br>Copyright &copy; CounterPunch<br>
All rights reserved.<br>

I want to ignore the dates and intervening tags and just get "The New York Times Company" or "CounterPunch".

I haven't been able to find much on using regex with JavaScript or jQuery, though I get the impression that it can lead to major headaches. If there is a better approach, let me know.

Recommended answer

For a robust solution, you will probably need a combination of DOM navigation and some heuristics. Your examples are solvable with regex, but many more scenarios are possible...

&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)

This works for your three samples, but ONLY for them and similar cases.

See it on Rubular.

Explanation:

&copy; // copyright symbol
[\s\d]* // followed by spaces or digits 
(?:</.+?>[^>]*>)? // maybe followed by a closing tag and another opening one
([^<]*) // then match anything up to the next tag

See this answer on how to use it in JavaScript with jQuery. Basically, you can use the match(/regex/) function:

var result = string.match(/&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/)
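To make the match() call above concrete, here is a minimal sketch that runs the same pattern over the three sample snippets from the question. The helper name extractCopyrightHolder and the "Acme Corp" snippet are hypothetical, added only for illustration. The sketch assumes the HTML is available as a raw string in which the symbol is still written as the &copy; entity; text read back from the DOM usually contains the literal © character instead, and the pattern would need adjusting for that.

```javascript
// Pattern from the answer; capture group 1 holds the text that follows the symbol.
const COPYRIGHT_RE = /&copy;[\s\d]*(?:<\/.+?>[^>]*>)?([^<]*)/;

// Hypothetical helper (not part of the original answer): returns the trimmed
// holder name, or null when the pattern fails or captures only whitespace.
function extractCopyrightHolder(html) {
  const match = html.match(COPYRIGHT_RE);
  if (!match) return null;
  const holder = match[1].trim();
  return holder === "" ? null : holder;
}

// The three snippets from the question, as raw HTML strings.
const samples = [
  '<p class="bg-copy">&copy; 2011  The New York Times Company</p>',
  '<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">\n' +
    '&copy; 2011</a> \n' +
    '<a href="http://www.nytco.com/">The New York Times Company</a>',
  '<br>Published since 1996<br>Copyright &copy; CounterPunch<br>\n' +
    'All rights reserved.<br>',
];

const holders = samples.map(extractCopyrightHolder);
console.log(holders);

// A made-up layout the pattern does not cover: the name sits inside a tag
// that opens right after the year, so group 1 comes back empty.
const unsupported = extractCopyrightHolder('<p>&copy; 2011 <b>Acme Corp</b></p>');
console.log(unsupported);
```

Note that match() returns the whole match at index 0 and the captured name at index 1, and that it returns null when nothing matches, which is why the helper checks before indexing. The last case shows why the answer restricts its claim to the given samples and similar markup.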
