Extract all closed HTML tags with a regular expression


Problem description

I'm working in R, and I want to extract all closed HTML tags from a PlainTextDocument. I'm using gsub with a regex:

gsub("<?!([^<]/*)>"," ",fm,perl=TRUE,ignore.case=TRUE)

But, the slash '/' isn't evaluated.


I think I wasn't very clear.

Here is what I need to do :

I have a text (an HTML document) and I want to keep only the tags (<> and </>). I thought using gsub would be a good idea, but maybe you have a better solution.

Solution

The wording of your question is unclear, and your regex doesn't make much sense, but if you just want to match anything that looks like an HTML tag, this should do it:

"<[^<>]+>"

That will match both opening and closing tags (e.g., <tag attr="value"> and </tag>). If you want to match only self-closing tags (e.g., <tag />), this should work:

"<[^<>]+/>"
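Since the goal is to keep the tags rather than remove them, gsub may not be the most direct tool. Here is a minimal sketch in base R that pulls the matches out with gregexpr() and regmatches(). The sample string `html` and the variable names are made up for illustration and are not from the question.

```r
# Hypothetical sample input (not from the original post).
html <- '<p class="x">Hello <br /> world</p>'

# Anything that looks like a tag: opening, closing, or self-closing.
tag_pattern <- "<[^<>]+>"
# Self-closing tags only.
self_closing_pattern <- "<[^<>]+/>"

# gregexpr() locates every match; regmatches() returns the matched substrings.
all_tags <- regmatches(html, gregexpr(tag_pattern, html))[[1]]
self_closing <- regmatches(html, gregexpr(self_closing_pattern, html))[[1]]

print(all_tags)      # "<p class=\"x\">" "<br />" "</p>"
print(self_closing)  # "<br />"
```

For a character vector with several documents, drop the `[[1]]` and work with the list that regmatches() returns, one element per input string.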

Others have suggested that the slash (/) has special meaning and needs to be escaped, but that's not true. If you were using Perl, you might use this command to do the substitution:

s/<[^<>]+\/>/ /g

But the slash itself has no special meaning; I only had to escape it because I used it as the regex delimiter. I could just as easily use a different delimiter:

s~<[^<>]+/>~ ~g

But R doesn't support regexes at the language level like Perl does; the regex and the replacement are written in the form of string literals, just like they are (for example) in Java and C#. And unlike PHP, it doesn't require you to add delimiters anyway, as in:
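As a small sketch of what that looks like in R (the input string `html` is invented for the example), the pattern is passed to gsub as an ordinary string, with no delimiters and no backslash before the slash:

```r
# Hypothetical sample input (not from the original post).
html <- "<p>line one<br />line two</p>"

# The pattern is a plain R string; "/" is just a literal character here.
cleaned <- gsub("<[^<>]+/>", " ", html)

# The same call with the PCRE engine behaves identically for this pattern.
cleaned_pcre <- gsub("<[^<>]+/>", " ", html, perl = TRUE)

print(cleaned)       # "<p>line one line two</p>"
print(cleaned_pcre)  # "<p>line one line two</p>"
```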

preg_replace("/<[^<>]+\/>/", " ")

But even PHP allows you to choose your own delimiter:

preg_replace('~<[^<>]+/>~', ' ')

Before anyone calls me out on this, I know <[^<>]+> is flawed--that there is in fact no such thing as a correct regex for HTML tags. This will do in many cases, but the only truly reliable way to parse HTML is with a dedicated HTML parser.
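To make one of those flaws concrete, here is a small sketch with an invented input where an attribute value contains a literal `>`. The pattern stops at the first `>` it sees, so the opening tag is cut short:

```r
# Hypothetical input: the attribute value itself contains ">".
html <- '<a title="a > b">link</a>'

tags <- regmatches(html, gregexpr("<[^<>]+>", html))[[1]]

# The first match ends inside the attribute value instead of at the real tag end.
print(tags)  # "<a title=\"a >" "</a>"
```

Comments, CDATA sections, and script contents can trip the pattern up in similar ways, which is why a real parser is the safer choice when the input is not under your control.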
