How can I remove an entire HTML tag (and its contents) by its class using a regex?


Question




I am not very good with Regex but I am learning.

I would like to remove some HTML tags by class name. This is what I have so far:

<div class="footer".*?>(.*?)</div>

The first .*? is there because the tag might contain other attributes, and the second because the element might contain other HTML.

What am I doing wrong? I have tried a lot of variations without success.

Update

The DIV can span multiple lines, and I am working with Perl regexes.

Solution

You will also want to allow for other attributes before class in the div tag:

<div[^>]*class="footer"[^>]*>(.*?)</div>

Also, make the match case-insensitive. Depending on how the pattern is quoted, you may need to escape the quotes or the slash in the closing tag. What context are you doing this in?
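As a rough illustration of the pattern above, here is a minimal sketch in Python rather than Perl (a swap made only so the example is easy to run). It assumes the flags `re.IGNORECASE` and `re.DOTALL`, which play the role of Perl's `/i` and `/s` modifiers; the sample markup and variable names are made up for the example. The second substitution shows that, without the dot-matches-newline flag, a footer spanning several lines is not matched at all.

```python
import re

# Pattern taken from the answer; the flags are this sketch's assumption.
PATTERN = r'<div[^>]*class="footer"[^>]*>(.*?)</div>'

# Hypothetical input: upper-case tag, extra attribute, multi-line body.
html = '<p>Body</p>\n<DIV id="f" class="footer">\n  Copyright\n</DIV>\n<p>End</p>'

# Case-insensitive, and "." is allowed to match newlines.
cleaned = re.sub(PATTERN, "", html, flags=re.IGNORECASE | re.DOTALL)

# Same pattern without DOTALL: "." stops at the newline, so nothing matches.
without_dotall = re.sub(PATTERN, "", html, flags=re.IGNORECASE)

print(repr(cleaned))
print(without_dotall == html)
```

This only holds for a footer div that contains no nested div elements.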

Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:

<div>
    <div class="footer">
        <div>Hi!</div>
    </div>
</div>

Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.
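To make the failure concrete, the sketch below (Python, used here only as a convenient way to run the pattern) applies the regex from this answer to the nested structure shown above. The lazy `(.*?)` stops at the first `</div>`, which belongs to the inner div, so the footer's own closing tag is left behind and the remaining markup is unbalanced.

```python
import re

FOOTER_DIV = re.compile(
    r'<div[^>]*class="footer"[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

nested = """<div>
    <div class="footer">
        <div>Hi!</div>
    </div>
</div>"""

# The match ends at the inner </div>, not at the footer's closing tag.
leftover = FOOTER_DIV.sub("", nested)

print(leftover)
# Count opening and closing tags that survive the substitution.
print(leftover.count("<div"), leftover.count("</div>"))
```

A regular expression of this kind has no way to count nesting depth, which is why a DOM-based approach is the safer route.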

Pseudocode that should map closely to XML::DOM:

document = // load document
divs = document.getElementsByTagName("div");
for(div in divs) {
    if(div.getAttributes["class"] == "footer") {
        parent = div.getParent();
        // move each child up so it sits just before the footer div
        for(child in div.getChildren()) {
            // filter attribute types?
            parent.insertBefore(child, div);
        }
        // then drop the now-empty footer div
        parent.removeChild(div);
    }
}
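For readers who want something runnable, here is one way the pseudocode could be sketched with Python's standard `xml.dom.minidom`, chosen as a stand-in for XML::DOM. It assumes the input is well-formed XML (real-world HTML often is not), that the class attribute is exactly `footer`, and that the matching element is not the document root. The function name and the `keep_children` flag are hypothetical: with the flag set to `False` the element is dropped together with its contents, which is what the question title asks for.

```python
from xml.dom import minidom


def strip_by_class(xml_text, tag="div", class_name="footer", keep_children=True):
    """Remove every <tag class="class_name">; optionally keep its children."""
    doc = minidom.parseString(xml_text)
    # Copy the node list first because the tree is modified inside the loop.
    for element in list(doc.getElementsByTagName(tag)):
        if element.getAttribute("class") != class_name:
            continue
        parent = element.parentNode
        if keep_children:
            # insertBefore(new_child, reference_child) moves each child
            # out of the element and places it just before the element.
            for child in list(element.childNodes):
                parent.insertBefore(child, element)
        parent.removeChild(element)
    return doc.documentElement.toxml()


sample = '<div><div class="footer"><div>Hi!</div></div></div>'
print(strip_by_class(sample))                       # children are kept
print(strip_by_class(sample, keep_children=False))  # children are dropped
```

Because the parser tracks nesting, the inner div no longer confuses the removal the way it does with a regex.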


Perl libraries for this include HTML::DOM and XML::DOM.
.NET has built-in libraries for DOM parsing.
