How to get the content inside a div with a particular class


Question


In some HTML code I found a div like this:

****************other html codes**************************
<div class="imageContainer" style="width:736px;">
        <img src="http://media-cache-ak0.pinimg.com/736x/88/d4/8f/88d48f85d605906c88c05df6c931f8f1.jpg" class="pinImage" style="height:580px;width:540px;margin:0 auto;padding:40px 0px;" alt="Batman - OKAY, I really like this. This is BATMAN as a Knight. Now if the armor was brought up to a current timeline and was future tech - that would be something.">

    </div>
****************other html codes**************************



I would like to extract the content of the div with class="imageContainer", that is:

<img src="http://media-cache-ak0.pinimg.com/736x/88/d4/8f/88d48f85d605906c88c05df6c931f8f1.jpg" class="pinImage" style="height:580px;width:540px;margin:0 auto;padding:40px 0px;" alt="Batman - OKAY, I really like this. This is BATMAN as a Knight. Now if the armor was brought up to a current timeline and was future tech - that would be something.">

and put it into a textbox, using

Regex r = new Regex("")

Solution

Seriously? Don't.
Using a Regex to process HTML is a common mistake: it normally ends in tears, because HTML processing requires a fair amount more than "dumb" pattern matching: it is a hierarchical data structure, and really needs to be processed as such. Regexes are really, really bad at that.

You might be able to create a solution that works for that specific example, today, but there is a very, very good chance that it will break with the first change to the site you are scraping the data from, and either deliver no information, or the wrong information. The first is easy to spot, but the second normally requires a human, and is a real pain to sort out, particularly if it isn't spotted immediately, and "bad" data gets passed through the system and stored. Working out what info has been corrupted and fixing that can take a lot of manual work.

There are loads of HTML parsers / site scraping tools out there: have a google, find one that fits what you are trying to do, and use that. Going with a simple-to-implement regex will give you a lot more trouble in the future than investing a little time in doing it right, right from the start.

[edit]Typo: "chance" for "change" - OriginalGriff[/edit]
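
For illustration, here is a minimal sketch of the parser route recommended above. It assumes the third-party HtmlAgilityPack NuGet package, which the answer does not name; it is just one commonly used HTML parser for .NET. The class name DivExtractor, the method name GetDivContent and the sample HTML are made up for this example.

```csharp
// Assumption: the HtmlAgilityPack NuGet package is installed (not named in the thread).
using System;
using HtmlAgilityPack;

public static class DivExtractor
{
    // Returns the inner HTML of the first <div class="imageContainer">,
    // or an empty string when no such div exists.
    public static string GetDivContent(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: any div whose class attribute is exactly "imageContainer".
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@class='imageContainer']");
        return div == null ? "" : div.InnerHtml.Trim();
    }

    public static void Main()
    {
        // Shortened, hypothetical stand-in for the page in the question.
        string html =
            "<p>other html</p>" +
            "<div class=\"imageContainer\" style=\"width:736px;\">" +
            "<img src=\"http://example.com/pic.jpg\" class=\"pinImage\" alt=\"Batman\">" +
            "</div>" +
            "<p>other html</p>";

        Console.WriteLine(GetDivContent(html));
    }
}
```

Because the parser builds a tree, the lookup keeps working if the site reorders attributes or nests further elements inside the div. The XPath above does assume the class attribute holds only that one class name.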


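// Needs: using System.Text.RegularExpressions;
public string get_div(string html)
{
    // DIVReg holds the pattern; group 1 captures the div's inner HTML.
    Match match = Regex.Match(html, Properties.Resources.DIVReg, RegexOptions.IgnoreCase);

    // Return the captured content, or an empty string when nothing matches.
    return match.Success ? match.Groups[1].Value : "";
}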






Properties.Resources.DIVReg contains:

(?s)<div[^>]*?class="imageContainer"[^>]*?>(.*?)</div>
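
A lazy group such as (.*?) only captures something when a literal follows it for the match to stop at. The sketch below is a self-contained version of the regex approach, under two assumptions: the pattern is inlined as a constant instead of being read from Properties.Resources.DIVReg, and a closing </div> is used as the terminator. The class name RegexDivDemo and the sample strings are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDivDemo
{
    // Hypothetical inline copy of the pattern; the thread keeps it in a resource.
    // The lazy group stops at the first closing </div> that follows.
    private const string DivPattern =
        "(?s)<div[^>]*?class=\"imageContainer\"[^>]*?>(.*?)</div>";

    public static string GetDiv(string html)
    {
        Match match = Regex.Match(html, DivPattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value.Trim() : "";
    }

    public static void Main()
    {
        string flat = "<div class=\"imageContainer\"><img src=\"a.jpg\"></div>";
        string nested = "<div class=\"imageContainer\"><div>caption</div><img src=\"a.jpg\"></div>";

        Console.WriteLine(GetDiv(flat));   // <img src="a.jpg">
        Console.WriteLine(GetDiv(nested)); // <div>caption
    }
}
```

The second call shows the fragility the first answer warns about: as soon as the container holds a nested div, the pattern stops at the wrong closing tag and silently returns truncated content.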

