Screen Scraping HTML with C#

Question

I have been given the task at work of screen scraping one of our legacy web apps to extract certain data from the code. The data is formatted and "should" be displayed exactly the same every time. I am just not sure how to go about doing this. It's a full HTML file with header and footer navigation, but in the middle of all this is the data I need.

I need to extract the Company Name value, Contact Name, Telephone, email address, etc.

Here is an example of what the code looks like:

...html above here

<br /><br />
<table cellpadding="0" cellspacing="12" border="0">
    <tr>
        <td valign="top" align="center">
            <!-- Company Info -->

            <table cellpadding="0" cellspacing="0" border="0">
                <tr>
                    <td class="black">
                        <table cellspacing="1" cellpadding="0" border="0" width="370">
                            <tr>
                                <th>ABC INDUSTRIES</th>
                            </tr>
                            <tr>
                                <td class="search">

                                    <table cellpadding="5" cellspacing="0" border="0" width="100%">
                                        <tr>
                                            <td>
                                                <table cellpadding="1" cellspacing="0" border="0" width="100%">
                                                    <tr>
                                                        <td align="center" colspan="2"><hr></td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Contact Person&nbsp;<img src="/images/icon_contact.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;Joe Smith</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">Phone Number&nbsp;<img src="/images/icon_phone.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;555-555-5555</td>
                                                    </tr>
                                                    <tr>
                                                        <td align="right" nowrap><b><font color="FF0000">E-mail Address&nbsp;<img src="/images/icon_email.gif" align="absmiddle">&nbsp;:</font></b></td>
                                                        <td align="left" width="100%">&nbsp;<a HREF="mailto:joe@joe.com">joe@joe.com</a></td>
                                                    </tr>
                                                    more...

There is more code on the screen in a different table structure that I also need to pull.

Recommended answer

Are you just looking for suggestions on how to accomplish this? The HTML Agility Pack is probably going to be your best bet for DOM parsing in general. There may be a good bit of tinkering and trial and error to maintain your screen scrape (there usually is for that sort of thing), but that library is pretty good for parsing HTML.
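
As a rough illustration of that suggestion, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is installed and that the page has already been downloaded into a string. The class name `ScrapeSketch`, the helper `ExtractFields`, and both XPath queries are hypothetical: they are based only on the sample markup in the question, and the real page may well need different queries.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class ScrapeSketch
{
    // Hypothetical helper: maps each label cell ("Contact Person", ...)
    // to the text of the value cell next to it.
    public static Dictionary<string, string> ExtractFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed tags such as <hr>

        var fields = new Dictionary<string, string>();

        // Assumption: the company name is the first <th> on the page.
        HtmlNode header = doc.DocumentNode.SelectSingleNode("//th");
        if (header != null)
            fields["Company Name"] = Clean(header.InnerText);

        // Assumption: label cells are the right-aligned <td> elements.
        HtmlNodeCollection labels =
            doc.DocumentNode.SelectNodes("//td[@align='right']");
        if (labels == null)
            return fields; // SelectNodes returns null when nothing matches

        foreach (HtmlNode label in labels)
        {
            // The value is the next <td> in the same row.
            HtmlNode value = label.SelectSingleNode("following-sibling::td");
            if (value == null)
                continue;

            string key = Clean(label.InnerText).TrimEnd(':').Trim();
            fields[key] = Clean(value.InnerText);
        }
        return fields;
    }

    // Decodes entities such as &nbsp; and trims the surrounding whitespace.
    static string Clean(string text) =>
        HtmlEntity.DeEntitize(text).Replace('\u00A0', ' ').Trim();
}
```

For the sample in the question, this is intended to produce entries keyed by "Company Name", "Contact Person", "Phone Number" and "E-mail Address". The other table structure mentioned in the question would need its own queries.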

Technically, any XML parsing (even native LINQ to XML) should do the trick, but websites have a nasty habit of not being well-formed so you may run into small headaches here and there.
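
To make the well-formedness caveat concrete, here is a small sketch using only the .NET base class library. The class name `XmlParseSketch` and the helper `IsWellFormedXml` are hypothetical names chosen for illustration. The helper simply reports whether LINQ to XML accepts a fragment.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class XmlParseSketch
{
    // Hypothetical helper: true if LINQ to XML can parse the fragment,
    // false if the parser rejects it as not well-formed.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for unclosed tags, attributes without values,
            // and entities such as &nbsp; that XML does not define.
            return false;
        }
    }
}
```

A tidy fragment such as `<td align="left">Joe Smith</td>` passes this check. Markup in the style of the question's sample, with a bare `nowrap` attribute, an unclosed `<hr>`, or `&nbsp;`, does not, which is where a tolerant HTML parser earns its keep.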
