Logic for Implementing a Dynamic Web Scraper in C#


Question

I want to develop a web scraper in C# using Windows Forms. This is what I am trying to accomplish:


  1. Get the URL from the user.
  2. Load the web page in the IE UI control (embedded browser) in WinForms.
  3. Allow the user to select a short, contiguous piece of text (no more than 50 characters) from the loaded web page.
  4. When the user chooses to keep the location (the HTML DOM location), persist it to the database so that it can be used to fetch the data at that location on subsequent visits.

Assume the loaded website is a price-listing site whose quoted rate keeps changing. The idea is to persist the DOM hierarchy so that I can traverse it again next time.

I could do this if every HTML element had an id attribute. When the id is null, I cannot.

Could someone suggest a workable approach (with a bare-minimum code snippet if possible)?

Pointers to online resources would also be helpful.

Thanks,

vijay

Recommended answer

One approach is to build a stack of tags/styles/ids leading down to the element you want to select.

From the element you want, traverse up to the nearest element that has an id. This gets rid of most of the top-level markup (headers and so on). Then build a sequence to look for.

Example:

<html>
  <body>
    <!-- lots of html -->
    <div id="main">
       <div>
          <span>
             <div class="pricearea">
                <table> <!-- with price data -->

For this example you would store in your db a sequence such as [id=main],div,span,div,table or perhaps div[class=pricearea],table.
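A minimal sketch of how such a sequence could be produced, assuming the HTML Agility Pack NuGet package (`HtmlAgilityPack`) is used for parsing. The class and method names (`DomPathBuilder`, `BuildPath`) are hypothetical, and the comma-separated format simply mirrors the sequence written above; ids that contain commas would need escaping, which this sketch does not handle.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: not part of HTML Agility Pack.
public static class DomPathBuilder
{
    // Walks up from the selected node and stops at the nearest element
    // (the node itself or an ancestor) that carries an id attribute.
    // If no such element exists, the path starts at the root element.
    public static string BuildPath(HtmlNode selected)
    {
        var steps = new Stack<string>();
        for (HtmlNode node = selected;
             node != null && node.NodeType == HtmlNodeType.Element;
             node = node.ParentNode)
        {
            string id = node.GetAttributeValue("id", "");
            if (id.Length > 0)
            {
                steps.Push("[id=" + id + "]");   // anchor found, stop climbing
                break;
            }
            steps.Push(node.Name);               // plain tag step
        }
        // A Stack enumerates last-pushed first, so the anchor comes out first.
        return string.Join(",", steps);
    }
}
```

Called on the `<table>` node of the example markup, `BuildPath` should yield `[id=main],div,span,div,table`, which is the string to persist.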

Styles/classes can also be used to create your path. It's your choice whether to look for a tag, an attribute of a tag, or a combination. To make it robust, you want the path as accurate as possible with as few elements as possible.

If the layout seldom changes, this lets you navigate to the same location each time.
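One way to navigate back to the stored location is to translate the persisted sequence into an XPath expression. The sketch below assumes the comma-separated step format used in the example (`[id=main]`, `div`, `div[class=pricearea]`) and the HTML Agility Pack package; `DomPathResolver` and its methods are hypothetical names. Attribute values are matched exactly, so an element with several classes or a value containing a single quote would need extra handling.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: not part of HTML Agility Pack.
public static class DomPathResolver
{
    // "[id=main]"            -> "*[@id='main']"
    // "div[class=pricearea]" -> "div[@class='pricearea']"
    // "table"                -> "table"
    private static string ToXPathStep(string step)
    {
        step = step.Trim();
        int open = step.IndexOf('[');
        if (open < 0) return step;                       // plain tag name
        string tag = open == 0 ? "*" : step.Substring(0, open);
        string[] attr = step.Substring(open + 1).TrimEnd(']')
                            .Split(new[] { '=' }, 2);
        if (attr.Length != 2)
            throw new FormatException("Malformed step: " + step);
        return tag + "[@" + attr[0] + "='" + attr[1] + "']";
    }

    public static string ToXPath(string storedPath)
    {
        return "//" + string.Join("/", storedPath.Split(',').Select(ToXPathStep));
    }

    // Returns null when the layout has changed and the path no longer matches.
    public static HtmlNode Resolve(HtmlDocument doc, string storedPath)
    {
        return doc.DocumentNode.SelectSingleNode(ToXPath(storedPath));
    }
}
```

A null result from `Resolve` is a useful signal that the page layout changed and the user should reselect the text.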

I would also suggest using HTML Agility Pack or something similar for the DOM parsing, as the IE control is slow.
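As a rough illustration of that suggestion, the sketch below fetches a page with HTML Agility Pack's `HtmlWeb` and reads the text at an XPath location without involving the browser control. `PriceFetcher` and its methods are hypothetical names, and the URL and XPath passed in are whatever was stored for the page. Note that HTML Agility Pack parses the HTML as downloaded and does not run JavaScript, so values filled in by scripts will not be present.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper: not part of HTML Agility Pack.
public static class PriceFetcher
{
    // Returns the trimmed, entity-decoded text at the given XPath,
    // or null when no node matches.
    public static string ReadText(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // Downloads the page and reads the text at the stored location.
    public static string FetchText(string url, string xpath)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        return ReadText(doc, xpath);
    }
}
```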

Screen scraping is fun, but it's difficult to get it 100% right for all pages. Good luck!
