Web scraping to get the main content of a web page?


Question

Hello All,

I have to create a web service that, given any URL, downloads the content of that web page (this can be done with WebClient or WebRequest).
I then need to parse the HTML document I have downloaded (this can be done with HtmlAgilityPack).

But I don't know how to get the relevant content from that page. If the web page is an article, I need to extract the article content and leave out all the ads, other links, etc. If the page contains no long text but a lot of links and buttons, I have to download them along with the CSS and JS.

I tried this with HtmlAgilityPack, but for some sites it cannot get the real content of the CSS and JS files. When I open these files after downloading, they are either blank or contain an error message.

I searched and found things like data mining algorithms and HTML parsing, but I didn't find any code sample, API, or even a clear example.
Can Readability be used for this?

Please explain anything related to this topic. What should my approach be?

How can I achieve this?

(Also: I need to extract the relevant content, and in some cases the whole web page, and save it for offline reading.)

Thanks in advance.

Recommended answer

You need to define 'relevant content' precisely. At the moment you're having trouble because you don't actually know what you want. If you have downloaded the whole page, you have the complete object tree: you can analyse it, assign values to various metrics for each node, and run a heuristic analysis to find the 'main' node. But you need to define what you are looking for.

Because divs can be used both for actual divisions and for layout, this can be quite difficult. How do you distinguish between

<pre><div id="layout">
 <div id="content">
  <p>bla bla</p>
 </div>
 <div id="footer">
  <p>some template footer stuff
 </div>
</div></pre>



... and

<pre><div id="content">
 <div id="p1">
  <p>bla bla</p>
 </div>
 <div id="p2">
  <p>yak yak
 </div>
</div></pre>



... where you want only the first paragraph in the top example but both in the second? If the divs have different styles, that can be a clue, but then again you want to include image captions, insets and other divs which may have a different style.
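To make the "assign metrics to each node and pick the 'main' one" idea concrete, here is a minimal sketch in Python using only the standard library (the question is about .NET and HtmlAgilityPack, so treat this as an illustration of the approach, not as that library's API). The metric is an assumption chosen for the example: each container is scored by the amount of non-link text in its direct `<p>` children. The names `TreeBuilder`, `score` and `find_main_node` are hypothetical. It does not resolve the ambiguity shown above; it only shows where your own definition of "relevant" would plug in.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
CONTAINER_TAGS = {"div", "article", "section", "td", "body"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text anywhere under this node
        self.link_len = 0  # characters of that text that sit inside <a>


class TreeBuilder(HTMLParser):
    """Builds a small element tree and records text lengths per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p" and self.current.tag == "p":
            # An unclosed <p> ends when the next <p> starts.
            self.current = self.current.parent
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        n = len(data.strip())
        chain = []
        node = self.current
        while node is not None:
            chain.append(node)
            node = node.parent
        tags = {a.tag for a in chain}
        if n == 0 or tags & SKIP_TAGS:
            return
        for ancestor in chain:
            ancestor.text_len += n
            if "a" in tags:
                ancestor.link_len += n


def score(node):
    # Assumed metric: non-link text held by direct <p> children.
    return sum(c.text_len - c.link_len for c in node.children if c.tag == "p")


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def find_main_node(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in walk(builder.root) if n.tag in CONTAINER_TAGS]
    return max(candidates, key=score, default=None)


SAMPLE = """
<div id="layout">
 <div id="nav"><a href="/a">Home</a><a href="/b">News</a></div>
 <div id="content">
  <p>First paragraph of the article with some length.</p>
  <p>Second paragraph, also fairly long.</p>
 </div>
 <div id="footer"><p>Footer</p></div>
</div>
"""

if __name__ == "__main__":
    print(find_main_node(SAMPLE).attrs.get("id"))
```

Swapping in a different `score` function (style information, paragraph count, position in the tree) is where a precise definition of "relevant content" would go.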

If you know the sites that you're scraping, you can use that information to help you. In the extreme case, you will know that site X puts its main content inside a <div id="content"/> and you can just go straight to that item.
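For the known-site case, going straight to the element could look roughly like the sketch below. It is written in Python with the standard library only, as an illustration; with HtmlAgilityPack you would select the node by id instead. The helper name `extract_by_id` is hypothetical, and the id `content` is simply the one from the example above.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collects the text found inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.stack = []   # open tags inside the target; empty means outside
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.stack:
            self.stack.append(tag)
        elif dict(attrs).get("id") == self.target_id:
            self.stack = [tag]

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the nearest matching open tag, which
            # also copes with unclosed <p> elements inside the target.
            while self.stack.pop() != tag:
                pass
            if not self.stack:
                self.done = True

    def handle_data(self, data):
        if self.stack and data.strip():
            self.parts.append(data.strip())


def extract_by_id(html, target_id):
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


PAGE = """
<div id="layout">
 <div id="content">
  <p>bla bla</p>
 </div>
 <div id="footer">
  <p>some template footer stuff
 </div>
</div>
"""

if __name__ == "__main__":
    print(extract_by_id(PAGE, "content"))
```

A per-site table mapping host names to element ids would be one simple way to use this, falling back to a heuristic for sites you do not know.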

Any page that relies on scripts to work (i.e. to load the primary content) won't be readable unless you run a scripting engine. Not many do that for the main content, but it's something to be aware of.
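One rough way to notice such pages before trying to extract anything is to compare the amount of visible text in the downloaded HTML with the presence of scripts. The sketch below (Python, standard library only) does that; the function name `probably_needs_scripts` and the 200-character threshold are assumptions made for illustration, not established values, and the check will misjudge some pages.

```python
from html.parser import HTMLParser


class TextVsScripts(HTMLParser):
    """Counts <script> elements and visible text characters."""

    def __init__(self):
        super().__init__()
        self.hidden = 0         # >0 while inside <script> or <style>
        self.scripts = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self.hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden:
            self.hidden -= 1

    def handle_data(self, data):
        if not self.hidden:
            self.visible_chars += len(data.strip())


def probably_needs_scripts(html, min_chars=200):
    # Assumed rule of thumb: scripts present but almost no visible text.
    parser = TextVsScripts()
    parser.feed(html)
    parser.close()
    return parser.scripts > 0 and parser.visible_chars < min_chars


SCRIPT_ONLY = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
STATIC = "<html><body><p>" + "word " * 100 + "</p></body></html>"

if __name__ == "__main__":
    print(probably_needs_scripts(SCRIPT_ONLY), probably_needs_scripts(STATIC))
```

Pages flagged this way would need a scripting engine (for example a headless browser) before any content extraction is attempted.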

