Web mining or scraping or crawling? What tool/library should I use?


Question

I want to crawl and save some webpages as HTML. Say, crawl into hundreds of popular websites and simply save their front pages and their "About" pages.

I've looked through many questions, but didn't find an answer to this under either the web-crawling or the web-scraping tags.

What library or tool should I use to build the solution? Or are there existing tools that can handle this?

Answer

There really is no single good solution here. You are right to suspect that Python is probably the best way to start, because of its incredibly strong support for regular expressions.
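The fetch-and-save part of the question is small either way. A minimal sketch of it, assuming just the Python standard library (the site list and User-Agent string below are placeholders, not anything prescribed by the question):

```python
# Minimal fetch-and-save sketch, standard library only.
# SITES and the User-Agent string are placeholders.
import urllib.request
from urllib.parse import urlparse

SITES = ["https://example.com"]  # your list of popular sites goes here

def save_frontpage(url):
    req = urllib.request.Request(url, headers={"User-Agent": "my-crawler/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read()
    # Name the file after the host, e.g. "example.com.html".
    with open(urlparse(url).netloc + ".html", "wb") as f:
        f.write(html)

for site in SITES:
    save_frontpage(site)
```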

To implement something like this, a strong knowledge of SEO (Search Engine Optimization) would help, since effectively optimizing a webpage for search engines tells you how search engines behave. I would start with a site like SEOMoz.

As far as identifying the "About us" page goes, you have only two options:

a) For each site, get the link to its "About us" page and feed it to your crawler.

b) Parse all the links on the page for certain keywords such as "about us", "about", "learn more", and so on.

When using option (b), be careful: you can get stuck in an infinite loop, since a website will link to the same page many times, especially when the link is in the header or footer, and a page may even link back to itself. To avoid this, you'll need to keep a list of visited links and make sure not to revisit them.
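To make option (b) concrete, here is a rough standard-library sketch that scans a fetched page for links whose anchor text contains "about"-style keywords and tracks visited URLs to avoid that loop. The keyword list and helper names are my own assumptions, not a fixed recipe:

```python
# Rough sketch of option (b): find links whose anchor text contains
# "about"-style keywords, with a visited set so header/footer links
# and self-links can't cause an infinite loop.
# KEYWORDS and all helper names here are assumptions, not a standard.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("about us", "about", "learn more")

class AboutLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag we are currently inside
        self.matches = []   # hrefs whose anchor text matched a keyword

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and any(k in data.lower() for k in KEYWORDS):
            self.matches.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

def find_about_links(page_url, visited):
    """Return candidate 'About' URLs on page_url, skipping visited ones."""
    if page_url in visited:
        return []
    visited.add(page_url)
    with urllib.request.urlopen(page_url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    finder = AboutLinkFinder()
    finder.feed(html)
    # Resolve relative hrefs and drop anything already visited.
    return [urljoin(page_url, h) for h in finder.matches
            if h and urljoin(page_url, h) not in visited]
```

The visited set is shared across calls, which is what stops the header/footer self-links from being crawled twice.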

Finally, I would recommend having your crawler respect the instructions in the robots.txt file, and it's probably a good idea not to follow links marked rel="nofollow", as these are mostly used on external links. Again, you can learn this and more by reading up on SEO.
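The robots.txt part is already covered by the standard library's urllib.robotparser; a minimal sketch (the user-agent string is again a placeholder):

```python
# robots.txt check with the standard library's urllib.robotparser;
# run this before fetching any page. The user-agent is a placeholder.
import urllib.robotparser
from urllib.parse import urljoin

def allowed_by_robots(site_root, page_url, agent="my-crawler/0.1"):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(site_root, "/robots.txt"))
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(agent, page_url)
```

Skipping rel="nofollow" links is then just an extra attribute check in the link parser: ignore any <a> tag whose rel attribute contains "nofollow".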

Regards,

