Scrape all of Google's search results based on certain criteria?


Question

I am working on my mapper and I need to get the full map of newegg.com.

I could try to scrape NE directly (which kind of violates NE's policies), but they have many products that are not available via direct NE search, only via google.com search; and I need those links too.

Here is the search string that returns 16 million results: https://www.google.com/search?as_q=&as_epq=.com%2FProduct%2FProduct.aspx%3FItem%3D&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=newegg.com&as_occt=url&safe=off&tbs=&as_filetype=&as_rights=

I want my scraper to go over all the results and log hyperlinks to all of them. I can scrape the links from Google's search results, but Google limits each query to 100 pages, i.e. 1,000 results, and again, Google is not happy with this approach. :)
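For reference, the 100-page cap works out to result pages addressed via the `start` query parameter in steps of 10. A minimal sketch of enumerating those page URLs (the helper name is mine, and note that Google actively blocks automated fetching of these pages):

```python
from urllib.parse import urlencode

def google_result_pages(query_params, pages=100, per_page=10):
    """Yield the URL of each result page, up to Google's ~1,000-result cap."""
    base = "https://www.google.com/search"
    for page in range(pages):
        params = dict(query_params, start=page * per_page)
        yield f"{base}?{urlencode(params)}"

# The advanced-search query from the question, reduced to its key fields:
params = {
    "as_epq": ".com/Product/Product.aspx?Item=",
    "as_sitesearch": "newegg.com",
    "as_occt": "url",
}
urls = list(google_result_pages(params))
```

This only builds the URLs; actually fetching them runs into the rate limiting discussed in the answer below.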

I am new to this; could you advise / point me in the right direction? Are there any tools or methodologies that could help me achieve my goals?

Answer

> I am new to this; could you advise / point me in the right direction? Are there any tools or methodologies that could help me achieve my goals?

Google takes a lot of steps to prevent you from crawling their pages, and I'm not talking about merely asking you to abide by their robots.txt. I don't agree with their ethics, nor their T&C, not even the "simplified" version that they pushed out (but that's a separate issue).

If you want to be seen, then you have to let Google crawl your page; however, if you want to crawl Google, then you have to jump through some major hoops! Namely, you have to get a bunch of proxies so you can get past the rate limiting and the 302s + CAPTCHA pages that they put up any time they get suspicious about your "activity."

Despite being thoroughly aggravated by Google's T&C, I would NOT recommend that you violate them! However, if you absolutely need to get the data, then you can get a big list of proxies, load them into a queue, and pull a proxy from the queue each time you want to fetch a page. If the proxy works, put it back in the queue; otherwise, discard it. You could even keep a failure counter for each proxy and discard it once it exceeds some number of failures.
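The rotation scheme described above can be sketched as a small pool; the class and parameter names here are mine, and you would plug in your own proxy list and a real fetch function:

```python
from collections import deque

class ProxyPool:
    """Rotate proxies; retire one after max_failures failed attempts."""

    def __init__(self, proxies, max_failures=3):
        self.queue = deque(proxies)
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def fetch(self, url, fetch_via_proxy):
        """Try proxies from the queue until one succeeds or none remain."""
        while self.queue:
            proxy = self.queue.popleft()
            try:
                result = fetch_via_proxy(url, proxy)
            except Exception:
                self.failures[proxy] += 1
                if self.failures[proxy] < self.max_failures:
                    self.queue.append(proxy)  # give it another chance later
                continue                      # otherwise it stays discarded
            self.failures[proxy] = 0
            self.queue.append(proxy)          # working proxy goes back in
            return result
        raise RuntimeError("all proxies exhausted")
```

`fetch_via_proxy` would be whatever your HTTP client offers (e.g. a `requests.get` call with a `proxies=` argument); a proxy that raises is counted against and eventually dropped, exactly as the answer suggests.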
