Scrape all of Google's search results based on certain criteria?


Question

I am working on my mapper and I need to get the full map of newegg.com.

I could try to scrape NE directly (which kind of violates NE's policies), but they have many products that are not available via direct NE search, only via google.com search; and I need those links too.

Here is the search string that returns 16 million results: https://www.google.com/search?as_q=&as_epq=.com%2FProduct%2FProduct.aspx%3FItem%3D&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=newegg.com&as_occt=url&safe=off&tbs=&as_filetype=&as_rights=

I want my scraper to go over all the results and log hyperlinks to all of them. I can scrape the links from Google's search results, but Google limits each query to 100 pages, i.e. 1,000 results, and again, Google is not happy with this approach. :)
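For reference, the 100-page cap works out to result pages addressed via the `start` query parameter in steps of 10. A minimal sketch of enumerating those page URLs (the helper name is mine, and note that Google actively blocks automated fetching of these pages):

```python
from urllib.parse import urlencode

def google_result_pages(query_params, pages=100, per_page=10):
    """Yield the URL of each result page, up to Google's ~1,000-result cap."""
    base = "https://www.google.com/search"
    for page in range(pages):
        params = dict(query_params, start=page * per_page)
        yield f"{base}?{urlencode(params)}"

# The advanced-search query from the question, reduced to its key fields:
params = {
    "as_epq": ".com/Product/Product.aspx?Item=",
    "as_sitesearch": "newegg.com",
    "as_occt": "url",
}
urls = list(google_result_pages(params))
```

This only builds the URLs; actually fetching them runs into the rate limiting discussed in the answer below.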

I am new to this; could you advise / point me in the right direction? Are there any tools or methodologies that could help me achieve my goals?

Answer

> I am new to this; could you advise / point me in the right direction? Are there any tools or methodologies that could help me achieve my goals?

Google takes a lot of steps to prevent you from crawling their pages, and I'm not talking about merely asking you to abide by their robots.txt. I don't agree with their ethics, nor their T&C, not even the "simplified" version that they pushed out (but that's a separate issue).

If you want to be seen, then you have to let Google crawl your page; however, if you want to crawl Google, then you have to jump through some major hoops! Namely, you have to get a bunch of proxies so you can get past the rate limiting and the 302s + CAPTCHA pages that they put up any time they get suspicious about your "activity."

Despite being thoroughly aggravated by Google's T&C, I would NOT recommend that you violate them! However, if you absolutely need to get the data, then you can get a big list of proxies, load them into a queue, and pull a proxy from the queue each time you want to fetch a page. If the proxy works, put it back in the queue; otherwise, discard it. You could even keep a failure counter for each proxy and discard it once it exceeds some number of failures.
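The rotation scheme described above can be sketched as a small pool; the class and parameter names here are mine, and you would plug in your own proxy list and a real fetch function:

```python
from collections import deque

class ProxyPool:
    """Rotate proxies; retire one after max_failures failed attempts."""

    def __init__(self, proxies, max_failures=3):
        self.queue = deque(proxies)
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def fetch(self, url, fetch_via_proxy):
        """Try proxies from the queue until one succeeds or none remain."""
        while self.queue:
            proxy = self.queue.popleft()
            try:
                result = fetch_via_proxy(url, proxy)
            except Exception:
                self.failures[proxy] += 1
                if self.failures[proxy] < self.max_failures:
                    self.queue.append(proxy)  # give it another chance later
                continue                      # otherwise it stays discarded
            self.failures[proxy] = 0
            self.queue.append(proxy)          # working proxy goes back in
            return result
        raise RuntimeError("all proxies exhausted")
```

`fetch_via_proxy` would be whatever your HTTP client offers (e.g. a `requests.get` call with a `proxies=` argument); a proxy that raises is counted against and eventually dropped, exactly as the answer suggests.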
