Browser-based client-side scraping


Question

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?

For a shopping comparison site, I need to scrape pages of an e-commerce site, but sending several requests from my server would get me banned. So I'm looking for ways to do client-side scraping: that is, request pages from the user's IP and send them to the server for processing.

Recommended answer

No, you won't be able to use your clients' browsers to scrape content from other websites with JavaScript, because of a security measure called the same-origin policy.
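To make the restriction concrete: under the same-origin policy, a page's script can read responses only from its own origin (scheme, host, and port), unless the other site explicitly opts in through CORS headers. The following is a minimal sketch of that origin comparison, not the browser's actual implementation; the function name `isSameOrigin` and the example.com URLs are hypothetical and used only for illustration.

```typescript
// Illustrative sketch of the origin comparison behind the same-origin policy.
// Two URLs share an origin only if scheme, host, and port all match.
// `isSameOrigin` is a hypothetical helper, not a browser API.
function isSameOrigin(a: string, b: string): boolean {
  const urlA = new URL(a);
  const urlB = new URL(b);
  return (
    urlA.protocol === urlB.protocol && // scheme, e.g. "https:"
    urlA.hostname === urlB.hostname && // host, e.g. "compare.example.com"
    urlA.port === urlB.port            // port ("" when it is the scheme's default)
  );
}

// A comparison site and the shop it wants to scrape are different origins,
// so script on the first cannot read pages fetched from the second.
console.log(isSameOrigin("https://compare.example.com/", "https://shop.example.com/item/1"));
```

A different path on the same scheme, host, and port is still the same origin; a different subdomain, scheme, or port is not.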

There should be no way to circumvent this policy, and that's for a good reason. Imagine you could instruct your visitors' browsers to do anything on any website. That's not something you want to happen automatically.

However, you could create a browser extension to do that. Browser extensions written in JavaScript can be granted more privileges than regular page JavaScript.

Adobe Flash has similar security features, but I guess you could use Java (not JavaScript) to create a web scraper that uses your user's IP address. Then again, you probably don't want to do that, as Java plugins are considered insecure (and slow to load!) and not all users will even have one installed.

Now back to your question:

"I need to scrape pages of an e-com site but several requests from the server would get me banned."

If the owner of that website doesn't want you to use their service in that way, you probably shouldn't do it. Otherwise you would risk legal implications (look here for details).

If you are on the "dark side of the law" and don't care whether that's illegal or not, you could use something like http://luminati.io/ to use the IP addresses of real people.
