Does Facebook know I'm scraping it with PhantomJS and can it change its website to counter me?


Question

So, maybe I'm paranoid.

I'm scraping my Facebook timeline for a hobby project using PhantomJS. Basically, I wrote a program that finds all of the ads by querying the page for the text Sponsored with XPath inside of Phantom's page.evaluate block. The text was being displayed as the innerHTML of HTML a elements.

Things were working great for a few days and it was finding tons of ads.

Then it stopped returning any results.

When I logged into Facebook manually to inspect the elements again, I found that the word Sponsored was now appearing on the page in an ::after pseudo-element with the CSS property content: sponsored. This means that an XPath query for the text no longer yields any results, because generated content is not part of the DOM text. No joke, Facebook seemed to have changed the way they rendered this word after being scraped for a couple of days.
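To illustrate why the XPath text query goes blank: text produced by a CSS content property lives in the computed style rather than in the DOM, so it has to be read through getComputedStyle. The following is only a sketch under assumptions. The function name findByAfterContent, the selector, and the win parameter are hypothetical, and whether PhantomJS's older WebKit reports the pseudo-element's content value in the same way has not been verified here.

```javascript
// Sketch: collect elements whose ::after pseudo-element renders the given text.
// `win` is a window-like object (hypothetical parameter, passed in so the
// function can also be exercised with a stub outside a browser).
function findByAfterContent(win, selector, needle) {
  var matches = [];
  var nodes = win.document.querySelectorAll(selector);
  for (var i = 0; i < nodes.length; i++) {
    // Generated content is exposed on the computed style of the pseudo-element.
    var content = win.getComputedStyle(nodes[i], '::after').content || '';
    // Computed values are usually quoted, e.g. '"Sponsored"'; strip the quotes.
    var text = content.replace(/^["']|["']$/g, '');
    if (text.toLowerCase() === needle.toLowerCase()) {
      matches.push(nodes[i]);
    }
  }
  return matches;
}
```

Inside PhantomJS, page.evaluate runs its callback in the page's own sandbox, so a helper like this would have to be defined within that callback and called as findByAfterContent(window, 'a', 'Sponsored').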

Paranoid. I told you.

So, I offer this question to the community of JavaScript, web-scraping, and PhantomJS developers out there. What the heck is going on? Can Facebook know what my PhantomJS program is doing inside of the page.evaluate block?

If so, how? Would my Phantom commands appear in a key logger program embedded in the page, for instance?

What's your theory?

Recommended answer

It is perfectly possible to detect PhantomJS even if the user agent is spoofed. There are plenty of little ways in which it differs from other browsers, among others:

  • Wrong order of headers
  • Lack of media plugins and latest JS capabilities
  • PhantomJS-specific methods, like window.callPhantom
  • PhantomJS name in the stack trace

And many others.
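As a rough illustration of the client-side checks in the list above, a page script could probe for these signals roughly as follows. This is a minimal sketch, not what Facebook or any particular site actually runs. The function name phantomSignals and the signal labels are hypothetical. Header order is not covered because it can only be observed on the server.

```javascript
// Sketch: return a list of hints that the page may be running in PhantomJS.
// `win` is a window-like object (hypothetical parameter so the function can
// be tried with a stub). No single signal is conclusive on its own.
function phantomSignals(win) {
  var signals = [];

  // PhantomJS-specific method exposed on the window object.
  if (typeof win.callPhantom === 'function') {
    signals.push('callPhantom');
  }

  // No media plugins reported. Some real browsers also report none,
  // so this is only a weak hint.
  var plugins = win.navigator && win.navigator.plugins;
  if (!plugins || plugins.length === 0) {
    signals.push('no-plugins');
  }

  // Force an error and look for the PhantomJS name in the stack trace.
  try {
    null[0]();
  } catch (e) {
    var stack = e && typeof e.stack === 'string' ? e.stack.toLowerCase() : '';
    if (stack.indexOf('phantomjs') !== -1) {
      signals.push('stack-trace');
    }
  }

  return signals;
}
```

A site that collects such signals can tell that a headless browser is visiting without ever seeing the code passed to page.evaluate.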

Please refer to this excellent article and the presentation linked there for details: https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/

Maybe Puppeteer would be a better fit for your needs, as it is based on a real, cutting-edge Chromium browser.
