Scraper fails on files over ~390KB


Question

Does Facebook's URL scraper have a size limitation? We have several books available on a website. Those whose HTML file size is under a certain size (~390KB) get scraped and read properly, but the 4 that are larger do not. These larger items get a 200 response code, and the canonical URL opens.

All of these pages are built using the same template; the only differences are the size of the content within each book and the number of links each book makes to other pages on the site.

Steps to reproduce:

  1. Click on the canonical URL.
  2. Open Firebug in Firefox, or the developer tools in Chrome, to the Network tab.
  3. The *.html size is >~390KB for the listed failures and <~390KB for the successes.
  4. Click on "See exactly what our scraper sees for your URL".
  5. A blank page is shown for the failures; HTML is present for the successes.
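The size comparison above can also be reproduced from the command line by fetching a page the same way the scraper does (a sketch assuming curl is available; the helper name is made up here):

```shell
# Fetch a URL with Facebook's crawler user agent and print only the
# HTTP status and the number of bytes downloaded, so the ~390KB
# failures can be compared against the smaller successes.
scrape_like_facebook() {
  curl -s -o /dev/null \
    -A "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" \
    -w "status=%{http_code} bytes=%{size_download}\n" \
    "$1"
}
```

For example, `scrape_like_facebook "http://rcg.org/books/tapom.html"` prints a `status=... bytes=...` line for one of the failing books.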

Failures:

  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftapom.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftbgpu.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fttjc.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftbdse.html

Successes:

  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fthogtc.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Faabibp.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftww.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftsosw.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fsyottc.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fttigtio.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Faadac.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Fsiud.html
  • https://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Frcg.org%2Fbooks%2Ftuyc.html

Answer

A solution for your problem might be to check whether a real user or the Facebook bot is visiting your page. If it is the bot, then render only the necessary metadata for it. You can detect the bot via its user agent, which according to the Facebook documentation is:

"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

The code would look something like this (in PHP):

// Returns true when the request comes from Facebook's crawler.
// Matching on the "facebookexternalhit" prefix is more robust than an
// exact string comparison, since the version number in the user agent
// may change, and the isset() guard avoids a notice when no
// User-Agent header is sent.
function userAgentIsFacebookBot() {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    return strpos($ua, 'facebookexternalhit') === 0;
}
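Building on that check, the bot branch can then return a stripped-down document containing only the Open Graph tags, keeping the scraped page well under the size where the scraper fails. This is a sketch; `pageForUserAgent` and the og: values are illustrative placeholders, not the site's real code:

```php
<?php
// Hypothetical sketch: serve only the scraper metadata to Facebook's
// crawler, and the full (large) book HTML to everyone else.
function pageForUserAgent($userAgent) {
    if (strpos($userAgent, 'facebookexternalhit') === 0) {
        // Bot branch: a tiny page with only the Open Graph tags
        // (placeholder values shown).
        return '<html><head>'
             . '<meta property="og:title" content="Example Book" />'
             . '<meta property="og:type" content="book" />'
             . '</head><body></body></html>';
    }
    // Real visitor: the full book content (stubbed out here).
    return '<html><body><!-- full book HTML --></body></html>';
}

echo pageForUserAgent(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '');
```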

