How do web crawlers handle JavaScript


Problem Description


Today a lot of content on the Internet is generated using JavaScript (specifically via background AJAX calls). I was wondering how web crawlers like Google's handle it. Are they aware of JavaScript? Do they have a built-in JavaScript engine? Or do they simply ignore all JavaScript-generated content on the page (which I guess is quite unlikely)? Do people use specific techniques to get content indexed that would otherwise only be available to a normal Internet user through background AJAX requests?

Recommended Answer


JavaScript is handled by both Bing and Google crawlers. Yahoo uses the Bing crawler data, so it should be handled as well. I didn't look into other search engines, so if you care about them, you should look them up.


Bing published guidance in March 2014 on how to create JavaScript-based websites that work with its crawler (mostly related to pushState); the advice is good practice in general:

  • Avoid creating broken links with pushState
  • Avoid creating two different links that link to the same content with pushState
  • Avoid cloaking. (Here's an article Bing published about their cloaking detection in 2007)
  • Support browsers (and crawlers) that can't handle pushState.
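As a rough illustration of the pushState advice above (the helper names and paths here are hypothetical, not from the original answer), a site can keep every pushState URL backed by a real, crawlable `href`, and normalize paths so two different-looking links never point at the same content:

```javascript
// Sketch: progressive enhancement with pushState.
// Every in-app URL is also a plain <a href="..."> link, so crawlers
// (and browsers without pushState) can simply follow the href.

// Normalize a path so "/news/" and "/news" don't become two different
// links to the same content (one of Bing's recommendations above).
function normalizePath(path) {
  const noHash = path.split('#')[0];
  return noHash.length > 1 && noHash.endsWith('/')
    ? noHash.slice(0, -1)
    : noHash;
}

// Browser-only part: intercept clicks on internal links and navigate
// with pushState instead of a full page load.
if (typeof document !== 'undefined') {
  document.addEventListener('click', (event) => {
    const link = event.target.closest('a[href^="/"]');
    if (!link) return;
    event.preventDefault();
    history.pushState({}, '', normalizePath(link.getAttribute('href')));
    // A hypothetical loadContent(...) would fetch and render the
    // corresponding page fragment here.
  });
}
```

Because each link still has a working `href`, a crawler that ignores the click handler simply requests the normal URL and gets the same content.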


Google later published guidance in May 2014 on how to create JavaScript-based websites that work with its crawler; its recommendations are also good practice in general:


  • Don't block the JavaScript (and CSS) in the robots.txt file.
  • Make sure you can handle the load of the crawlers.
  • It's a good idea to support browsers and crawlers that can't handle (or users and organizations that won't allow) JavaScript.
  • Tricky JavaScript that relies on arcane or specific features of the language might not work with the crawlers.
  • If your JavaScript removes content from the page, it might not get indexed.
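To make the first point concrete, here is a minimal robots.txt sketch (the paths are hypothetical) that keeps script and style assets fetchable so the crawler can render the page:

```
User-agent: *
# Keep JS and CSS crawlable so the page can be rendered for indexing.
Allow: /static/js/
Allow: /static/css/
Disallow: /admin/

# Anti-pattern: "Disallow: /static/js/" would block rendering,
# and JavaScript-generated content might not get indexed.
```

A quick way to check for this problem is to fetch your own robots.txt and confirm no Disallow rule covers the directories your scripts and stylesheets are served from.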

