HTML Snapshot for crawler - Understanding how it works


Question

I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET".

I want to check whether I have understood it correctly:

I create a PHP script (gethtmlsnapshot.php) that includes the server-side AJAX page (getdata.php) and escapes (for security) the parameters. Then I add it at the end of the static HTML page (index-movies.html). Right? Now...

1 - Where do I put that gethtmlsnapshot.php? In other words, I need to call that page (or better, the crawler needs to). But if there is no link to it on the main page, the crawler can't call it :O How can the crawler call the page with the _escaped_fragment_ parameters? It can't know them if I don't specify them somewhere :)

2 - How can the crawler call that page with the parameters? As before, I need a link to that script with the parameters, so the crawler can browse each page and save the content of the dynamic result.
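For illustration, a rough sketch of what such a gethtmlsnapshot.php might look like (the _escaped_fragment_ handling and the way getdata.php is expected to pick up the page name are assumptions, not working code):

<?php
// gethtmlsnapshot.php - rough sketch (parameter handling here is an assumption)
// The crawler is expected to request this script with ?_escaped_fragment_=...
$fragment = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '';

// Escape/whitelist the value for safety before using it
$page = preg_replace('/[^a-zA-Z0-9_=&-]/', '', $fragment);

// getdata.php (the existing server-side AJAX page) is assumed to read $page
// and echo the dynamic HTML for that view
include 'getdata.php';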



Can you help me? And what do you think about this technique? Wouldn't it be better if the developers of crawlers built their bots in some other way? :)



Let me know what you think. Cheers

Solution

I think you got something wrong, so I'll try to explain what's going on here, including the background and the alternatives, as this is indeed a very important topic that most of us stumble upon (or at least something similar) from time to time.



Using AJAX, or rather asynchronous incremental page updating (because most pages actually don't use XML but JSON), has enriched the web and provided a great user experience.



It has, however, also come at a price.



The main problem used to be clients that didn't support the XMLHttpRequest object or JavaScript at all. In the beginning you had to provide backwards compatibility. This was usually done by providing links, capturing the onclick event, and firing an AJAX call instead of reloading the page (if the client supported it).



Today almost every client supports the necessary functions.



So the problem today is search engines, because they don't. Well, that's not entirely true, because they partly do (especially Google), but for other purposes. Google evaluates certain JavaScript code to prevent black-hat SEO (for example, a link pointing somewhere but with JavaScript opening a completely different webpage, or HTML keyword code that is invisible to the client because it is removed by JavaScript, or the other way round).

But keeping it simple, it's best to think of a search-engine crawler as a very basic browser with no CSS or JS support (it's the same with CSS: it is only partly parsed, for special reasons).

So if you have "AJAX links" on your website, and the web crawler doesn't support following them using JavaScript, they just don't get crawled. Or do they? Well, the answer is that JavaScript links (like document.location and so on) do get followed; Google is often intelligent enough to guess the target. But AJAX calls are not made, simply because they return partial content, and no meaningful whole page can be constructed from it, as the context is unknown and the unique URI doesn't represent the location of the content.



So there are basically three strategies to work around that.


  1. Have an onclick event on links with a normal href attribute as fallback (imo the best option, as it solves the problem for clients as well as for search engines).

  2. Submit the content pages via your sitemap so they get indexed, but completely apart from your site links (usually pages provide a permalink to these URLs so that external pages link to them for the PageRank).
  3. The AJAX crawling scheme.

The idea is to tie your JavaScript XMLHttpRequest calls to corresponding href attributes that look like this:
www.example.com/ajax.php#!key=value



So the link looks like:

<a href="http://www.example.com/ajax.php#!page=imprint" onclick="handleajax()">go to my imprint</a>

The handleajax function could evaluate the document.location variable to fire the incremental asynchronous page update. It's also possible to pass an id, a URL, or whatever else instead.



The crawler, however, recognises the AJAX crawling scheme format and automatically fetches http://www.example.com/ajax.php?_escaped_fragment_=page=imprint instead of http://www.example.com/ajax.php#!page=imprint. That way the query string contains the hash fragment, and from it you can tell which partial content has been requested. So you just have to make sure that http://www.example.com/ajax.php?_escaped_fragment_=page=imprint returns a full page that looks exactly the way the page should look to the user after the XMLHttpRequest update has been made.



A very elegant solution is also to pass the anchor object itself to the handler function, which then fetches via AJAX the same URL the crawler would have fetched, just with additional parameters. Your server-side script then decides whether to deliver the whole page or just the partial content.
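As a rough sketch of that idea, ajax.php could branch on the presence of _escaped_fragment_; render_full_page() and render_partial() below are hypothetical stand-ins for whatever your templates or getdata.php actually do:

<?php
// ajax.php - sketch of serving both the crawler snapshot and the AJAX partial.
// render_partial() and render_full_page() are placeholder helpers, not real APIs.

function render_partial($page) {
    // placeholder: return only the dynamic fragment for $page
    return '<div id="content">content of ' . htmlspecialchars($page) . '</div>';
}

function render_full_page($page) {
    // placeholder: wrap the fragment in the full page layout
    return '<html><body><h1>My site</h1>' . render_partial($page) . '</body></html>';
}

if (isset($_GET['_escaped_fragment_'])) {
    // Crawler request: #!page=imprint arrives as ?_escaped_fragment_=page=imprint
    parse_str($_GET['_escaped_fragment_'], $params);
    $page = isset($params['page']) ? $params['page'] : 'home';
    echo render_full_page($page);   // full HTML snapshot, as the user would see it after the update
} elseif (isset($_GET['page'])) {
    // XMLHttpRequest from handleajax(): return only the partial content
    echo render_partial($_GET['page']);
} else {
    echo render_full_page('home');  // plain request without a fragment
}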



It's a very creative approach indeed, and here is my personal pro/con analysis:

pro:

  • Partially updated pages receive a unique identifier, at which point they are fully qualified resources in the semantic web.
  • Partially updated websites receive a unique identifier that can be presented by search engines.


con:


  • It's just a fallback solution for search engines, not for clients without JavaScript.
  • It provides opportunities for black-hat SEO, so Google for sure won't adopt it fully, or rank pages using this technique highly, without proper verification of the content.



Conclusion:


  • Usual links with fallback, legacy-working href attributes plus an onclick handler are the better approach, because they also provide functionality for old browsers.

  • The main advantage of the AJAX crawling scheme is that partially updated pages get a unique URI, and you don't have to create duplicate content that somehow serves as the indexable and linkable counterpart.

  • You could argue that an AJAX crawling scheme implementation is more consistent and easier to implement. I think this is a question of your application design.
