What's the least redundant way to make a site with JavaScript-generated HTML crawlable?


Question


After reading Google's policy on making Ajax-generated content crawlable, along with many developers' blog posts and Stack Overflow Q&A threads on the subject, I'm left with the conclusion that there is no way to make a site with only JavaScript/Ajax-generated HTML crawlable. A site I'm currently working on isn't getting a fair amount of its content indexed. All of the presentation layer for our non-indexed content is built in JavaScript by generating HTML from JSON returned from Ajax-based web service calls, and we believe Google is not indexing the content because of that. Is that correct?


The only solution seems to be to also have a "fall-back" version of the site for search engines (specifically Google) where all the HTML and content would be generated as it traditionally has been, on the server-side. For clients with JavaScript enabled, it seems that we could use essentially the same approach that we do now: using JavaScript to generate HTML from asynchronously loaded JSON.


Reading around, my understanding is that the current best practice for applying the DRY principle in creating crawlable Ajax-generated websites as described above is to use a templating engine that can use the same templates on the client-side and the server-side. For clients with JavaScript enabled, the client-side templating engine, for example mustache.js, would transform JSON data sent from the server into HTML as defined by its copy of a template file. And for search crawlers and clients with JavaScript disabled, the server-side implementation of the same templating engine, for example mustache.java, would similarly operate on its copy of the same exact template file to output HTML.
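For concreteness, here is a minimal sketch of what that shared-template setup might look like on the client side, assuming jQuery and mustache.js are loaded; the endpoint, JSON fields, template, and element IDs are made up for illustration and are not from our actual site:

```javascript
// item.mustache – the single template file shared by client and server:
//   <li class="product">{{name}} – ${{price}}</li>

// Client side (JavaScript enabled): fetch JSON and render with mustache.js.
$.getJSON('/api/products', function (products) {
  var template = document.getElementById('item-template').innerHTML;
  var html = products.map(function (p) {
    return Mustache.render(template, p); // same template the server would use
  }).join('');
  document.getElementById('product-list').innerHTML = html;
});

// Server side (crawlers / JavaScript disabled): the same .mustache file would
// be compiled by mustache.java (or another spec-compliant implementation) and
// rendered with the same JSON before the page is sent down.
```

The appeal is that the .mustache file is the only place the markup lives; both rendering paths consume it unchanged.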


If that solution is correct, then how is this different from the approaches used 4 or 5 years ago by front-end-heavy sites, where sites essentially had to maintain two copies of the templating code: one copy for users with JavaScript enabled (nearly everyone) and another copy (e.g. in FreeMarker or Velocity) for search engines and browsers without JavaScript enabled (nearly no one)? It seems like there should be a better way.


Does this imply that two templating model layers would need to be maintained, one on the client-side and one on the server-side? How advisable is it to combine those client-side templates with a front-end MVC (MV/MVVC) framework like Backbone.js, Ember.js, or YUI App Library? How do these solutions affect maintenance costs? Would it be better to try doing this without introducing more frameworks -- a new templating engine and a front-end MVC framework -- into a development team's technology stack? Is there a way to do this less redundantly?
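To make the framework question concrete, the kind of wiring we'd be looking at with Backbone.js plus a shared Mustache template is roughly the following; the model fields and the template lookup are illustrative only, not our real code:

```javascript
// A Backbone view that renders the same Mustache template the server would use.
// Assumes the template lives in a <script type="text/template" id="item-template"> tag.
var Product = Backbone.Model.extend({});

var ProductView = Backbone.View.extend({
  tagName: 'li',
  template: document.getElementById('item-template').innerHTML,

  render: function () {
    // Mustache stays the single source of markup; Backbone only manages
    // model state and re-rendering.
    this.$el.html(Mustache.render(this.template, this.model.toJSON()));
    return this;
  }
});

new ProductView({ model: new Product({ name: 'Widget', price: 9.99 }) })
  .render().$el.appendTo('#product-list');
```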


If that solution isn't correct, then is there something we're missing and could be doing better with our JavaScript to keep our existing asynchronous HTML-from-JSON structure and get it indexed, so we don't need to introduce something new to the architecture stack? We'd really rather not have to update two versions of the presentation layer when business needs change.

Answer


Why didn't I think of this before! Just use http://phantomjs.org. It's a headless WebKit browser. You'd just build a set of actions to crawl the UI and capture the HTML at every state you'd like. Phantom can turn the captured HTML into .html files for you and save them to your web server.


The whole thing would be automated to run on every build/commit (PhantomJS is command-line driven). The JS code you write to crawl the UI would break as you change the UI, but it shouldn't be any worse than automated UI testing, and it's just JavaScript, so you can use jQuery selectors to grab buttons and click them.
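A bare-bones sketch of the kind of PhantomJS script that build step could run is below; the URL, output path, and the fixed two-second wait are placeholder assumptions, not values from any real setup:

```javascript
// snapshot.js – load a page, let Ajax/templating finish, and save the rendered HTML.
var page = require('webpage').create();
var fs = require('fs');

page.open('http://localhost:8080/products', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
  }
  // Give the Ajax calls and client-side templating time to run before
  // capturing the fully rendered DOM. A real script would poll for a
  // "ready" marker instead of sleeping.
  window.setTimeout(function () {
    fs.write('snapshots/products.html', page.content, 'w');
    phantom.exit();
  }, 2000);
});
```

The build would invoke it with `phantomjs snapshot.js` and publish the saved files alongside the normal pages for crawlers to fetch.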


If I had to solve the SEO problem, this is definitely the first approach I'd prototype. Crawl and save, baby. Yessir.
