Nokogiri producing different results on heroku?


Problem description




I'm having a very strange problem and I'd appreciate help tracking it down.

I'm using the nokogiri gem to parse some HTML, and one file I am parsing has a weird character in it. I'm not entirely sure what this character is; in vim it shows as ^Q.

On my own computer, everything works fine, however on heroku it inserts a </body></html><html> when it hits the character and selectors only return the elements before the weird character.

To illustrate: Nokogiri::HTML(open("http://thoms.net.nz/e2.html")).css("body div").count is 1 on heroku, and 2 on my computer. The file containing this character can be downloaded from http://thoms.net.nz/e2.html.

Both my computer and heroku are running nokogiri 1.5.5 with ruby 1.9.3.

Solution

The ^Q is a software flow-control character (XON, ASCII 0x11, also called DC1), which isn't supposed to be in HTML. I suspect its unexpected presence is confusing both Nokogiri and Heroku, but in different ways.

HTML documents from the wilds of the internet can be corrupted in any number of ways. I've seen all sorts of garbage in them, and if I couldn't make sense of it using iconv or a Unicode transliteration, I'd resort to a quick global search and replace to remove anything not in the normal ASCII range before further processing.


In Ruby, global search and replace uses String#gsub.

# Remove the ^Q (XON, U+0011) character before handing the markup to Nokogiri.
doc = Nokogiri::HTML(html.gsub("\u0011", ''))
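
If other stray control characters might turn up, the same gsub idea can be widened. The following is only a rough sketch: strip_control_chars and printable_ascii_only are hypothetical helper names, not part of Nokogiri, and the code assumes html is a String with a valid encoding.

```ruby
# Hypothetical helpers for illustration; neither is part of Nokogiri.

# Removes C0 control characters and DEL, but keeps tab, newline and
# carriage return, and leaves non-ASCII text untouched.
def strip_control_chars(html)
  html.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/, '')
end

# More aggressive: keeps only printable ASCII plus tab, newline and
# carriage return, so accented letters and other non-ASCII text are lost.
def printable_ascii_only(html)
  html.gsub(/[^\x20-\x7E\t\r\n]/, '')
end

# Sample markup with an accented letter and a stray ^Q between two divs.
sample = "<div>caf\u00E9</div>\u0011<div>two</div>\n"
puts strip_control_chars(sample).inspect
puts printable_ascii_only(sample).inspect
```

Either result could then be passed on, for example doc = Nokogiri::HTML(strip_control_chars(html)). The first helper is probably the safer default, since it does not throw away legitimate non-ASCII text; the second is closer to the "anything not in the normal ASCII range" cleanup described above.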
