Open iso-8859-1 encoded html with nokogiri messes up accents


Problem description




I'm trying to make some changes to an HTML page encoded with charset=iso-8859-1:

doc = Nokogiri::HTML(open(html_file))

Calling puts doc.to_html messes up all the accents in the page, so if I save it back it looks broken in the browser as well.

I'm still on Rails 3.0.6... Any hints how to fix this problem?

Here's one of the pages suffering from that for example: http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html

I've asked also in Github but I have the feeling this will be faster. I'll update both places if I get a cure for the problem.

UPDATE 1 (24 March 2012)

Thanks for the comments. I managed to partially solve this issue. I believe this has nothing to do with Nokogiri, however. As I mentioned in a comment, I only need to open and save the file to get the accents messed up.

The closest to a fix I got is doing this:

# Read the whole file; File.read closes the handle automatically
text = File.read(html_file)
doc = Nokogiri::HTML(text)
# ... do any stuff with nokogiri
File.open(html_file, 'w') {|f| f.write(doc.to_html) }

The original file came as iso-8859-1; the saved one goes out as utf-8 and mostly looks OK. Accents are in place, except for the accents in the capital letters :-P I get question marks like in Econom�a, where there should be í (i with an accent).

Getting closer I think. If someone has a hint to cover the capital letters as well it might be almost done.

Cheers.

Solution

The method you used to download the file may have changed the encoding, breaking the accents in the file. Try this to see it working correctly:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = 'http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html'
# Note: on Ruby 3.0+, open-uri no longer patches Kernel#open; use URI.open(url) there
doc = Nokogiri::HTML(open(url))
File.open("1331108705.html", "w") {|f| f.write(doc.to_html)}
system('open', '1331108705.html') # on Mac OS X, this will open the html file in your browser

How did you download the file?
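
To illustrate why a changed or misdeclared encoding breaks the accents, here is a minimal sketch using only the Ruby standard library (no Nokogiri involved). It assumes the file really is ISO-8859-1; the sample word, the temporary file, and the helper name read_latin1_as_utf8 are made up for illustration and are not part of the original question.

```ruby
require 'tempfile'

# Bytes for "Economía" as stored in an ISO-8859-1 file: í is the single byte 0xED.
latin1_bytes = "Econom\xEDa".b

# Misreading: the bytes are merely relabelled as UTF-8, not converted.
# A lone 0xED is not a valid UTF-8 sequence, which is what shows up as "?" or "�".
misread = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
puts misread.valid_encoding?

# Correct reading: label the bytes as ISO-8859-1, then transcode to UTF-8.
utf8 = latin1_bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts utf8.valid_encoding?

# Hypothetical helper: read an ISO-8859-1 file and get a UTF-8 string back.
# The "external:internal" encoding pair makes Ruby transcode while reading.
def read_latin1_as_utf8(path)
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

# Round trip through a temporary file standing in for the downloaded page.
from_disk = Tempfile.create('page') do |f|
  f.binmode
  f.write(latin1_bytes)
  f.flush
  read_latin1_as_utf8(f.path)
end
puts from_disk == utf8
```

If the text read this way looks right before it is handed to Nokogiri, the download or read step was probably where the encoding got lost.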
