http.get and ISO-8859-1 encoded responses
Question
I'm writing an RSS feed fetcher and I'm stuck on some charset problems.
Loading and parsing the feed was quite easy compared to the encoding. I'm loading the feed with http.get and concatenating the chunks on every data event. Later I parse the whole string with the npm library feedparser, which works fine with the given string.
Sadly, I'm used to functions like utf8_encode() in PHP and I miss them in node.js, so I'm stuck with Iconv, which is currently not doing what I want.
Without any conversion, several characters show up as the UTF-8 "?" replacement icon because of the wrong charset; with Iconv, the string is parsed incorrectly :/
Currently I'm encoding every string separately:
    // var encoding ≈ ISO-8859-1 etc. (it is the right one, checked with docs etc.)
    // Shortened version: parser, encoding and Articles are defined elsewhere
    var Iconv = require('iconv').Iconv;
    var iconv = new Iconv(encoding, 'UTF-8');
    parser.on('article', function (article) {
        // Convert each field separately from the source encoding to UTF-8
        var object = {
            title: iconv.convert(article.title).toString('UTF-8'),
            description: iconv.convert(article.summary).toString('UTF-8')
        };
        Articles.push(object);
    });
Should I convert the encoding on the data buffers as they arrive, or later on the complete string?
Thanks!
PS: The encoding is determined by parsing the head of the XML.
How about a module that makes encoding in node.js easier?
Recommended answer
See https://groups.google.com/group/nodejs/browse_thread/thread/b2603afa31aada9c
The solution seems to be to set the response encoding to binary before processing the Buffer with Iconv.
The relevant bit is:
set response.setEncoding('binary') and aggregate the chunks into a buffer before calling Iconv.convert(). Note that encoding=binary means your data callback will receive Buffer objects, not strings.
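The advice above boils down to: keep the raw bytes, join them, and decode exactly once. Here is a minimal sketch of that idea, with some assumptions spelled out. It uses Node's built-in 'latin1' decoding (Node's name for ISO-8859-1) instead of Iconv, and decodeBody and fetchFeed are hypothetical helper names, not part of any library. Also note that in current Node.js versions the data callback receives Buffer objects when setEncoding() is not called at all, whereas setEncoding('binary') delivers strings, so the sketch simply leaves setEncoding() out.

```javascript
const http = require('http');

// Hypothetical helper: join the raw chunks first, then decode once.
// 'latin1' is Node's built-in name for ISO-8859-1.
function decodeBody(chunks, encoding) {
  return Buffer.concat(chunks).toString(encoding);
}

// Hypothetical helper: fetch a URL and pass the decoded body to a callback.
function fetchFeed(url, encoding, callback) {
  http.get(url, function (response) {
    const chunks = [];
    // setEncoding() is not called, so each chunk arrives as a Buffer.
    response.on('data', function (chunk) { chunks.push(chunk); });
    response.on('end', function () {
      callback(null, decodeBody(chunks, encoding));
    });
    response.on('error', callback);
  }).on('error', callback);
}
```

The decoded string can then be handed to the feed parser as ordinary JavaScript text. For encodings Node does not decode natively, the same pattern should apply with Iconv.convert() called on the concatenated Buffer.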
Update: this was my initial reply
Are you sure that the feed you are receiving has been encoded correctly?
I can see two possible errors:
- the feed is being sent with Latin-1-encoded data, but with a Content-Type that states charset=UTF-8.
- the feed is being sent with UTF-8-encoded data, but the Content-Type header does not state anything, defaulting to ASCII.
You should check the content of your feed and the sent headers with some utility like Wireshark or cURL.
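The same check can also be roughed out in Node itself by comparing what the Content-Type header claims with what the XML declaration claims. This is only a sketch: charsetFromContentType and charsetFromXmlDeclaration are hypothetical helper names, the regular expressions cover just the common cases, and the 200-byte window for the declaration is an arbitrary assumption.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
function charsetFromContentType(header) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(header || '');
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: read the encoding from the XML declaration.
// The declaration uses only ASCII characters, which Latin-1 and UTF-8
// encode identically, so decoding the first bytes as latin1 works for
// those two encodings.
function charsetFromXmlDeclaration(body) {
  const head = body.subarray(0, 200).toString('latin1');
  const match = /<\?xml[^>]*encoding\s*=\s*["']([^"']+)["']/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}
```

If the two values disagree, or the header carries no charset at all, the feed is probably affected by one of the two errors listed above.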