Node.JS scrape encoding?
I'm fetching this page with this request library in Node.JS, and parsing the body using cheerio.
Calling $.html() on the parsed response body reveals that the title element for the page is:
<title>Le Relais de l'Entrec?te</title>
... when it should be:
<title>Le Relais de l'Entrecôte</title>
I've tried setting the options for the request library to include encoding: 'utf8', but that didn't seem to change anything.
How do I preserve these characters?
The page appears to be encoded with iso-8859-1. You'll need to tell request to hand you back an un-encoded buffer by passing encoding: null, and then use something like node-iconv to convert it.
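To see why the raw buffer matters, here is a minimal sketch using only Node's built-in Buffer, with no network access. The variable names are illustrative and not part of the original answer. It shows that the single ISO-8859-1 byte for "ô" is not valid UTF-8, so decoding it as UTF-8 loses the character:

```javascript
// "Entrecôte" as ISO-8859-1 bytes: 'ô' (U+00F4) is the single byte 0xF4.
const latin1Bytes = Buffer.from('Entrec\u00f4te', 'latin1');

// Decoding as UTF-8 fails for that byte: 0xF4 starts a multi-byte sequence,
// but the next byte ('t') is not a continuation byte, so the decoder
// substitutes the replacement character U+FFFD.
const wrong = latin1Bytes.toString('utf8');

// Decoding with the encoding the bytes were actually written in works.
const right = latin1Bytes.toString('latin1');

console.log(wrong);
console.log(right);
```

Once the bytes have been decoded as UTF-8, the original character cannot be recovered from the string. That is presumably why the conversion has to start from the undecoded buffer.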
If you're writing a generalized crawler, you'll have to figure out how to detect the encoding of each page you encounter to decode it correctly. Otherwise, the following should work for your case:
var request = require('request');
var iconv = require('iconv');

request.get({
  url: 'http://www.relaisentrecote.fr',
  encoding: null // return the body as a raw Buffer instead of a decoded string
}, function(err, res, body) {
  if (err) throw err;
  // Convert the ISO-8859-1 bytes to UTF-8 before turning them into a string.
  var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
  var buf = ic.convert(body);
  var utf8String = buf.toString('utf-8');
  // .. do something with utf8String ..
});
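For the generalized-crawler case mentioned above, one possible starting point is to read the charset from the Content-Type header and fall back to a `<meta>` tag in the first bytes of the body. The sketch below is an assumption-laden illustration, not a complete detector: `sniffCharset` and `decodeBody` are hypothetical helper names, and only encodings that Node's Buffer understands natively are decoded here. Anything else would need a converter such as node-iconv.

```javascript
// Hypothetical helper: guess a page's charset from its Content-Type header,
// falling back to a <meta> declaration near the top of the body.
function sniffCharset(contentType, body) {
  const fromHeader = /charset\s*=\s*["']?([^\s;"']+)/i.exec(contentType || '');
  if (fromHeader) return fromHeader[1].toLowerCase();

  // Peek at the first 1024 bytes as latin1; the markup itself is ASCII,
  // so the <meta> tag is readable whatever the real encoding is.
  const head = body.subarray(0, 1024).toString('latin1');
  const fromMeta = /<meta[^>]+charset\s*=\s*["']?([^\s"'>\/;]+)/i.exec(head);
  if (fromMeta) return fromMeta[1].toLowerCase();

  return 'utf-8'; // assumed default when nothing is declared
}

// Hypothetical helper: decode a raw body Buffer using the sniffed charset.
function decodeBody(contentType, body) {
  const charset = sniffCharset(contentType, body);
  if (charset === 'iso-8859-1' || charset === 'latin1') {
    return body.toString('latin1');
  }
  if (Buffer.isEncoding(charset)) {
    return body.toString(charset);
  }
  // Other charsets need an external converter (for example node-iconv).
  throw new Error('Unsupported charset: ' + charset);
}

const sample = Buffer.from('<title>Le Relais de l\'Entrec\u00f4te</title>', 'latin1');
console.log(decodeBody('text/html; charset=ISO-8859-1', sample));
```

In the request callback, the header would be available as `res.headers['content-type']` and `body` would be the raw buffer obtained with `encoding: null`. Real pages can declare the wrong charset or none at all, so a production crawler would likely need a more robust detection strategy.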