Node.JS scrape encoding?

Question

I'm fetching this page with the request library in Node.js and parsing the body using cheerio.

Calling $.html() on the parsed response body shows that the page's title element is:

<title>Le Relais de l'Entrec?te</title>

... when it should be:

<title>Le Relais de l'Entrecôte</title>

I've tried setting the options for the request library to include encoding: 'utf8', but that didn't seem to change anything.

How do I preserve these characters?

Solution

The page appears to be encoded as iso-8859-1. You'll need to tell request to hand you back a raw, undecoded Buffer by passing encoding: null, and then use something like node-iconv to convert it.
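To illustrate why encoding: 'utf8' does not help here, the following minimal sketch decodes the same bytes in two ways. The three-byte sample is an assumption made up for illustration (the letters "côt" as an ISO-8859-1 server would send them), not data captured from the actual page.

```javascript
// Sample bytes for "côt" in ISO-8859-1 (illustrative, not taken from the real response).
var raw = Buffer.from([0x63, 0xf4, 0x74]);

// Decoded as UTF-8, the lone 0xF4 byte is not a valid sequence,
// so it is replaced with U+FFFD (the replacement character).
var asUtf8 = raw.toString('utf8');

// Decoded as latin1 (Node's name for ISO-8859-1), each byte maps to one character.
var asLatin1 = raw.toString('latin1');

console.log(JSON.stringify(asUtf8), JSON.stringify(asLatin1));
```

Once the body has been decoded with the wrong charset, the original byte is gone, which is why the conversion has to start from the raw Buffer.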

If you're writing a generalized crawler, you'll have to figure out how to detect the encoding of each page you encounter in order to decode it correctly. Otherwise, the following should work for your case:

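var request = require('request');
var iconv = require('iconv');

request.get({
  url: 'http://www.relaisentrecote.fr',
  encoding: null // return the body as a raw Buffer instead of a decoded string
}, function(err, res, body) {
  if (err) {
    console.error(err);
    return;
  }
  // Convert the ISO-8859-1 bytes to UTF-8, then decode them into a string.
  var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
  var buf = ic.convert(body);
  var utf8String = buf.toString('utf-8');
  // .. do something with utf8String ..
});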
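For the generalized-crawler case, one possible starting point is to read the charset declared in the Content-Type response header and decode the raw Buffer accordingly. The sketch below is only an illustration under some assumptions: the helper names charsetFromContentType and decodeBody are hypothetical, it relies on the global TextDecoder available in recent Node.js versions built with full ICU, and it ignores charsets declared only in a meta tag or declared incorrectly by the server.

```javascript
// Hypothetical helper: extract the charset label from a Content-Type header value.
// Returns null when no charset parameter is present.
function charsetFromContentType(contentType) {
  var match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: decode a raw body Buffer using the declared charset,
// falling back to UTF-8 when none is declared or the label is not supported.
function decodeBody(buffer, contentType) {
  var label = charsetFromContentType(contentType) || 'utf-8';
  var decoder;
  try {
    decoder = new TextDecoder(label);
  } catch (e) {
    // TextDecoder throws a RangeError for labels it does not recognize.
    decoder = new TextDecoder('utf-8');
  }
  return decoder.decode(buffer);
}

// Usage with sample bytes for "Entrecôte" in ISO-8859-1 (illustrative data).
var sample = Buffer.from([0x45, 0x6e, 0x74, 0x72, 0x65, 0x63, 0xf4, 0x74, 0x65]);
console.log(decodeBody(sample, 'text/html; charset=ISO-8859-1'));
```

In the request callback above, this would be called as decodeBody(body, res.headers['content-type']), still with encoding: null so that body arrives as a Buffer. Note that TextDecoder treats the iso-8859-1 label as windows-1252, which differs from strict ISO-8859-1 only in the 0x80 to 0x9F range.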
