Node.JS scrape encoding?

Views: 131 Published: 2017/8/16 22:26:04 Tags: node.js unicode encoding


Question


I'm fetching this page with the request library in Node.JS, and parsing the body using cheerio.

Calling $.html() on the parsed response body reveals that the page's title element is:

<title>Le Relais de l'Entrec?te</title>

... when it should be:

<title>Le Relais de l'Entrecôte</title>

I've tried setting the options for the request library to include encoding: 'utf8', but that didn't seem to change anything.

How do I preserve these characters?

Solution

The page appears to be encoded as ISO-8859-1. You'll need to tell request to hand you back a raw, undecoded Buffer by passing encoding: null, then use something like node-iconv to convert it.
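To see why the character gets lost, here is a minimal sketch that uses only Node's built-in Buffer and makes no network request. The byte values are an illustration of how "Entrecôte" would be stored in ISO-8859-1; they are assumed, not captured from the actual page.

```javascript
// "Entrecôte" as ISO-8859-1 bytes: 'ô' is the single byte 0xF4.
// (Illustrative bytes, not taken from the real response.)
var raw = Buffer.from([0x45, 0x6e, 0x74, 0x72, 0x65, 0x63, 0xf4, 0x74, 0x65]);

// Decoding as UTF-8: 0xF4 followed by 't' is not a valid UTF-8 sequence,
// so the decoder substitutes the replacement character U+FFFD.
var asUtf8 = raw.toString('utf8');

// Decoding as latin1 (Node's name for ISO-8859-1) maps each byte to one character.
var asLatin1 = raw.toString('latin1');

console.log(asUtf8 === asLatin1); // false: the UTF-8 decode lost the 'ô'
console.log(asLatin1);
```

This is also why setting encoding: 'utf8' does not help: the bytes are decoded with the wrong charset before you ever see them.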

If you're writing a generalized crawler, you'll have to figure out how to detect the encoding of each page you encounter in order to decode it correctly. Otherwise, the following should work for your case:

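var request = require('request');
var iconv = require('iconv');

request.get({
  url: 'http://www.relaisentrecote.fr',
  encoding: null, // return the body as a raw Buffer instead of a decoded string
}, function(err, res, body) {
  if (err) throw err;
  // Convert the ISO-8859-1 bytes to UTF-8 bytes, then decode them into a string.
  var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
  var buf = ic.convert(body);
  var utf8String = buf.toString('utf-8');
  // .. do something with utf8String ..
});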
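
For the generalized-crawler case, one possible starting point is to read the charset the server or the page declares. The sketch below is not part of the original answer: detectCharset is a hypothetical helper, the 1024-byte scan window and the 'utf-8' fallback are assumptions, and pages that declare the wrong charset (or none) would need a real detection library.

```javascript
// Hypothetical helper: pick a charset label from the Content-Type header,
// falling back to a <meta> tag near the start of the body.
function detectCharset(contentType, body) {
  var fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || '');
  if (fromHeader) return fromHeader[1].toLowerCase();

  // Meta tags are plain ASCII, so reading the head as latin1 is safe for this scan.
  // The 1024-byte window is an arbitrary assumption.
  var head = body.subarray(0, 1024).toString('latin1');
  var fromMeta = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(head);
  if (fromMeta) return fromMeta[1].toLowerCase();

  return 'utf-8'; // assumed default when nothing is declared
}
```

Inside the request callback, the result could replace the hard-coded source charset, for example new iconv.Iconv(detectCharset(res.headers['content-type'], body), 'utf-8'), provided the detected label is one that iconv recognizes.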
