readHTMLTable and UTF-8 encoding

Problem description

I have an encoding problem with readHTMLTable, and with the XML package in general. I would like to download some tables from the Polish site allegro.pl (an auction site similar to eBay), but the Polish characters come out garbled, even when I pass encoding="UTF-8" or stringsAsFactors=F to readHTMLTable.

Code:

library(XML)

# One URL per results page (pages 1 to 5)
url <- paste("http://allegro.pl/listing.php/search?category=15821&sg=0&p=", 1:5, "&string=facebook", sep = "")

alldata <- NULL

# Read the "lista" table from each page and stack the rows
for (i in 1:5) {
  dane <- as.data.frame(readHTMLTable(url[i], 1, stringsAsFactors = TRUE, encoding = "UTF-8")$lista)
  alldata <- rbind(alldata, dane)
}

Result:

> head(alldata[,c(2,3)])
                                                        V2                      V3
1     Facebook Fan Page z ANIMACJĄ indywidualny projekt Kup Teraz! 150,00 zł
2 Lubię to! Facebook! OKAZJA!!! 160 FANĂÂ"W!!! ZOBACZ!  Kup Teraz! 10,99 zł
3    125 fanĂÂłw fani like fanpage FACEBOOK polskie konta  Kup Teraz! 10,00 zł
4    Reklama Fanpage 43500+ fanĂÂłw, fani, facebook Efekt  Kup Teraz! 17,99 zł
5       Facebook Fanpage -Stworzenie Profesjonalnego Konta  Kup Teraz! 77,90 zł
6       Facebook Fanpage -Skuteczna Obsługa/Reklama /FV Kup Teraz! 100,00 zł

If I use getURL or readLines there is no problem, but I want to use the XML package because it's great :)

This problem always occurs when I use XML package functions such as htmlParse, xpathApply, or the readHTMLTable mentioned above.

I am working in RStudio 0.94.110 on Windows 7. The sessionInfo() output is below.

R version 2.14.0 (2011-10-31)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250    LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] spdep_0.5-41     coda_0.14-6      deldir_0.0-16    maptools_0.8-10  foreign_0.8-46   nlme_3.1-102     Matrix_1.0-1     lattice_0.20-0   boot_1.3-3      
[10] sp_0.9-91        maps_2.2-2       RCurl_1.7-0.1    bitops_1.0-4.1   XML_3.4-2.2      Cairo_1.5-1      car_2.0-11       survival_2.36-10 nnet_7.3-1      
[19] MASS_7.3-16     

loaded via a namespace (and not attached):
[1] grid_2.14.0  tools_2.14.0
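The session locale is Polish_Poland.1250 while the page is requested as UTF-8, so one plausible source of the garbling is UTF-8 bytes being read as Windows-1250 somewhere along the way. The base R sketch below only illustrates that basic mechanism on a single word; the variable names are made up for illustration. The real output above contains an extra character ("ĂÂł"), which may point to more than one conversion step, so treat this as an illustration rather than a diagnosis.

```r
# "fanów" written with a \u escape so the script itself stays ASCII-only.
original <- "fan\u00f3w"

# Assumption for illustration: the UTF-8 bytes are misread as Windows-1250.
# iconv() works on the raw bytes, so this reinterprets them under CP1250.
garbled <- iconv(original, from = "CP1250", to = "UTF-8")
print(garbled)

# Reversing the step: write the characters back out as CP1250 bytes,
# then declare those bytes to be UTF-8 again.
repaired <- iconv(garbled, from = "UTF-8", to = "CP1250")
Encoding(repaired) <- "UTF-8"
print(repaired)
```

If the two-byte sequence for "ó" shows up as two separate accented letters, the text was most likely decoded with a single-byte code page at some point.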

Recommended answer

For some time I exchanged emails with Duncan Temple Lang, the creator of the XML package. Yesterday (30.01.2012) he uploaded a new version of the XML package to the Omegahat website. The new version, 3.9-4, for the 32-bit version of R removes this encoding problem! :)

The package can be downloaded from: http://www.omegahat.org/R/bin/windows/contrib/2.14/

library(XML)
url<-paste("http://allegro.pl/listing.php/search?category=15821&sg=0&p=",1:5,"&string=facebook",sep="")
doc = htmlParse(url[1], encoding = "UTF-8")
z = as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE)$lista)

It works, so we can close this topic. :)
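To check whether an installed version of the XML package handles UTF-8 correctly without depending on the live site, a small offline test along these lines might help. This is only a sketch: the inline HTML string is a hypothetical stand-in for the allegro.pl listing, and it reuses the table id "lista" purely for continuity with the code above.

```r
library(XML)

# Hypothetical one-row table standing in for the real listing page.
# \u escapes keep the script ASCII-only; the strings are UTF-8 in memory.
html <- paste0(
  "<html><head><meta charset='utf-8'></head><body>",
  "<table id='lista'>",
  "<tr><td>125 fan\u00f3w</td><td>10,00 z\u0142</td></tr>",
  "</table></body></html>"
)

# Parse from the string (asText = TRUE) and state the encoding explicitly.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# header = FALSE keeps the single row as data, giving columns V1 and V2.
tab <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)$lista
print(tab)
```

If the printed cells show "fanów" and "zł" intact, the parsing step is handling the encoding as expected on that machine.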
