R: rvest - is not proper UTF-8, indicate encoding?

Views: 202 | Posted: 2017/8/17 0:19:21 | Tags: r encoding utf-8 web-scraping rvest


Question


I'm trying out the "new" rvest package from Hadley Wickham.

I've used it in the past, so I expected everything to run smoothly.

However, I keep seeing this error:

> TV_Audio_Video_Marca <- read_html(page_source[[1]], encoding = "ISO-8859-1")
Error: Input is not proper UTF-8, indicate encoding !
Bytes: 0xCD 0x20 0x53 0x2E [9]

As you can see in the code, I've used the encoding ISO-8859-1. Before that I was using "UTF-8", but guess_encoding(page_source[[1]]) says that the encoding is ISO-8859-1. I've tried all the options suggested by guess_encoding, but none of them worked.

What is the problem?

My code:

library(RSelenium)
library(rvest)
#start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

#navigate to your page
remDr$navigate("http://www.linio.com.pe/tv-audio-y-video/televisores/")

#scroll down 5 times, waiting for the page to load at each time
for(i in 1:5){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

#get the page html
page_source<-remDr$getPageSource()

#parse it

TV_Audio_Video_Marca <- read_html(page_source[[1]], encoding = "UTF-16LE")

UPDATE 1

I've googled for "How to know the encoding of a web page?".

I found this Markup Validation Tool from the W3C, but it wasn't of great help:

http://validator.w3.org/check?uri=http://www.w3.org/2003/10/empty/emptydoc.html

Solution

Looking at the page source, they claim to be using UTF-8 encoding:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

So the question is: are they really using an encoding different enough that we need to worry about it, or can we just convert to UTF-8 and assume that any errors will be negligible?
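One way to get a feel for that question is to test the string directly. The following is a minimal sketch using only base R; `sample_html` is a hypothetical stand-in for `page_source[[1]]`, built around the byte sequence 0xCD 0x20 0x53 0x2E from the error message, so the real page may behave differently.

```r
# Hypothetical stand-in for page_source[[1]]: "<p>", then the bytes from the
# error message (0xCD 0x20 0x53 0x2E), then "</p>".
sample_html <- rawToChar(as.raw(c(
  0x3C, 0x70, 0x3E,          # <p>
  0xCD, 0x20, 0x53, 0x2E,    # bytes reported by the parser
  0x3C, 0x2F, 0x70, 0x3E     # </p>
)))

# validUTF8() is part of base R (3.3.0 and later). FALSE means the content
# does not match the charset declared in the page's meta tag.
is_valid <- validUTF8(sample_html)

# Locate the bytes outside the ASCII range to see how much text is affected.
bytes <- as.integer(charToRaw(sample_html))
non_ascii_positions <- which(bytes >= 128L)

print(is_valid)
print(non_ascii_positions)
```

If only a handful of positions come back, the damage from a forced conversion is likely to be limited to a few characters.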

If you are happy with a quick and dirty approach, and some potential mojibake, you can just force utf-8 using iconv:

TV_Audio_Video_Marca <- read_html(iconv(page_source[[1]], to = "UTF-8"), encoding = "utf8")

In general, this is a bad idea: it is better to specify the encoding the text is actually in. In this case, the error may be theirs, so this quick and dirty approach might be OK.
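For comparison, here is a sketch of the more careful route, naming the source encoding in `iconv` instead of leaving it implicit. It assumes the stray bytes really are ISO-8859-1, as `guess_encoding` suggested in the question; `latin1_text` is a hypothetical sample, not the actual page, and the final `read_html` call is left commented out because it needs rvest and the real page source.

```r
# Hypothetical sample: "M", then the bytes from the error message.
# In ISO-8859-1 the byte 0xCD is the letter "Í".
latin1_text <- rawToChar(as.raw(c(0x4D, 0xCD, 0x20, 0x53, 0x2E)))

# Convert while stating where the text comes from (assumption: ISO-8859-1).
utf8_text <- iconv(latin1_text, from = "ISO-8859-1", to = "UTF-8")

# The converted string should now pass the UTF-8 validity check.
print(validUTF8(utf8_text))

# With the real page this would become something like:
# TV_Audio_Video_Marca <- read_html(
#   iconv(page_source[[1]], from = "ISO-8859-1", to = "UTF-8"),
#   encoding = "UTF-8"
# )
```

If the assumed source encoding is wrong, the conversion can still succeed and quietly produce the wrong characters, so it is worth spot-checking a few accented words in the result.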
