R rvest encoding errors with UTF-8
Question
I'm trying to get this table from Wikipedia. The source of the file claims it's UTF-8:
> <!DOCTYPE html> <html lang="en" dir="ltr" class="client-nojs"> <head>
> <meta charset="UTF-8"/> <title>List of cities in Colombia - Wikipedia,
> the free encyclopedia</title>
> ...
However, when I try to get the table with rvest, it shows weird characters where there should be accented (standard Spanish) ones like á, é, etc.
This is what I attempted:
theurl <- "https://en.wikipedia.org/wiki/List_of_cities_in_Colombia"
file <- read_html(theurl, encoding = "UTF-8")
tables <- html_nodes(file, "table")
pop <- html_table(tables[[2]])
head(pop)
## No. City Population Department
## 1 1 Bogotá 6.840.116 Cundinamarca
## 2 2 MedellÃn 2.214.494 Antioquia
## 3 3 Cali 2.119.908 Valle del Cauca
## 4 4 Barranquilla 1.146.359 Atlántico
## 5 5 Cartagena 892.545 BolÃvar
## 6 6 Cúcuta 587.676 Norte de Santander
I have attempted to repair the encoding, as suggested in other SO questions, with:
repair_encoding(pop)
## Best guess: UTF-8 (100% confident)
## Error in stringi::stri_conv(x, from = from) :
## all elements in `str` should be a raw vectors
I've tested several different encodings (latin1, and others provided by guess_encoding()), but all of them produce similarly incorrect results.
How can I load this table correctly?
Answer
It looks like you have to use repair_encoding on a character vector, not an entire data frame...
> repair_encoding(head(pop[,2]))
Best guess: UTF-8 (80% confident)
[1] "Bogotá" "Medellín" "Cali" "Barranquilla"
[5] "Cartagena" "Cúcuta"
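To repair the whole table rather than one column at a time, a minimal sketch is to apply the same fix to every character column of the data frame. (Note that `repair_encoding()` has since been deprecated in newer rvest releases; `stringi::stri_conv()` is the underlying tool it wraps.)

```r
library(rvest)

# Apply repair_encoding() only to the character columns,
# leaving numeric columns (e.g. No., Population) untouched.
pop[] <- lapply(pop, function(col) {
  if (is.character(col)) repair_encoding(col) else col
})
head(pop)
```

This works because `pop[] <- lapply(...)` replaces each column in place while preserving the data.frame class and column names.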