Convert HTML Entity to proper character R

Question

Does anyone know of a generic function in R that can convert an HTML entity such as &#228; to its Unicode character ä? I have seen some functions that take in ä and convert it to a normal character. Any help would be appreciated. Thanks.

Below is a sample of the data; I probably have over 1 million records. Is there an easier solution than reading the data into a massive vector and changing each element in turn?

wine/name: 1999 Domaine Robert Chevillon Nuits St. Georges 1er Cru Les Vaucrains
wine/wineId: 43163
wine/variant: Pinot Noir
wine/year: 1999
review/points: N/A
review/time: 1337385600
review/userId: 1
review/userName: Eric
review/text: Well this is awfully gorgeous, especially with a nicely grilled piece of Copper River sockeye. Pine needle and piercing perfume move to a remarkably energetic and youthful palate of pure, twangy, red fruit. Beneath that is a fair amount of umami and savory aspect with a surprising amount of tannin. Lots of goodness here. Still quite young but already rewarding at this stage.

wine/name: 2001 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Sp&#228;tlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!

Update: Using stri_trans_general() with the "Latin-ASCII" transform converts accented characters such as ä to their closest ASCII equivalent, and the vapply() result needs to be assigned to save the changes.

library(XML)

# cellartracker-10records.txt is the test file to use
tester <- "/Users/petergensler/Desktop/Wine Analysis/cellartracker-10records.txt"

# Parse the file at path x as HTML and return the decoded text of its first <p> node
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x), "//p")[[1]])
}

# vapply() returns a character vector; the result must be assigned to keep it
decoded <- vapply(tester, decode, character(1), USE.NAMES = FALSE)

# Now use stringi to transliterate accented characters to ASCII
decoded <- stringi::stri_trans_general(decoded, "Latin-ASCII")
writeLines(decoded, "wines.txt")

Recommended answer

Here's one way via the XML package:

txt <- "wine/name: 2003 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Kabinett"

library("XML")
xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

> xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

The [[1]] bit is because getNodeSet() returns a list of parsed elements, even if there is only one element, as is the case here. The "//p" XPath matches because the HTML parser wraps bare text in a <p> element when it builds the document.

This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.
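
If pulling in the XML package is not an option, decimal numeric entities like the &#228; in this data can also be decoded with base R alone. The following is only a sketch under that assumption: decode_numeric_entities() is a made-up helper name, and it deliberately ignores named entities (&auml;) and hexadecimal ones (&#xE4;), which the HTML parser above handles for you.

```r
# Hypothetical helper (not from any package): decode decimal numeric
# HTML entities such as "&#228;" using base R only.
decode_numeric_entities <- function(x) {
  # Locate every "&#<digits>;" occurrence in each element of x
  m <- gregexpr("&#[0-9]+;", x)
  # Replace each match with the character for its code point
  regmatches(x, m) <- lapply(regmatches(x, m), function(ents) {
    codes <- as.integer(gsub("[^0-9]", "", ents))
    vapply(codes, intToUtf8, character(1))
  })
  x
}

txt <- "wine/name: 2003 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Kabinett"
decoded <- decode_numeric_entities(txt)
print(decoded)
```

Because gregexpr() and regmatches() are vectorised, this works on a whole character vector at once without lapply().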

If you want to run this for a character vector of length > 1, then lapply() this:

txt <- rep(txt, 2)
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
lapply(txt, decode)

or, if you want it as a vector, vapply():

> vapply(txt, decode, character(1), USE.NAMES = FALSE)
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
[2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

For the multi-line example, use the original version, but you have to write the character vector back out to a file if you want it as a multi-line document again:

txt <- "wine/name: 2001 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg 
Riesling Sp&#228;tlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!"

out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

This gives me

> out
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"

Write this out using writeLines():

writeLines(out, "wines.txt")

You'll get a text file, which can be read in again using your other parsing code:

> readLines("wines.txt")
 [1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
 [2] "Riesling Spätlese"                                            
 [3] "wine/wineId: 3058"                                            
 [4] "wine/variant: Riesling"                                       
 [5] "wine/year: 2001"                                              
 [6] "review/points: N/A"                                           
 [7] "review/time: 1095120000"                                      
 [8] "review/userId: 1"                                             
 [9] "review/userName: Eric"                                        
[10] "review/text: Hideously corked!"
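
The "other parsing code" is not shown in this thread, so here is only a rough sketch of what it might look like in base R. parse_records() is a hypothetical helper; it assumes records are separated by blank lines, fields look like "wine/..." or "review/..." followed by a colon, and any other non-blank line is a continuation of the field above it (like the wrapped wine name here).

```r
# Hypothetical helper: turn "key: value" lines into one data frame row per record.
parse_records <- function(lines) {
  lines <- trimws(lines)
  record_id <- cumsum(lines == "") + 1L            # blank lines separate records
  is_field <- grepl("^(wine|review)/[A-Za-z]+:", lines)
  field_id <- cumsum(is_field)                     # continuation lines share the id above
  keep <- lines != "" & field_id > 0L
  # Glue each field's continuation lines back onto it
  text <- vapply(split(lines[keep], field_id[keep]), paste, character(1),
                 collapse = " ", USE.NAMES = FALSE)
  record_of_field <- record_id[is_field]
  key <- sub(":.*$", "", text)
  value <- sub("^[^:]+:[[:space:]]*", "", text)
  keys <- unique(key)
  rows <- lapply(split(setNames(value, key), record_of_field),
                 function(f) unname(f[keys]))
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- keys
  out
}

# Made-up input in the same shape as the data (\u00e4 is the decoded umlaut)
lines <- c(
  "wine/name: 2001 Karth\u00e4userhof Eitelsbacher Karth\u00e4userhofberg ",
  "Riesling Sp\u00e4tlese",
  "wine/wineId: 3058",
  "review/text: Hideously corked!",
  "",
  "wine/name: 1999 Domaine Robert Chevillon Nuits St. Georges 1er Cru Les Vaucrains",
  "wine/wineId: 43163",
  "review/text: Lots of goodness here."
)
wines <- parse_records(lines)
print(wines[["wine/wineId"]])
```

A field that is missing from a record ends up as NA in that row, since every row is looked up against the full set of keys.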

And it really is a file on disk (from my Bash terminal):

$ cat wines.txt 
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg 
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!
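
Putting the steps together for a file on disk, a sketch could look like the following. The file names are temporary stand-ins, and the three-line input is made up to mimic the data. It assumes the file contains no literal HTML markup, since only the first <p> node is extracted. I have not tried this on a million records.

```r
library(XML)

# Hypothetical stand-in for the real data file
infile <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c(
  "wine/name: 2001 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Sp&#228;tlese",
  "wine/wineId: 3058",
  "review/text: Hideously corked!"
), infile)

# Collapse the file into one string so the HTML parser runs once, not once per line
raw <- paste(readLines(infile, encoding = "UTF-8"), collapse = "\n")
decoded <- xmlValue(getNodeSet(htmlParse(raw, asText = TRUE), "//p")[[1]])

# useBytes = TRUE writes the string's bytes as they are, without re-encoding
writeLines(decoded, outfile, useBytes = TRUE)
res <- readLines(outfile, encoding = "UTF-8")
print(res)
```

The point of collapsing first is that the parser is called once for the whole file instead of once per element, which is the per-element loop the question wanted to avoid.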
