How to remove strange characters using gsub in R?
Problem description
I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').
If I don't specify the encoding, I see all kinds of strange characters like:
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"
This is what it looks like after readLines(..., encoding='UTF-8'):
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"
You can see the unicode literals at the end: \u009f, \u0098, etc.
I can't find the right command and regular expression to get rid of these. I've tried:
gsub('[^[:punct:][:alnum:][\\s]]', '', text)
I also tried specifying the unicode characters, but I believe they're getting interpreted as text:
gsub('\u009', '', text) # Unchanged
Recommended answer
The easiest way to get rid of these characters is to convert from utf-8 to ascii:
combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
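As a minimal sketch of the conversion above, the snippet below runs iconv on a made-up sample string and, since the question asks about gsub, shows a gsub call that should give the same result. The sample string and the names text, ascii_only and ascii_gsub are assumptions for illustration, not taken from the original post.

```r
# Hypothetical sample: ASCII text followed by three non-ASCII characters
# (two emoji and U+263A), written with escapes so the script itself stays ASCII.
text <- "just leave it at that \U0001F61C\U0001F61D\u263A"

# The approach from the answer: convert UTF-8 to ASCII and replace every
# byte that cannot be converted with the empty string.
ascii_only <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")

# A gsub alternative: delete everything outside the printable ASCII range
# (space, 0x20, through tilde, 0x7E). perl = TRUE enables the \x escapes.
ascii_gsub <- gsub("[^\\x20-\\x7E]", "", text, perl = TRUE)

print(ascii_only)
print(identical(ascii_only, ascii_gsub))
```

Both calls also drop legitimate non-ASCII text such as accented letters, so they only fit cases where plain ASCII output is acceptable.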