How to remove strange characters using gsub in R?

Question

I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters like:

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the unicode literals at the end: \u009f, \u0098, etc.

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I also tried specifying the unicode characters, but I believe they're getting interpreted as text:

gsub('\u009', '', text) # Unchanged

Recommended answer

The easiest way to get rid of these characters is to convert from UTF-8 to ASCII, dropping anything that has no ASCII equivalent:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
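As a rough illustration of what that call does, here is a minimal sketch on a small made-up vector. The sample strings and the names `text`, `cleaned` and `cleaned_gsub` are assumptions for the example, not from the question. Since the question asked about gsub, the sketch also shows a gsub pattern that should give the same result on valid UTF-8 input; that alternative is not part of the answer above, and iconv behaviour can differ slightly between platforms.

```r
# Hypothetical sample input: mostly ASCII text with a few non-ASCII characters.
text <- c("cray cray \u263a", "caf\u00e9 & tea")

# iconv() re-encodes each element. With sub = "", every byte that cannot be
# represented in ASCII is replaced by the empty string, i.e. dropped.
cleaned <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")
print(cleaned)

# Assumed alternative using gsub: with perl = TRUE, the negated class
# [^\x01-\x7F] matches any character outside the ASCII range.
cleaned_gsub <- gsub("[^\\x01-\\x7F]", "", text, perl = TRUE)
print(cleaned_gsub)
```

Both approaches also strip legitimate accented letters (the "é" in the second sample string), so they only fit cases where plain ASCII output is acceptable.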
