How to remove strange characters using gsub in R?

Views: 131 Published: 2020/7/13 2:43:14 r unicode utf-8

Problem description

I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters, like:

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> ðŸ˜œðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode escapes at the end: \u009f, \u0098, etc.
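
One plausible reading of those escapes, not confirmed anywhere in the question, is that they are the individual UTF-8 bytes of an emoji shown as separate characters. The sketch below uses U+1F61C purely as an illustrative emoji (the file's real contents are unknown) and prints its bytes, which line up with the \xf0, \u009f, \u0098, \u009c sequence above.

```r
# Illustrative emoji only; the original file's exact contents are an assumption.
emoji <- "\U0001F61C"

# The four bytes of its UTF-8 encoding
bytes <- charToRaw(emoji)
print(bytes)

# The same bytes, each written as if it were its own code point
as_code_points <- sprintf("U+%04X", as.integer(bytes))
print(as_code_points)
```

If this reading is right, the characters to remove are leftover encoding bytes rather than ordinary text.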

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)
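
A likely reason this attempt changes nothing: inside a bracket expression the inner `[` is a literal character, and the first `]` closes the expression, so the pattern ends up requiring a literal `]` right after the matched character. A minimal sketch of the difference, assuming the intent was to keep punctuation, alphanumerics, and whitespace; the sample string is made up for illustration and contains only control characters, since how `[:punct:]` classifies emoji can vary by platform.

```r
# Hypothetical sample with two C1 control characters (U+009F, U+0098)
x <- "cray cray \u009f\u0098 ok"

# Pattern from the question: the nested [\\s] ends the bracket expression early
bad <- "[^[:punct:][:alnum:][\\s]]"

# Same idea with the POSIX class [:space:] inside a single bracket expression
good <- "[^[:punct:][:alnum:][:space:]]"

unchanged <- gsub(bad, "", x)
cleaned <- gsub(good, "", x)

print(identical(unchanged, x))
print(cleaned)
```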

I also tried specifying the Unicode characters, but I believe they're getting interpreted as text:

gsub('\u009', '', text) # Unchanged
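
A note on why this call leaves the text unchanged: R's `\u` escape takes one to four hex digits, so `'\u009'` is parsed as U+0009, a tab, rather than as a prefix of `\u009f`. The sketch below shows that, then two hedged ways to drop such characters: a code point range with `perl = TRUE`, and `iconv` with `sub = ""` to discard everything non-ASCII. The sample string is hypothetical and valid UTF-8; input that also contains raw invalid bytes such as `\xf0` may need different handling that is not covered here.

```r
# "\u009" is the single character U+0009 (tab)
is_tab <- identical("\u009", "\t")
print(is_tab)

# Hypothetical sample ending in three C1 control characters
x <- "leave it at that \u009f\u0098\u009c"

# Remove code points U+0080 to U+009F
cleaned <- gsub("[\\x{80}-\\x{9f}]", "", x, perl = TRUE)

# Alternative: drop every character that has no ASCII equivalent
ascii_only <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(cleaned)
print(ascii_only)
```

The range-based version keeps other non-ASCII text such as accented letters, while the `iconv` version removes it along with the control characters.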
