Trouble with strings with <U+0092> Unicode characters

Views: 174 · Published: 2020/10/29 6:11:10 · Tags: r, unicode, encoding

Problem description

I have a very large dataset (70k rows, 2600 columns, CSV format) that I created by web scraping. Unfortunately, at some point during pre-processing and processing, some problematic characters became encoded in an odd way, and I am having trouble dealing with them.

I have strings like the following:

x = "but it doesn<U+0092>t matter"

Looking up the code point, we can see that it should be the character ’, which in turn should really be ' (the data are user-generated, so they may contain all kinds of odd characters). From looking up that character, it seems that others also have problems with it (1, 2, 3). It is labelled a control character; I am not sure what that means, but perhaps that is why it is so hard to deal with.
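
One plausible explanation for the mismatch, assuming the scraped text was originally encoded in Windows-1252: in that code page the byte 0x92 is the right single quotation mark, whereas the Unicode code point U+0092 is a C1 control character. The following sketch only illustrates that mapping; the variable names are made up for the example.

```r
# Assumption: the original text was Windows-1252 (CP1252).
# Build a one-byte string holding the raw byte 0x92.
byte_92 <- rawToChar(as.raw(0x92))

# Interpret that byte as CP1252 and convert it to UTF-8.
quote_char <- iconv(byte_92, from = "CP1252", to = "UTF-8")

# Show the resulting code point in the usual U+XXXX notation.
sprintf("U+%04X", utf8ToInt(quote_char))
```

If the byte had been decoded as CP1252 from the start, the curly apostrophe would have survived; the U+0092 form suggests it was decoded with a different encoding somewhere along the way.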

Most of the other questions about Unicode in R concern Unicode in a format like \u0092.

We try:

#> x = "but it doesn<U+0092>t matter"
#> Encoding(x)
#[1] "unknown"
#> Encoding(x) = "UTF-8"
#> Encoding(x)
#[1] "unknown"
#> x
#[1] "but it doesn<U+0092>t matter"

So this does not seem to do anything.

There are a few prior questions that concern this Unicode format and try to convert it:

Display unicode in R
gsub in R with unicode replacement give different results under Windows compared with Unix?

Oddly, the example they give works, but mine doesn't.

#> test.string <- "This is a <U+03B1> <U+03B2> <U+03B2> <U+03B3> test <U+03B4> string."
#> Encoding(test.string)
#[1] "unknown"
#> to_true_unicode(test.string)
#[1] "This is a α β β γ test δ string."

However:

#> x2 = to_true_unicode(x)
#> x2
#[1] "but it doesn\u0092t matter"
#> cat(x2)
#but it doesnt matter
#> Encoding(x2)
#[1] "UTF-8"

So it managed to convert from the <U+....> format to the \u format, and cat() prints the string without that character (or with a bugged symbol on SO).
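
The to_true_unicode() helper comes from one of the linked questions, and its definition is not shown here. A minimal base R sketch of the same idea follows, under the assumption that the string literally contains the ASCII text <U+XXXX>. The name unescape_u_plus() is hypothetical and is not the linked helper.

```r
# Hypothetical helper: replace each literal "<U+XXXX>" token with the
# character whose code point is the hexadecimal number XXXX.
unescape_u_plus <- function(s) {
  # Locate all tokens; "+" is escaped because it is a regex quantifier.
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(tokens) {
    # Strip the "<U+" prefix and ">" suffix, leaving only the hex digits.
    hex <- gsub("[<>U+]", "", tokens)
    # Convert each hex string to an integer code point, then to a character.
    vapply(strtoi(hex, 16L), intToUtf8, character(1))
  })
  s
}

x <- "but it doesn<U+0092>t matter"
x2 <- unescape_u_plus(x)
```

How the result prints depends on the platform and locale, so the sketch does not show console output.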

I only have a limited number of these problems, so I could perhaps just use search-replace to solve it. However:

#> #base-r
#> gsub(x = x, pattern = "<U+0092>", replacement = "'")
#[1] "but it doesn<U+0092>t matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x, pattern = "<U+0092>", "'")
#[1] "but it doesn<U+0092>t matter"

So replacement does not seem to work on the <U+....> version, but it does work on the \u version:

#> #base-r
#> gsub(x = x2, pattern = "\u0092", replacement = "'")
#[1] "but it doesn't matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x2, pattern = "\u0092", "'")
#[1] "but it doesn't matter"
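
A likely reason the first attempt returns the string unchanged, assuming x literally contains the eight ASCII characters <U+0092>: in a regular expression, + is a quantifier, so the pattern "<U+0092>" means "<, one or more U, then 0092>" and never matches a literal plus sign. A sketch of two base R ways around this:

```r
x <- "but it doesn<U+0092>t matter"

# Pattern treated as a regex: "+" quantifies "U", so nothing matches.
regex_result <- gsub("<U+0092>", "'", x)

# Option 1: treat the pattern as a fixed string, not a regex.
fixed_result <- gsub("<U+0092>", "'", x, fixed = TRUE)

# Option 2: keep the regex but escape the plus sign.
escaped_result <- gsub("<U\\+0092>", "'", x)
```

With stringr, wrapping the pattern in fixed() should have the same effect as fixed = TRUE in base R.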

So, this suggests a working method: 1) convert the <U+....> format to the \u format, then 2) use search-replace.

stringi::stri_unescape_unicode() does not seem to work with either version:

#> stringi::stri_unescape_unicode(x)
#[1] "but it doesn<U+0092>t matter"
#> stringi::stri_unescape_unicode(x2)
#[1] "but it doesn\u0092t matter"

Is there some generally applicable way to deal with problems like this?

My sessionInfo is:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.3   stringi_1.0-1

I am running R via RStudio (0.99.893, preview) on Windows 8.1, 64-bit. Keyboard and time units are Danish, but everything else is in English.

Recommended answer

Not sure it will work for you, but for the same symptoms I converted the strings to ASCII:

x <- iconv(x, "", "ASCII", "byte")

For non-ASCII characters, the indication is "<xx>" with the hex code of the byte.
You can then gsub the hex codes to the values that suit you.
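
The two steps of this answer can be sketched as follows. This is an illustration under assumptions, not the answerer's exact code: it starts from a string that already holds the real U+0092 character (like x2 in the question) and passes from = "UTF-8" explicitly instead of "", so the result does not depend on the session's locale.

```r
# Assumed input: a UTF-8 string containing the control character U+0092.
x2 <- "but it doesn\u0092t matter"

# Step 1: convert to ASCII; each non-ASCII byte becomes "<xx>" (hex code).
# U+0092 is two bytes in UTF-8 (c2 92), so it becomes "<c2><92>".
ascii_version <- iconv(x2, from = "UTF-8", to = "ASCII", sub = "byte")

# Step 2: replace the byte markers with the intended character.
# fixed = TRUE avoids any regex interpretation of the pattern.
cleaned <- gsub("<c2><92>", "'", ascii_version, fixed = TRUE)
```

The byte markers depend on the source encoding: the same character stored in a single-byte encoding would show up as one marker rather than two, so it is worth inspecting the output of step 1 before writing the replacements.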
