Does tidytext::unnest_tokens work with Spanish characters?


Question

I am trying to use unnest_tokens with spanish text. It works fine with unigrams, but breaks the special characters with bigrams.

The code works fine on Linux. I added some info on the locale.

library(tidytext)
library(dplyr)

df <- data_frame(
  text = "César Moreira Nuñez"
)

# works ok:
df %>% 
  unnest_tokens(word, text)


# # A tibble: 3 x 1
# word
# <chr>
# 1 césar
# 2 moreira
# 3 nuñez

# breaks é and ñ
df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2 )

# # A tibble: 2 x 1
# bigram
# <chr>
# 1 cã©sar moreira
# 2 moreira nuã±ez

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Answer

We have chatted with several people who have run into issues with encoding before, with Polish and Estonian. It's always a bit tricky because I can never reproduce the problem locally, as I cannot with your problem:

library(tidytext)
library(dplyr)

df <- data_frame(
  text = "César Moreira Nuñez"
)

df %>% 
  unnest_tokens(word, text)
#> # A tibble: 3 x 1
#>   word   
#>   <chr>  
#> 1 césar  
#> 2 moreira
#> 3 nuñez

df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2 )
#> # A tibble: 2 x 1
#>   bigram       
#>   <chr>        
#> 1 césar moreira
#> 2 moreira nuñez

You say that your code works fine on Linux, and this aligns with others' experience as well. This seems to always be a Windows encoding issue. This isn't related to the code in the tidytext package, or even the tokenizers package; from what I've seen, I suspect this is related to the C libraries in stringi and how they act on Windows compared to other platforms. Because of this, you'll likely have the same problems with anything that depends on stringi (which is practically ALL of NLP in R).
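Since the symptom (é becoming ã©, ñ becoming ã±) is classic UTF-8-read-as-Latin-1 mojibake, a common first step on Windows is to check how R has marked the string's encoding and declare it as UTF-8 before tokenizing. This is a hedged workaround sketch using base R's `enc2utf8()` and stringi's encoding helpers, not a guaranteed fix for every Windows locale:

```r
library(stringi)

x <- "César Moreira Nuñez"

# See which encoding R has marked this string with
# (typically "UTF-8" or "latin1" depending on platform/session)
stri_enc_mark(x)

# Declare the string as UTF-8 before passing it to unnest_tokens()
x_utf8 <- enc2utf8(x)

# Confirm the bytes are valid UTF-8
stri_enc_isutf8(x_utf8)
```

If the text was read from a file, it can also help to request the encoding at read time (e.g. `readLines(path, encoding = "UTF-8")` or readr's `locale(encoding = "UTF-8")`), so the mojibake never enters the pipeline in the first place.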

