How can I process Chinese/Japanese characters with R?

Question

I would like to use a tm-like package in R to split and identify non-English characters (mainly Japanese, Thai, and Chinese). The goal is to convert the text into some sort of matrix-like format and then run a random forest or logistic regression for text classification. Is this possible with tm or another R package?

Recommended answer

It looks like R has a hard time reading non-English characters in as text. I tried scraping the "Chinese alphabet" table from the web and got a result that may help, provided the character encoding is consistent.

### Load the package used to parse the HTML contents of a web page
library(XML)
### Open a connection to the page (named con so it does not mask url())
con <- url('http://www.chinese-tools.com/characters/alphabet.html')
### Read in the content line by line, marking the strings as UTF-8
page <- readLines(con, encoding = "UTF-8")
close(con)
### Parse the HTML code
page <- htmlParse(page, encoding = "UTF-8")
### Create a list of the tables on the page
page <- readHTMLTable(page)
### The alphabet is contained in the third table of the page;
### [[3]] extracts the table itself rather than a one-element list
alphabet <- as.data.frame(page[[3]])

You now have a data frame listing the letters of the US alphabet, with another column showing how the corresponding characters were read into R. If the characters in the original object you wish to text mine were read in the same way, it may be possible to use regular expressions to search for these encoded characters one at a time.
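
The regular-expression idea above can be sketched in base R without any scraping. This is only an illustration, not the answer's own method: it assumes the text is already held as UTF-8 strings, and the sample sentences and the names `docs`, `tokens`, `vocab`, and `dtm` are made up for the example. It uses Unicode script classes such as `\p{Han}` (available with `perl = TRUE`) to find the characters, then counts them into a document-by-character matrix of the kind the question asks for.

```r
# Hypothetical sample documents, written with \u escapes so the script
# itself stays ASCII; R marks such strings as UTF-8.
docs <- c(
  zh = "\u6211\u559c\u6b22\u732b",                                  # Chinese
  ja = "\u732b\u304c\u597d\u304d\u3067\u3059",                      # Japanese
  th = "\u0e09\u0e31\u0e19\u0e23\u0e31\u0e01\u0e41\u0e21\u0e27",    # Thai
  en = "I like cats"
)

# Identify which scripts occur in each document.
has_han  <- grepl("\\p{Han}", docs, perl = TRUE)
has_kana <- grepl("[\\p{Hiragana}\\p{Katakana}]", docs, perl = TRUE)
has_thai <- grepl("\\p{Thai}", docs, perl = TRUE)

# Split each document into its individual CJK/Thai characters.
pattern <- "[\\p{Han}\\p{Hiragana}\\p{Katakana}\\p{Thai}]"
tokens  <- regmatches(docs, gregexpr(pattern, docs, perl = TRUE))
names(tokens) <- names(docs)

# Build a document-by-character count matrix.
vocab <- unique(unlist(tokens))
dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (i in seq_along(tokens)) {
  # match() maps each character to its column; tabulate() counts them.
  dtm[i, ] <- tabulate(match(tokens[[i]], vocab), nbins = length(vocab))
}
```

Each row of `dtm` could then be used as a feature vector for a random forest or logistic regression. Single characters are a crude stand-in for words, since none of these languages separates words with spaces, so a dedicated word segmenter would likely give better features.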
