Feature hashing in R for Text classification

Views: 167 | Published: 2018/6/1 19:14:08 | Tags: r, hash, hashcode, feature-extraction, text-classification


Problem description

I'm trying to implement feature hashing in R to help me with a text classification problem, but I'm not sure I'm doing it the way it should be done. Part of my code is based on this post: "Hashing function for mapping integers to a given range?".

My code:
    random.data = function(n = 200, wlen = 40, ncol = 10){
      random.word = function(n){
        paste0(sample(c(letters, 0:9), n, TRUE), collapse = '')
      }
      matrix(replicate(n, random.word(wlen)), ncol = ncol)
    }

    feature_hash = function(doc, N){
      doc = as.matrix(doc)
      library(digest)
      idx = matrix(strtoi(substr(sapply(doc, digest), 28, 32), 16L) %% (N + 1), ncol = ncol(doc))
      sapply(1:N, function(r)apply(idx, 1, function(v)sum(v == r)))
    }

    set.seed(1)
    doc = random.data(50, 16, 5)
    feature_hash(doc, 3)

          [,1] [,2] [,3]
     [1,]    2    0    1
     [2,]    2    1    1
     [3,]    2    0    1
     [4,]    0    2    1
     [5,]    1    1    1
     [6,]    1    0    1
     [7,]    1    2    0
     [8,]    2    0    0
     [9,]    3    1    0
    [10,]    2    1    0

So, I'm basically converting the strings to integers using the last 5 hex digits of the MD5 hash returned by digest. Questions:

1 - Is there any package that can do this for me? I haven't found any.
2 - Is it a good idea to use digest as the hash function? If not, what can I do?

PS: I should have tested whether it works before posting, but my files are quite big and take a lot of processing time, so I think it's smarter to have someone point me in the right direction, because I'm sure I'm doing it wrong!
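The hashing step itself can be checked by hand on a single string, without processing any large files. The sketch below walks through the `substr` / `strtoi` / `%%` chain from `feature_hash` for one input. It makes two assumptions: the MD5 value is hard-coded (it is the plain MD5 of the string "hello", which is what `digest("hello", algo = "md5", serialize = FALSE)` should return), and the variable names are illustrative only. Note that the code above calls `digest` with its default `serialize = TRUE`, so it hashes the serialized R object and would produce a different hex string for the same word.

```r
# MD5 of the string "hello", hard-coded so the example needs no packages.
# Assumption: equals digest("hello", algo = "md5", serialize = FALSE).
md5_hello <- "5d41402abc4b2a76b9719d911017c592"

N <- 3                              # number of buckets, as in feature_hash(doc, 3)
last5  <- substr(md5_hello, 28, 32) # last 5 hex digits of the 32-character hash
value  <- strtoi(last5, 16L)        # hex string -> integer in 0 .. 16^5 - 1
bucket <- value %% (N + 1)          # the mapping used above: 0 .. N

print(last5)   # "7c592"
print(value)   # 509330
print(bucket)  # 2
```

Because the modulus is `N + 1`, the bucket index ranges over 0..N, while `feature_hash` only counts indices 1..N. Strings that land in bucket 0 are therefore not counted, which appears to be why the rows of the output above sum to less than 5.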

Thanks for any help on this!

Solution
I don't know of any existing CRAN package for this.

However, I wrote a package for myself to do feature hashing. The source code is here: https://github.com/wush978/FeatureHashing, but the API is different.

In my case, I use it to convert a data.frame to a CSRMatrix, a custom sparse matrix class defined in the package. I also implemented a helper function to convert the CSRMatrix to a Matrix::dgCMatrix. For text classification, I guess a sparse matrix will be more suitable.
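To illustrate why a sparse matrix suits hashed text features, here is a minimal sketch that builds a `dgCMatrix` of hashed token counts with `Matrix::sparseMatrix`. It does not use the FeatureHashing package or its API. The functions `toy_hash` and `hash_docs` are hypothetical helpers written for this example, and the polynomial rolling hash is only a stand-in for a real hash function.

```r
library(Matrix)  # recommended package that ships with R

# Toy string hash (illustrative stand-in, NOT the FeatureHashing implementation):
# a polynomial rolling hash over the code points, reduced to 1 .. n_buckets.
toy_hash <- function(token, n_buckets) {
  h <- 0
  for (code in utf8ToInt(token)) {
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1  # shift to 1-based column indices so no bucket is dropped
}

# Hypothetical helper: one row per document, one column per hash bucket.
hash_docs <- function(docs, n_buckets) {
  tokens <- strsplit(docs, " ", fixed = TRUE)
  rows <- rep(seq_along(tokens), lengths(tokens))
  cols <- vapply(unlist(tokens), toy_hash, numeric(1),
                 n_buckets = n_buckets, USE.NAMES = FALSE)
  # sparseMatrix() adds up x over duplicated (i, j) pairs, which yields counts.
  sparseMatrix(i = rows, j = cols, x = 1, dims = c(length(docs), n_buckets))
}

docs <- c("the cat sat", "the dog sat on the mat")
m <- hash_docs(docs, n_buckets = 8)
print(class(m))
print(rowSums(m))
```

Only the non-zero counts are stored, so the number of buckets can be made large without allocating a dense documents-by-buckets matrix. Each row sums to the number of tokens in that document, since every token is assigned to exactly one bucket.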

If you want to try it, please check the test script here: https://github.com/wush978/FeatureHashing/blob/master/tests/test-conver-to-dgCMatrix.R

Note that I have only used it on Ubuntu, so I don't know whether it works on Windows or Mac. Please feel free to ask me any questions about the package at https://github.com/wush978/FeatureHashing/issues.
