Feature hashing in R for text classification

Question

I'm trying to implement feature hashing in R to help with a text classification problem, but I'm not sure I'm doing it correctly. Part of my code is based on this post: Hashing function for mapping integers to a given range?

My code:

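library(digest)

# Build a character matrix of random "words": n words of wlen characters each,
# arranged in ncol columns (so n / ncol rows, one row per document).
random.data = function(n = 200, wlen = 40, ncol = 10){

  random.word = function(n){
    paste0(sample(c(letters, 0:9), n, TRUE), collapse = '')
  }
  matrix(replicate(n, random.word(wlen)), ncol = ncol)
}

# Hash every word to an integer and count, per row, how often each of the
# values 1..N occurs.
feature_hash = function(doc, N){

  doc = as.matrix(doc)

  # Last 5 hex digits of the md5 digest, as an integer, reduced modulo N + 1.
  # The result lies in 0..N; words that map to 0 are not counted below.
  idx = matrix(strtoi(substr(sapply(doc, digest), 28, 32), 16L) %% (N + 1), ncol = ncol(doc))
  sapply(1:N, function(r)apply(idx, 1, function(v)sum(v == r)))
}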

set.seed(1)
doc = random.data(50, 16, 5)
feature_hash(doc, 3)

       [,1] [,2] [,3]
 [1,]    2    0    1
 [2,]    2    1    1
 [3,]    2    0    1
 [4,]    0    2    1
 [5,]    1    1    1
 [6,]    1    0    1
 [7,]    1    2    0
 [8,]    2    0    0
 [9,]    3    1    0
[10,]    2    1    0

So I'm basically converting the strings to integers using the last 5 hex digits of the md5 hash returned by digest. Questions:

1 - Is there any package that can do this for me? I haven't found any.
2 - Is it a good idea to use digest as the hash function? If not, what can I do?

PS: I should test whether it works before posting, but my files are quite big and take a lot of processing time, so I think it's smarter to have someone point me in the right direction, because I'm sure I'm doing something wrong!

Thanks for any help on this!

Answer

I don't know of any existing CRAN package for this.

However, I wrote a package for myself to do feature hashing. The source code is here: https://github.com/wush978/FeatureHashing, but the API is different.

In my case, I use it to convert a data.frame to a CSRMatrix, a custom sparse matrix class defined in the package. I also implemented a helper function to convert the CSRMatrix to a Matrix::dgCMatrix. For text classification, I guess a sparse matrix will be more suitable.

If you want to try it, please check the test script here: https://github.com/wush978/FeatureHashing/blob/master/tests/test-conver-to-dgCMatrix.R

Note that I have only used it on Ubuntu, so I don't know whether it works on Windows or macOS. Please feel free to ask any questions about the package at https://github.com/wush978/FeatureHashing/issues.
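
To make the sparse-matrix point concrete, here is a minimal sketch of hashing tokens straight into a sparse count matrix. It is not the FeatureHashing API: `hash_tokens`, `docs` and `n_buckets` are hypothetical names chosen for illustration, and it assumes the digest and Matrix packages are installed. It reuses the question's idea of taking the last 5 hex digits of the md5 digest, but reduces them with `%% n_buckets + 1` so that every token lands in a column between 1 and `n_buckets`.

```r
library(digest)   # md5 hashing, as in the question
library(Matrix)   # sparse matrix classes

# Hypothetical helper (not part of FeatureHashing): hash each token of each
# document into one of n_buckets columns and count the hits.
hash_tokens <- function(docs, n_buckets) {
  # docs: a list of character vectors, one vector of tokens per document
  rows   <- rep(seq_along(docs), lengths(docs))
  tokens <- unlist(docs, use.names = FALSE)
  # md5 of the raw string (serialize = FALSE); keep the last 5 hex digits
  hex  <- vapply(tokens, digest, character(1),
                 algo = "md5", serialize = FALSE, USE.NAMES = FALSE)
  cols <- strtoi(substr(hex, 28, 32), 16L) %% n_buckets + 1L
  # sparseMatrix() adds up the x values of repeated (i, j) pairs,
  # so each cell ends up holding a token count
  sparseMatrix(i = rows, j = cols, x = rep(1, length(rows)),
               dims = c(length(docs), n_buckets))
}

docs <- list(c("the", "cat", "sat"), c("the", "dog"), c("cat", "cat", "cat"))
m <- hash_tokens(docs, 8)
print(m)
```

Each row of `m` sums to the number of tokens in that document, and memory use grows with the number of non-zero cells rather than with `n_buckets`, which is what makes a larger bucket count (and so fewer collisions) affordable. Whether md5 is fast enough for a large corpus is a separate question that this sketch does not address.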
