Clustering strings in R (is it possible?)

Problem description

I have a dataset with a column that is currently being treated as a factor with 1000+ levels. These are values for the column. I would like to clean up this data. Some values are strings like "-18 + 5 = -13" and "5 - 18 = -13"; I would like the clustering to group these differently than, say, "R3no4".

Is this possible in R? I looked at the natural language processing task view http://cran.r-project.org/web/views/NaturalLanguageProcessing.html but I need to be pushed in the right direction.

The dataset is from the KDD Cup 2010. I would like to create meaningful new columns from this column to aid in creating a predictive model. For example, it would be nice to know whether a string contains a certain operation, or whether it contains no operations and instead describes the problem.

My data frame looks like this:

str(data1)
'data.frame':   809694 obs. of  19 variables:
 $ Row                        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Anon.Student.Id            : Factor w/ 574 levels "02i5jCrfQK","02ZjVTxC34",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ Problem.Hierarchy          : Factor w/ 138 levels "Unit CTA1_01, Section CTA1_01-1",..: 80 80 80 80 80 80 80 80 80 80 ...
 $ Problem.Name               : Factor w/ 1084 levels "1PTB02","1PTB03",..: 377 377 378 378 378 378 378 378 378 378 ...
 $ Problem.View               : int  1 1 1 1 2 2 3 3 4 4 ...
 $ Step.Name                  : Factor w/ 187539 levels "-(-0.24444444-y) = -0.93333333",..: 116742 177541 104443 64186 58776 58892 153246 153078 45114 163923 ...

I'm most interested in the Step.Name feature, since it contains the greatest number of unique factor values.

And some example values for Step.Name:

[97170] (1+7)/4 = x                                                               
[97171] (1-sqrt(1^2-4*2*-6))/4 = x                                                
[97172] (1-sqrt(1^2-(-48)))/4 = x                                                 
[97173] (1-sqrt(1-(-48)))/4 = x                                                   
[97174] (1-sqrt(49))/4 = x                                                        
[97175] (1-7)/4 = x                                                               
[97176] x^2+15x+44 = 0                                                            
[97177] a-factor-node                                                             
[97178] b-factor-node                                                             
[97179] c-factor-node                                                             
[97180] num1-factor-node                                                          
[97181] num2-factor-node                                                          
[97182] den1-factor-node                                                          
[97183] (-15?sqrt((-15)^2-4*1*44))/2 = x                                          
[97184] (-15+sqrt((-15)^2-4*1*44))/2 = x                                          
[97185] (-15+sqrt((-15)^2-176))/2 = x                                             
[97186] (-15+sqrt(225-176))/2 = x                                                 
[97187] (-15+sqrt(49))/2 = x                                                      
[97188] (-15+7)/2 = x                                                             
[97189] (-15-sqrt((-15)^2-4*1*44))/2 = x                                          
[97190] (-15-sqrt((-15)^2-176))/2 = x                                             
[97191] (-15-sqrt(225-176))/2 = x                                                 
[97192] (-15-sqrt(49))/2 = x                                                      
[97193] (-15-7)/2 = x                                                             
[97194] 2x^2+x = 0                                                                
[97195] a-factor-node                                                             
[97196] b-factor-node                                                             
[97197] c-factor-node                                                             
[97198] num1-factor-node                                                          
[97199] num2-factor-node                                                          
[97200] den1-factor-node                                                          
[97201] (-1?sqrt((-1)^2-4*2*0))/4 = x                                             
[97202] (-1+sqrt((-1)^2-4*2*0))/4 = x                                             
[97203] (-1+sqrt((-1)^2-0))/4 = x                                                 
[97204] (-1+sqrt((-1)^2))/4 = x                                                   
[97205] (-1+1)/4 = x                                                              
[97206] (-1-sqrt((-1)^2-4*2*0))/4 = x                                             
[97207] (-1-sqrt((-1)^2-0))/4 = x                                                 
[97208] (-1-sqrt((-1)^2))/4 = x                                                   
[97209] (-1-1)/4 = x                                                              
[97210] x^2-6x = 0                                                                
[97211] a-factor-node                                                             
[97212] b-factor-node                                                                

Solution

Clustering is just scoring each instance in a data array according to some metric, sorting the data array according to this calculated score, then slicing it into some number of segments and assigning a label to each one.

In other words, you can cluster any data for which you can formulate some meaningful function to calculate the similarity of each data point with respect to the others; this is usually referred to as a similarity metric.

There are a lot of these, but only a small subset of them are useful to evaluate strings. Of these, perhaps the most commonly used is Levenshtein Distance (aka Edit Distance).

This metric is expressed as an integer, and it increases by one unit (+1) for each 'edit' (inserting, deleting, or changing a letter) required to transform one word into another. The smallest number of such single-letter edits that turns the first word into the second is the Levenshtein Distance.
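
As a minimal sketch of that definition (not the vwr implementation), the distance can be computed with the usual dynamic-programming table. The function name lev_dist is hypothetical and used only for illustration; the result is checked against adist() from base R's utils package, which computes the same kind of edit distance.

```r
# Hypothetical helper: Levenshtein distance between two single strings,
# counting insertions, deletions and substitutions as 1 edit each.
lev_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n   # turning a prefix of a into "" takes that many deletions
  d[1, ] <- 0:m   # turning "" into a prefix of b takes that many insertions
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution (or match)
    }
  }
  d[n + 1, m + 1]
}

d_hat  <- lev_dist("cat", "hat")            # one substitution
d_long <- lev_dist("cat", "catwalk")        # four insertions
d_walk <- lev_dist("catwalk", "sidewalk")

# Cross-check against base R; adist() returns a matrix, so take element [1, 1]
same_as_base <- d_walk == adist("catwalk", "sidewalk")[1, 1]
print(c(hat = d_hat, catwalk = d_long, sidewalk = d_walk))
```

In practice adist() is probably the more convenient choice, since it is vectorised and needs no extra package.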

The R Package vwr includes an implementation:

> library(vwr)
> levenshtein.distance('cat', 'hat')
    hat 
    1 
> levenshtein.distance('cat', 'catwalk')
    catwalk 
    4 
> levenshtein.distance('catwalk', 'sidewalk')
    sidewalk 
    4

> # using a data set supplied with the vwr library 
> EW = english.words
> ew1 = sample(EW, 20)     # random select 20 words from EW
> # the second argument is a vector of words, returns a vector of distances
> dx = levenshtein.distance('cat', ew1)
> dx
furriers       graves      crooned    cursively       gabled   caparisons   drainpipes 
    8            5            6            8            5            8            9 
patricians     medially     beholder   chirpiness    fluttered     bobolink   lamentably 
    8            7            8            9            8            8            8 
depredations      alights    unearthed     thimbles    supersede   dissembler 
    10            6            7            8            9           10

While Levenshtein Distance can be used to cluster your data, whether it should be used for your data is a question I'll leave to you (the primary use case for Levenshtein Distance is clearly pure text data).
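
For what it is worth, here is a rough sketch of how an edit-distance matrix could feed a standard clustering routine. It uses base R only (adist(), hclust(), cutree()) rather than vwr, and the six strings are a hand-picked sample of the Step.Name values from the question. The choice of average linkage and k = 2 is an assumption made for illustration, not a recommendation.

```r
# A few Step.Name values taken from the question
steps <- c("(1+7)/4 = x", "(1-7)/4 = x", "(-15+7)/2 = x",
           "a-factor-node", "b-factor-node", "num1-factor-node")

# Pairwise Levenshtein distances, converted to a 'dist' object for hclust()
d <- as.dist(adist(steps))

# Hierarchical clustering; linkage method and number of groups are assumptions
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# Show which strings ended up together
print(split(steps, groups))
```

On this small sample the equations and the "-factor-node" labels land in separate groups. Note that the full pairwise matrix grows quadratically with the number of distinct strings, so with 187539 levels it would probably be necessary to work on a sample or on pre-grouped values.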

(Perhaps the next-most-common similarity metric that operates on strings is Hamming Distance. Hamming Distance (unlike Levenshtein) requires that the two strings be of equal length, hence it won't work for your data.)
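
To make the equal-length restriction concrete, a small sketch follows. The function hamming_dist is a hypothetical helper written for this illustration, not part of vwr or base R.

```r
# Hypothetical helper: number of positions at which two strings differ.
# Only defined when both strings have the same number of characters.
hamming_dist <- function(a, b) {
  if (nchar(a) != nchar(b)) {
    stop("Hamming distance needs strings of equal length")
  }
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

h_equal <- hamming_dist("cat", "hat")   # same length, so this works

# Strings of different length raise an error; capture it as NA here
h_unequal <- tryCatch(hamming_dist("cat", "catwalk"),
                      error = function(e) NA)
print(c(equal = h_equal, unequal = h_unequal))
```

Since the Step.Name values vary widely in length, almost every pair would hit the error branch.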
