Problem with big data (?) during computation of sequence distances using TraMineR

Problem description

I am trying to run an optimal matching analysis using TraMineR but it seems that I am encountering an issue with the size of the dataset. I have a big dataset of European countries which contains employment spells. I have more than 57,000 sequences which are 48 units long and consist of 9 distinct states. In order to get an idea of the analysis, here is the head of sequence object employdat.sts:

[1] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[2] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[3] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  
[4] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  
[5] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[6] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  

In a shorter SPS format, this reads as follows:

Sequence               
[1] "(EF,48)"              
[2] "(EF,48)"              
[3] "(ST,48)"              
[4] "(ST,36)-(MS,3)-(EF,9)"
[5] "(EF,48)"              
[6] "(ST,24)-(EF,24)"

After passing this sequence object to the seqdist() function, I get the following error message:

employdat.om <- seqdist(employdat.sts, method="OM", sm="CONSTANT", indel=4)    
[>] creating 9x9 substitution-cost matrix using 2 as constant value  
[>] 57160 sequences with 9 distinct events/states  
[>] 12626 distinct sequences  
[>] min/max sequence length: 48/48  
[>] computing distances using OM metric  
Error in .Call(TMR_cstringdistance, as.integer(dseq), as.integer(dim(dseq)),  : negative length vectors are not allowed

Is this error related to the huge number of distinct, long sequences? I am using an x64 machine with 4 GB of RAM, and I also tried a machine with 8 GB of RAM, which reproduced the error message. Does anyone know a way to tackle this error? Incidentally, running the analysis separately for each country, using the same syntax with an index for the country, worked well and produced meaningful results.

Recommended answer

I have never seen this error before, but it may well be caused by the large number of sequences. There are at least two things you can try:

  • Use the argument "full.matrix=FALSE" in seqdist (see the help page). It computes only the lower triangular matrix and returns a "dist" object that can be used directly in the hclust function.
  • Aggregate identical sequences (you have only 12626 distinct sequences among the 57160), compute the distances between the distinct sequences, cluster them using weights (the number of times each distinct sequence appears in the dataset), and then add the clustering back to your original dataset. This is quite easy to do with the WeightedCluster package. The first appendix of the WeightedCluster manual provides a step-by-step guide (the procedure is also described on the webpage http://mephisto.unige.ch/weightedcluster).
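
To make the two suggestions concrete, here is a minimal sketch in R. The first part is plain arithmetic on the counts reported in the question. One plausible reading of the "negative length vectors" message (an assumption, not something the error output confirms) is that a 57160 x 57160 matrix has more cells than R's 32-bit integer limit. The second part walks through the aggregate, cluster, and disaggregate steps. Because the original employdat data is not available, it uses the mvad example data shipped with TraMineR. The column range 17:86, the Ward linkage, and the choice of 4 clusters are illustrative assumptions only.

```r
# Part 1: rough size of the distance matrix, stored as 8-byte doubles.
n_all <- 57160        # sequences reported by seqdist() in the question
n_distinct <- 12626   # distinct sequences reported by seqdist()

full_cells <- n_all^2                      # 3,267,265,600 cells
max_int <- .Machine$integer.max            # 2,147,483,647
full_gb <- full_cells * 8 / 1e9            # about 26 GB
tri_gb <- n_all * (n_all - 1) / 2 * 8 / 1e9                      # about 13 GB
distinct_tri_gb <- n_distinct * (n_distinct - 1) / 2 * 8 / 1e9   # about 0.64 GB

# Part 2: aggregate identical sequences, cluster with weights, disaggregate.
library(TraMineR)
library(WeightedCluster)

data(mvad)            # example data from TraMineR, stands in for the real data
seq_cols <- 17:86     # columns holding the monthly states in mvad

# Keep one row per distinct sequence and count how often each one occurs.
ac <- wcAggregateCases(mvad[, seq_cols])
uniq.sts <- seqdef(mvad[ac$aggIndex, seq_cols], weights = ac$aggWeights)

# Distances between distinct sequences only, returned as a "dist" object.
uniq.om <- seqdist(uniq.sts, method = "OM", sm = "CONSTANT", indel = 4,
                   full.matrix = FALSE)

# Weighted hierarchical clustering; 'members' carries the sequence counts.
hc <- hclust(uniq.om, method = "ward.D", members = ac$aggWeights)
cl <- cutree(hc, k = 4)   # 4 clusters is an arbitrary choice for illustration

# Map the cluster of each distinct sequence back to every original row.
mvad$cluster <- cl[ac$disaggIndex]
```

With the counts from the question, even the lower triangle for all 57160 sequences needs roughly 13 GB, so the aggregation step is likely to matter more than full.matrix=FALSE on a machine with 4 or 8 GB of RAM.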

Hope this helps.