PrefixSpan sequence extraction misunderstanding

Problem description

I have a list of tuples of size three that represent windowed sequences. What I need is to use PySpark to get the third element of a tuple, given its first two.

So I need it to build sequences of three elements based on their frequency.

This is what I am doing:

from pyspark.mllib.fpm import PrefixSpan

# `spark` is assumed to be an existing SparkSession.
data = [[['a','b','c'],['b','c','d'],['c','d','e'],['d','e','f'],['e','f','g'],['f','g','h'],['a','b','c'],['d','e','f'],['a','b','c'],['b','c','d'],['f','g','h'],['d','e','f'],['b','c','d']]]
rdd = spark.sparkContext.parallelize(data, 2)
rdd.cache()
model = PrefixSpan.train(rdd, 0.2, 3)

print(sorted(model.freqSequences().take(100)))

I would expect the sequences and their frequencies to follow alphabetical order, but they don't.

I get sequences like the following:

FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)
FreqSequence(sequence=[[u'g'], [u'c'], [u'c']], freq=1)

which do not appear among the ones I defined. Obviously there is a problem in the way I have structured my features, or I am missing something about the purpose and functionality of this algorithm.

Thanks!

Recommended answer

First, let's look at your input:

rdd.count()

1

As you can see, you created a dataset with only one sequence. It can be described as:

<(abc)(bcd)(cde)(def)(efg)(fgh)(abc)(def)(abc)(bcd)(fgh)(def)(bcd)>
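The same thing can be confirmed without Spark by inspecting the nesting of the original Python list. This is only a plain-Python sketch; `data` is copied from the question, and the comments describe what `parallelize` would see under the assumption that each top-level element becomes one RDD record.

```python
# The question's data: note the extra outer pair of brackets.
data = [[['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['d', 'e', 'f'],
         ['e', 'f', 'g'], ['f', 'g', 'h'], ['a', 'b', 'c'], ['d', 'e', 'f'],
         ['a', 'b', 'c'], ['b', 'c', 'd'], ['f', 'g', 'h'], ['d', 'e', 'f'],
         ['b', 'c', 'd']]]

num_sequences = len(data)            # top-level elements, i.e. sequences
num_itemsets = len(data[0])          # itemsets inside the only sequence
first_itemset = data[0][0]           # three items treated as co-occurring

print(num_sequences, num_itemsets, first_itemset)
```

Each three-letter window therefore ends up as one itemset inside a single long sequence, rather than as a sequence of its own.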

So the patterns you get are indeed correct given the input. For example

FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)

corresponds to:

...(abc)(def)(abc)...
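The matching rule behind this can be sketched in plain Python. The helper `is_subsequence` is a hypothetical name, not part of PySpark; it assumes the usual sequential-pattern definition, in which a pattern occurs in a sequence if its itemsets can be matched, in order, to supersets among the sequence's itemsets.

```python
def is_subsequence(pattern, sequence):
    """Return True if the itemsets of `pattern` match, in order,
    itemsets of `sequence` that contain them (greedy left-to-right scan)."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and set(pattern[pos]) <= set(itemset):
            pos += 1
    return pos == len(pattern)


# The single sequence from the question, one string per itemset.
single_sequence = [list(w) for w in
                   ['abc', 'bcd', 'cde', 'def', 'efg', 'fgh', 'abc',
                    'def', 'abc', 'bcd', 'fgh', 'def', 'bcd']]

found_cdb = is_subsequence([['c'], ['d'], ['b']], single_sequence)
found_gcc = is_subsequence([['g'], ['c'], ['c']], single_sequence)
found_hhh = is_subsequence([['h'], ['h'], ['h']], single_sequence)

print(found_cdb, found_gcc, found_hhh)
```

Both patterns reported in the question are contained in the one long sequence, which is why they show up with `freq=1`; a pattern such as three consecutive `h` items is not, because `h` occurs in only two itemsets.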

If each element of the dataset is meant to represent an individual sequence, the data could have the following shape:

rdd = sc.parallelize([
    [['a'], ['b'], ['c']], [['b'], ['c'], ['d']], [['c'], ['d'], ['e']],
    [['d'], ['e'], ['f']], [['e'], ['f'], ['g']], [['f'], ['g'], ['h']],
    [['a'], ['b'], ['c']], [['d'], ['e'], ['f']], [['a'], ['b'], ['c']],
    [['b'], ['c'], ['d']], [['f'], ['g'], ['h']], [['d'], ['e'], ['f']],
    [['b'], ['c'], ['d']]
])

rdd.count()

13

rdd.first()

[['a'], ['b'], ['c']]

where:

  • Each element is a list of lists.
  • Each inner list is an itemset, i.e. the items observed together at a given position.
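Rather than retyping the data, the reshaping can be done with a nested list comprehension. This is a minimal sketch in plain Python; `windows` and `sequences` are illustrative names, and `windows` is the inner list from the question.

```python
# The 13 windows from the question, without the extra outer list.
windows = [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['d', 'e', 'f'],
           ['e', 'f', 'g'], ['f', 'g', 'h'], ['a', 'b', 'c'], ['d', 'e', 'f'],
           ['a', 'b', 'c'], ['b', 'c', 'd'], ['f', 'g', 'h'], ['d', 'e', 'f'],
           ['b', 'c', 'd']]

# Wrap every item in its own single-element itemset, so each window
# becomes an ordered sequence of three itemsets.
sequences = [[[item] for item in window] for window in windows]

print(len(sequences))
print(sequences[0])
```

Passing `sequences` to `parallelize` should then yield one record per window.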

With the data structured like this:

model = PrefixSpan.train(rdd, 0.2, 3)
model.freqSequences().top(5, key=lambda x: len(x.sequence))

[FreqSequence(sequence=[['d'], ['e'], ['f']], freq=3),
 FreqSequence(sequence=[['b'], ['c'], ['d']], freq=3),
 FreqSequence(sequence=[['a'], ['b'], ['c']], freq=3),
 FreqSequence(sequence=[['f'], ['g']], freq=3),
 FreqSequence(sequence=[['d'], ['f']], freq=3)]

model.freqSequences().top(5, key=lambda x: x.freq)

[FreqSequence(sequence=[['d']], freq=7),
 FreqSequence(sequence=[['c']], freq=7),
 FreqSequence(sequence=[['f']], freq=6),
 FreqSequence(sequence=[['b']], freq=6),
 FreqSequence(sequence=[['b'], ['c']], freq=6)]
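These frequencies can be cross-checked by counting, in plain Python, how many of the 13 sequences contain each pattern. The sketch below is not how Spark computes them; `contains` and `support` are hypothetical helpers, and the minimum count assumes the threshold is the relative support (0.2) times the number of sequences, rounded up.

```python
import math


def contains(sequence, pattern):
    """True if the itemsets of `pattern` match, in order, itemsets of `sequence`."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and set(pattern[pos]) <= set(itemset):
            pos += 1
    return pos == len(pattern)


def support(pattern, sequences):
    """Number of sequences that contain `pattern` at least once."""
    return sum(contains(seq, pattern) for seq in sequences)


windows = ['abc', 'bcd', 'cde', 'def', 'efg', 'fgh', 'abc',
           'def', 'abc', 'bcd', 'fgh', 'def', 'bcd']
sequences = [[[item] for item in w] for w in windows]

# Assumed threshold: ceil(0.2 * 13) = 3 sequences.
min_count = math.ceil(0.2 * len(sequences))

for pattern in ([['d']], [['b'], ['c']], [['a'], ['b'], ['c']], [['d'], ['f']]):
    count = support(pattern, sequences)
    print(pattern, count, count >= min_count)
```

Note that `[['d'], ['f']]` counts as frequent even though `d` and `f` are never adjacent: a pattern only has to preserve order, not contiguity.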
