Match index from pyspark dataframe in pandas

Question

I have the following pyspark dataframe (testDF = ldamodel.describeTopics().select("termIndices").toPandas()):

+-----+---------------+--------------------+
|topic|    termIndices|         termWeights|
+-----+---------------+--------------------+
|    0|    [6, 118, 5]|[0.01205522104545...|
|    1|   [0, 55, 100]|[0.00125521761966...|
+-----+---------------+--------------------+

I have the following list of words (vocablist):

['one',
 'peopl',
 'govern',
 'think',
 'econom',
 'rate',
 'tax',
 'polici',
 'year',
 'like',
........]

I am trying to match the entries of vocablist to the termIndices and to the corresponding termWeights.

So far I have the following:

# loop over the columns of testDF, then over each list of term indices,
# looking up the matching word in vocablist
for i in testDF.items():
    for j in i[1]:
        for m in j:
            t = vocablist[m], m
            print(t)

which results in:

('tax', 6)
('insur', 118)
('rate', 5)
('peopl', 1)
('health', 84)
('incom', 38)
('think', 3)
('one', 0)
('social', 162)
.......

But I would like something like:

('tax', 6, 0.012055221045453202)
('insur', 118, 0.001255217619666775)
('rate', 5, 0.0032220995010401187)

('peopl', 1, 0.008342115226031033)
('health', 84, 0.0008332053105123403)
('incom', 38, ......)

Any help will be appreciated.

Recommended answer

I would recommend you spread those lists in the termIndices and termWeights columns downward, so that each row holds a single index and its weight. Once you've done that, you can map the indices to their term names while keeping the term weights aligned with each term index. The following is an illustration:

import pandas as pd

# example dataframe with the same shape as ldamodel.describeTopics().toPandas()
df = pd.DataFrame(data={'topic': [0, 1],
                        'termIndices': [[6, 118, 5],
                                        [0, 55, 100]],
                        'termWeights': [[0.012055221045453202, 0.012055221045453202, 0.012055221045453202],
                                        [0.00125521761966, 0.00125521761966, 0.00125521761966]]})

# spread each list element onto its own row, keeping index and weight aligned
dff = df.apply(lambda s: s.apply(pd.Series).stack().reset_index(drop=True, level=1))

# stand-in vocabulary (ten words repeated); use the real vocablist here
vocablist = ['one', 'peopl', 'govern', 'think', 'econom', 'rate', 'tax', 'polici', 'year', 'like'] * 50

# look up the term name for each term index
dff['termNames'] = dff.termIndices.map(vocablist.__getitem__)

# assemble [termName, termIndex, termWeight] rows
dff[['termNames', 'termIndices', 'termWeights']].values.tolist()
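
For reference, the placeholder vocablist above just repeats ten words, so the names it produces will not match the real vocabulary; with that caveat, the last line returns one [termName, termIndex, termWeight] row per term, roughly:

[['tax', 6, 0.012055221045453202],
 ['year', 118, 0.012055221045453202],
 ['rate', 5, 0.012055221045453202],
 ['one', 0, 0.00125521761966],
 ['rate', 55, 0.00125521761966],
 ['one', 100, 0.00125521761966]]

With the real vocablist and weights, this is the (term, index, weight) layout asked for in the question.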

I hope this helps.
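
As a side note (not part of the original answer): newer pandas versions can spread the list columns directly with DataFrame.explode. A minimal sketch, assuming pandas 1.3 or newer (needed for multi-column explode) and that testDF is rebuilt from the question's ldamodel to carry both list columns:

# rebuild testDF with both list columns, then explode them together
testDF = ldamodel.describeTopics().select("termIndices", "termWeights").toPandas()

dff = testDF.explode(["termIndices", "termWeights"], ignore_index=True)  # one row per term
dff["termNames"] = dff["termIndices"].map(lambda i: vocablist[i])        # index -> word

# tuples such as ('tax', 6, 0.012055221045453202), matching the desired output
triples = list(zip(dff["termNames"], dff["termIndices"], dff["termWeights"]))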
