Text mining with Python and pandas

Question

This may be a duplicate, but I had no luck finding it...

I am doing some text mining in Python with pandas. I have a DataFrame of words, with each word's Porter stem and some other statistics next to it. This means the DataFrame can contain similar words that share exactly the same Porter stem. I would like to aggregate these similar words into a new column and then drop the rows that are duplicates with respect to the Porter stem.

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(list))

What I would ideally like to have:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13
# banking     bank      [bank, banking]          12
# holding     hold      [hold, holding]          13
# bank        bank      [bank, banking]          12

After dropping the duplicates:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13

I tried the following join, but it got me no closer to my goal:

pda.join(pdm, on="Porter", how="left")

Thanks in advance for your help.

Edit: the code above has been modified.

Recommended answer

You can apply set instead of list to each group, so the duplicate words within a group are removed automatically:

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 
                              'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 
                              'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(set))
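
The grouped result still has to be attached to the original rows to reach the table asked for in the question. The following is a minimal sketch of one way to do that, not necessarily what the answerer had in mind. It assumes a sorted list is acceptable in place of a set (sets are unordered, so sorting makes the output reproducible), and it uses Series.map rather than join, since join would complain that both frames have a Word column unless a suffix is given. The names merged and result are illustrative.

```python
import pandas as pd

pda = pd.DataFrame({
    'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
    'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
    'SomeData': ['12', '13', '12', '13', '12'],
})

# One sorted list of distinct words per Porter stem, indexed by stem.
merged = pda.groupby('Porter')['Word'].apply(lambda words: sorted(set(words)))

# Look up each row's stem in that Series to fill the new column.
pda['Merged'] = pda['Porter'].map(merged)

# Keep only the first row for each stem.
result = pda.drop_duplicates(subset='Porter').reset_index(drop=True)
print(result)
```

drop_duplicates keeps the first row it sees for each stem, so the Word and SomeData values in the result come from that first occurrence.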
