Mapping between words and a group tuple to get frequency of words
Question
I have a dataframe that looks like the following
Utterance                          Frequency
Directions to Starbucks            1045
Show me directions to Starbucks    754
Give me directions to Starbucks    612
Navigate me to Starbucks           498
Display navigation to Starbucks    376
Direct me to Starbucks             201
Navigate to Starbucks              180
Here, there is some data that show utterances made by people, and how frequently these were said.
I.e., "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, etc.
I was able to get the desired output with the following:
df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
     )
print(df)
         Words  Frequency
0       Direct        201
1   Directions       1045
2      Display        376
3         Give        612
4     Navigate        678
5         Show        754
6    Starbucks       3666
7   directions       1366
8           me       2065
9   navigation        376
10          to       3666
However, I'm also trying to look for the following output:
print(df)
                       Words  Frequency
0                 Directions       2411
1  Give_Show_Direct_Navigate       2245
2                    Display        376
3                  Starbucks       3666
4                         me       2065
5                 navigation        376
6                         to       3666
I.e., I'm trying to figure out a way to combine certain words into a group and get the total frequency of that group. For example, if the speaker says "Seattles_Best" or "Tullys", then ideally I would add it to "Starbucks" and rename the group "coffee_shop" or something like that.
Thanks!
Recommended answer
Here is one way, sticking with collections.Counter from your previous question.
You can add any number of tuples to lst to append additional results for combinations of your choice.
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns = ['Utterance', 'Frequency'])

c = Counter()
for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
        .rename(columns={0: 'Count'})\
        .sort_values('Count', ascending=False)

def add_combinations(df, lst):
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

lst = [('Give', 'Show', 'Navigate', 'Direct')]
res = add_combinations(res, lst)
Result
                           Count
to                          3666
Starbucks                   3666
Give_Show_Navigate_Direct   2245
me                          2065
directions                  1366
Directions                  1045
Show                         754
Navigate                     678
Give                         612
Display                      376
navigation                   376
Direct                       201
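Note that the result above keeps the individual words (Show, Give, and so on) alongside the combined row, whereas the output requested in the question replaces them with the group. One possible way to get that shape is sketched below. It is not the Counter approach from this answer but an alternative built on a word-to-label lookup plus groupby. The names groups, word_to_label and grouped_counts are made up for illustration, and the "coffee_shop" group only mirrors the hypothetical example from the question ("Seattles_Best" and "Tullys" do not occur in the sample data). It assumes pandas 0.25 or later for explode.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Directions to Starbucks", 1045],
        ["Show me directions to Starbucks", 754],
        ["Give me directions to Starbucks", 612],
        ["Navigate me to Starbucks", 498],
        ["Display navigation to Starbucks", 376],
        ["Direct me to Starbucks", 201],
        ["Navigate to Starbucks", 180],
    ],
    columns=["Utterance", "Frequency"],
)

# Hypothetical grouping: each label lists the words it should absorb.
groups = {
    "Give_Show_Direct_Navigate": ("Give", "Show", "Direct", "Navigate"),
    "Directions": ("Directions", "directions"),
    "coffee_shop": ("Starbucks", "Seattles_Best", "Tullys"),
}

# Invert the grouping into a word -> label lookup.
word_to_label = {
    word: label for label, words in groups.items() for word in words
}


def grouped_counts(df, word_to_label):
    """Sum Frequency per word, after replacing grouped words by their label."""
    words = (
        df.assign(Words=df["Utterance"].str.split())
        .explode("Words")           # one row per (utterance, word)
        .reset_index(drop=True)     # drop the duplicated index from explode
    )
    # Words without an entry in the lookup keep their own name.
    words["Words"] = words["Words"].map(lambda w: word_to_label.get(w, w))
    return words.groupby("Words")["Frequency"].sum().sort_values(ascending=False)


grouped = grouped_counts(df, word_to_label)
print(grouped)
```

With the sample data this gives 2411 for Directions and 2245 for Give_Show_Direct_Navigate, matching the requested output, with Starbucks reported under coffee_shop. One caveat: an utterance containing two words from the same group would contribute its frequency twice to that group, which does not happen in this sample.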