pandas: calculate overlapping words between rows only if values in another column match
Question
I have a dataframe that looks like the following, but with many rows:
import pandas as pd

data = {
    'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'i would like a new taxi'],
    'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['new', 'taxi']],
}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])
I compute the words shared by two documents (the intersection step of Jaccard similarity) using the code below (not my own solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
I then modified the code given by @Amit Amola to compare the overlapping words between every possible pair of rows, and built a dataframe from the results:
from itertools import combinations

overlapping_word_list = []
for i, j in combinations(range(len(data_new)), 2):
    overlap = lexical_overlap(data_new.iloc[i, 1], data_new.iloc[j, 1])
    overlapping_word_list.append(
        f"the shared keywords between {data_new.iloc[i, 0]} and {data_new.iloc[j, 0]} sentences are: {overlap}"
    )

# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])
Since my dataset is huge, running this code to compare all rows takes forever. Instead, I would like to compare only the sentences that share the same intent and skip pairs whose intents differ. I am not sure how to do that.
Recommended answer
IIUC, you just need to iterate over the unique values in the intent column and then use loc to grab only the rows with that intent. If an intent has more than two rows, you will still need combinations to get the unique pairs within that intent.
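from itertools import combinations

for intent in df.intent.unique():
    # select only the sentences that belong to this intent
    rows = df.loc[df.intent == intent, "Sent"].to_list()
    # every unique pair of sentences within the intent
    for x, y in combinations(rows, 2):
        # note: x and y are raw strings here, so set() inside lexical_overlap
        # splits them into characters; pass x.split() and y.split() to compare words
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

# Output reported in the original answer (the numeric values imply a
# lexical_overlap that returns a percentage score rather than the set above):
# Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
# Overlap for (i need a cab) and (i would like a new taxi) is 40.0
# Overlap for (call me at 6) and (she called me) is 54.54545454545454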
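The same idea can also be written with groupby so that the results land in a dataframe, which is what the question was building. The following is only a sketch under a few assumptions: it compares the key_words lists rather than the raw sentences, and the names records, overlap_df, sent_1 and sent_2 are made up for illustration and do not come from the original posts.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "intent": ["order_food", "order_food", "order_taxi", "order_call", "order_call", "order_taxi"],
    "Sent": ["i need hamburger", "she wants sushi", "i need a cab",
             "call me at 6", "she called me", "i would like a new taxi"],
    "key_words": [["need", "hamburger"], ["want", "sushi"], ["need", "cab"],
                  ["call", "6"], ["call"], ["new", "taxi"]],
})


def lexical_overlap(doc1, doc2):
    """Return the set of tokens shared by two token lists."""
    return set(doc1) & set(doc2)


records = []  # hypothetical name: one dict per compared pair
# sort=False keeps the intents in order of first appearance
for intent, group in df.groupby("intent", sort=False):
    # pair up row positions inside this intent only
    for i, j in combinations(range(len(group)), 2):
        first, second = group.iloc[i], group.iloc[j]
        records.append({
            "intent": intent,
            "sent_1": first["Sent"],
            "sent_2": second["Sent"],
            # sorted list so the column prints deterministically
            "overlap": sorted(lexical_overlap(first["key_words"], second["key_words"])),
        })

overlap_df = pd.DataFrame(records, columns=["intent", "sent_1", "sent_2", "overlap"])
print(overlap_df)
```

Restricting the pairs to each intent group means the number of comparisons is the sum of the per-group pair counts instead of n(n-1)/2 over the whole dataframe, which is where the speed-up on a large dataset would come from.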