pandas: calculate overlapping words between rows only if values in another column match

Problem description

I have a dataframe that looks like the following, but with many more rows:

import pandas as pd

data = {'intent':  ['order_food', 'order_food','order_taxi','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['new','taxi']]}

df = pd.DataFrame (data, columns = ['intent','Sent','key_words'])

I have calculated the Jaccard similarity using the code below (not my solution):

def lexical_overlap(doc1, doc2): 
    words_doc1 = set(doc1) 
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)


    return intersection
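
Note that the function above returns only the set of shared items, not a score. For reference, a minimal sketch of a Jaccard score built on the same idea might look like the following. The name `jaccard_similarity` and the choice to return 0.0 for two empty inputs are assumptions made for illustration, not part of the original code.

```python
def jaccard_similarity(doc1, doc2):
    """Return |A ∩ B| / |A ∪ B| for two iterables of tokens (hypothetical helper)."""
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    union = words_doc1 | words_doc2
    if not union:
        # Assumption: two empty token lists are treated as having no similarity.
        return 0.0
    return len(words_doc1 & words_doc2) / len(union)


# Token lists taken from the key_words column of the example dataframe.
print(jaccard_similarity(['need', 'hamburger'], ['need', 'cab']))  # 1 shared / 3 distinct
```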

I then modified the code given by @Amit Amola to compare the overlapping words between every possible pair of rows, and created a dataframe out of the results:

from itertools import combinations

overlapping_word_list = []

# data_new: the dataframe being compared (its definition is not shown in the question)
for i, j in combinations(range(len(data_new)), 2):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[i, 0]} and {data_new.iloc[j, 0]} sentences are: {lexical_overlap(data_new.iloc[i, 1], data_new.iloc[j, 1])}")

# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])

Since my dataset is huge, running this code to compare all rows takes forever. I would therefore like to compare only the sentences that have the same intent and skip sentences with different intents. I am not sure how to proceed.

Recommended answer

IIUC, you just need to iterate over the unique values in the intent column and then use loc to grab only the rows that correspond to each one. If an intent has more than two rows, you will still need combinations to get the unique pairs within that intent.

from itertools import combinations

for intent in df.intent.unique():
    # select the Sent column for the rows that share this intent
    rows = df.loc[df.intent == intent, "Sent"].to_list()
    # compare each unique pair of sentences within the intent
    for x, y in combinations(rows, 2):
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

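#  Output as reported in the answer. The numeric values suggest it was produced with a
#  lexical_overlap variant that returns a percentage score rather than the set intersection
#  defined in the question.
#  Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
#  Overlap for (i need a cab) and (i would like a new taxi) is 40.0
#  Overlap for (call me at 6) and (she called me) is 54.54545454545454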
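
The same idea can also be sketched with `groupby`, collecting the results into a dataframe as the question does. This is only one possible arrangement under stated assumptions: it compares the `key_words` lists instead of the raw `Sent` strings, and the names `records`, `overlap_df`, `sent_a`, `sent_b` and `overlap` are made up for illustration.

```python
from itertools import combinations

import pandas as pd

data = {
    'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab',
             'call me at 6', 'she called me', 'i would like a new taxi'],
    'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'],
                  ['call', '6'], ['call'], ['new', 'taxi']],
}
df = pd.DataFrame(data)


def lexical_overlap(doc1, doc2):
    """Return the set of tokens shared by two token lists."""
    return set(doc1) & set(doc2)


records = []
# sort=False keeps the intents in order of first appearance.
for intent, group in df.groupby('intent', sort=False):
    # Pair up rows only within the same intent.
    for (_, row_a), (_, row_b) in combinations(group.iterrows(), 2):
        shared = lexical_overlap(row_a['key_words'], row_b['key_words'])
        records.append({
            'intent': intent,
            'sent_a': row_a['Sent'],
            'sent_b': row_b['Sent'],
            # Stored as a sorted list so the column holds plain, ordered values.
            'overlap': sorted(shared),
        })

overlap_df = pd.DataFrame(records)
print(overlap_df)
```

Passing the `key_words` lists matters here: calling `set()` on a sentence string yields its individual characters, so comparing the `Sent` strings directly with this function would report shared characters rather than shared words.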
