Check if words in one dataframe appear in another (Python 3, pandas)


Question

Problem: I have two data frames and want to remove any duplicates/partial duplicates between them.

 DF1                 DF2

 **Phrases**         **Phrases**  
 Little Red          Little Red Corvette
 Grow Your           Grow Your Beans
 James Bond          James Dean
 Tom Brady          

I want to remove "Little Red" and "Grow Your" phrases from DF1 and then combine the two DF so that the final product looks like:

 DF3
 Little Red Corvette
 Grow Your Beans
 James Bond
 James Dean
 Tom Brady

Just a note, I only want to remove the phrases from DF1 if ALL the words appear in a phrase in DF2 (e.g. Little Red Vs. Little Red Corvette). I do not want to remove "James Bond" from DF1 if "James Dean" appears in DF2.

Recommended answer

I would first do an outer merge on the dataframes. I am not sure whether DF1 refers to the column name or the dataframe variable name in your post, but for simplicity I assume you have two dataframes, each with a column of strings:

df1 
#        words
#0  little red
#1   grow your
#2  james bond
#3  tom brandy

df2 
#                 words
#0  little red corvette
#1      grow your beans
#2           james dean
#3               little
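
For anyone who wants to run the steps, the two frames above could be built roughly like this. It is only a sketch: the variable names and the column name `words` simply mirror the listing above.

```python
import pandas

# Assumed construction of the example frames; the data is copied
# from the listing above, nothing else is implied about the real input.
df1 = pandas.DataFrame(
    {"words": ["little red", "grow your", "james bond", "tom brandy"]}
)
df2 = pandas.DataFrame(
    {"words": ["little red corvette", "grow your beans", "james dean", "little"]}
)

print(df1)
print(df2)
```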

Next, make a new dataframe that merges these two (use an outer merge). This takes care of the exact duplicates:

import pandas

df3 = pandas.merge(df1, df2, on='words', how='outer')
#                 words
#0           little red
#1            grow your
#2           james bond
#3           tom brandy
#4  little red corvette
#5      grow your beans
#6           james dean
#7               little
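
The example data has no phrase that occurs in both frames, so the de-duplicating effect of the outer merge is not visible in the output above. A small sketch with made-up frames (`left` and `right` are hypothetical names) shows it:

```python
import pandas

# Hypothetical frames: "c d" occurs in both of them.
left = pandas.DataFrame({"words": ["a b", "c d"]})
right = pandas.DataFrame({"words": ["c d", "e f"]})

merged = pandas.merge(left, right, on="words", how="outer")
print(merged)  # three rows, "c d" is listed once

# The row order of an outer merge may differ between pandas versions,
# so compare the values in sorted order rather than by position.
print(sorted(merged["words"]))
```

This only collapses phrases that are unique within each frame; a phrase repeated inside one frame would still appear more than once.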

Next you want to use the Series.str.get_dummies method:

dummies = df3.words.str.get_dummies(sep='')
#   grow your  grow your beans  james bond  james dean  little  little red  \
#0          0                0           0           0       1           1   
#1          1                0           0           0       0           0   
#2          0                0           1           0       0           0   
#3          0                0           0           0       0           0   
#4          0                0           0           0       1           1   
#5          1                1           0           0       0           0   
#6          0                0           0           1       0           0   
#7          0                0           0           0       1           0   

#   little red corvette  tom brandy  
#0                    0           0  
#1                    0           0  
#2                    0           0  
#3                    0           1  
#4                    1           0  
#5                    0           0  
#6                    0           0  
#7                    0           0 
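
The table above is effectively a sub-string containment matrix: the cell in row i, column w is 1 when phrase i contains w. It depends on how `get_dummies` treats an empty separator, which may not be the same in every pandas version. As a hedged alternative, the same matrix can be built explicitly with `Series.str.contains`; the name `contains_matrix` is made up for this sketch.

```python
import pandas

# The merged phrases from the example above.
words = pandas.Series([
    "little red", "grow your", "james bond", "tom brandy",
    "little red corvette", "grow your beans", "james dean", "little",
])

# One column per phrase w; entry i is 1 if words[i] contains w as a
# plain sub-string (regex=False disables pattern matching).
contains_matrix = pandas.DataFrame(
    {w: words.str.contains(w, regex=False).astype(int) for w in sorted(words)}
)

# Every phrase contains itself, so each column sums to at least 1.
print(contains_matrix.sum())
```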

Notice that each column counts how many entries in the words column contain that column's string. If a string is not contained in any other entry (whether it stands alone or is itself the super-string of one or more sub-strings), its column sums to 1; otherwise it sums to a number > 1. Now you can use this dummies dataframe to find the indices corresponding to the sub-strings and remove them:

from numpy import where

# positional index of every phrase that is contained in some other phrase
bad_rows = [where(df3.words == word)[0][0]
            for word in list(dummies)
            if dummies[word].sum() > 1]  # only substrings will sum to greater than 1
#[1, 7, 0]

df3.drop( df3.index[bad_rows] , inplace=True)
#                 words
#2           james bond
#3           tom brandy
#4  little red corvette
#5      grow your beans
#6           james dean

Note: this also takes care of the case where a super-string has more than one sub-string. For instance, 'little' and 'little red' are both sub-strings of the super-string 'little red corvette', so I assume you would keep only the super-string.
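
The question states its rule in terms of words rather than sub-strings, and only wants phrases removed from DF1. A minimal sketch of that reading follows. It is not the method used above, and it assumes that words are separated by whitespace and that a DF1 phrase is dropped when all of its words occur together in a single DF2 phrase. The helper `covered_by_df2` is hypothetical.

```python
import pandas

# Lower-cased versions of the frames from the question.
df1 = pandas.DataFrame(
    {"words": ["little red", "grow your", "james bond", "tom brandy"]}
)
df2 = pandas.DataFrame(
    {"words": ["little red corvette", "grow your beans", "james dean"]}
)

# Word set of every phrase in df2, computed once.
df2_word_sets = [set(phrase.split()) for phrase in df2["words"]]


def covered_by_df2(phrase):
    """Return True if all words of `phrase` occur in one df2 phrase."""
    phrase_words = set(phrase.split())
    return any(phrase_words <= other for other in df2_word_sets)


# Keep only the df1 phrases that are not covered, then append df2.
keep = ~df1["words"].apply(covered_by_df2).astype(bool)
df3 = pandas.concat([df1[keep], df2], ignore_index=True)
print(df3)
```

Unlike the sub-string test, this version ignores word order, so 'red little' would also count as covered by 'little red corvette'.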
