How to merge by a column of collection using Python Pandas?

Problem description

I have two lists of Stack Overflow questions, group A and group B. Each has two columns, Id and Tag. For example:

|Id        |Tag
| -------- | --------------------------------------------
|2         |c#,winforms,type-conversion,decimal,opacity

For each question in group A, I need to find all questions in group B that share at least one tag with it, regardless of the order of the tags. For example, these questions should all match the question above:

|Id        |Tag
|----------|---------------------------
|3         |c#
|4         |winforms,type-conversion
|5         |winforms,c#

My first thought was to convert the Tag column into sets and merge with Pandas, since sets ignore position. However, Pandas does not seem to allow a column of sets as the merge key. So I am currently searching group B with a for loop, which is extremely slow because group B has 13 million observations.
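As a point of reference, the per-pair check that such a loop performs can be sketched with plain Python sets. The helper name `tag_overlap` and the extra `python` tag string are made up for illustration; this is only an assumed version of the loop body, not the asker's actual code.

```python
def tag_overlap(tags_a, tags_b):
    """Return the set of tags shared by two comma-separated tag strings."""
    return set(tags_a.split(",")) & set(tags_b.split(","))

question_a = "c#,winforms,type-conversion,decimal,opacity"

# Compare the group A question against a few group B tag strings.
for tags_b in ["c#", "winforms,type-conversion", "winforms,c#", "python"]:
    shared = tag_overlap(question_a, tags_b)
    print(tags_b, "->", len(shared), "shared tag(s)")
```

Each comparison is cheap, but repeating it for every A/B pair is what makes the loop slow on millions of rows.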

My questions are: 1. Is there another way in Python to merge on a column of collections that also reports the number of overlapping tags? 2. How can I make the for-loop search more efficient?

Recommended answer

This can be achieved with df.merge and df.groupby.

This is the setup I'm working with:

import pandas as pd

df1 = pd.DataFrame({'Id': [2], 'Tag': [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})

   Id                                                Tag
0   2  [c#, winforms, type-conversion, decimal, opacity]

df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})  

   Id                          Tag
0   3                         [c#]
1   4  [winforms, type-conversion]
2   5               [winforms, c#]

Let's flatten the Tag column in both data frames, so that each row holds one Id and a single tag. A helper like this does it:

In [2331]: from itertools import chain
      ...: import numpy as np

In [2332]: def flatten(df):
      ...:     return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
      ...:                          "Tag": list(chain.from_iterable(df.Tag))})
      ...: 

In [2333]: df1 = flatten(df1)

In [2334]: df2 = flatten(df2)

In [2335]: df1.head()
Out[2335]: 
   Id              Tag
0   2               c#
1   2         winforms
2   2  type-conversion
3   2          decimal
4   2          opacity

df2 is flattened in the same way.
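If the Tag column holds comma-separated strings, as in the question, rather than lists, a similar long format can probably be reached with `str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later). This is a sketch of an alternative to the `flatten` helper, not part of the original answer; the frame below reuses the question's group B example.

```python
import pandas as pd

# Group B as it appears in the question: tags are comma-separated strings.
df = pd.DataFrame({
    "Id": [3, 4, 5],
    "Tag": ["c#", "winforms,type-conversion", "winforms,c#"],
})

# Split each string into a list, then give every list element its own row.
# reset_index(drop=True) replaces the repeated index with a clean 0..n-1 one.
flat = (
    df.assign(Tag=df["Tag"].str.split(","))
      .explode("Tag")
      .reset_index(drop=True)
)
print(flat)
```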

Now for the magic. We'll merge on the Tag column, then group by the two joined Id columns to count the overlapping tags.

In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]: 
   Id_x  Id_y  Tag
0     2     3    1
1     2     4    2
2     2     5    2

The output shows each pair of question Ids along with the number of overlapping tags. Pairs with no overlap never appear, because the default inner merge only keeps tags that occur in both data frames.

df.count tallies the overlapping tags in each group. Since groupby makes the grouped columns (Id_x and Id_y) the index, df.reset_index turns them back into ordinary columns, which just tidies the output.
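Put together, the whole pipeline can be sketched on already-flattened frames as below. The names `a`, `b`, `pairs`, `overlap` and `n_shared`, the `_a`/`_b` suffixes, and the extra question 6 tagged `python` are assumptions added for illustration; `size()` is swapped in for `count()` here simply because it yields one named column.

```python
import pandas as pd

# Flattened inputs: one row per (Id, Tag) pair.
a = pd.DataFrame({
    "Id": [2, 2, 2, 2, 2],
    "Tag": ["c#", "winforms", "type-conversion", "decimal", "opacity"],
})
b = pd.DataFrame({
    "Id": [3, 4, 4, 5, 5, 6],
    "Tag": ["c#", "winforms", "type-conversion", "winforms", "c#", "python"],
})

# The inner merge keeps only tags present in both frames: one row per shared tag.
pairs = a.merge(b, on="Tag", suffixes=("_a", "_b"))

# Count the rows per (Id_a, Id_b) pair, i.e. the number of shared tags.
overlap = (
    pairs.groupby(["Id_a", "Id_b"])
         .size()
         .reset_index(name="n_shared")
)
print(overlap)
```

Question 6 shares no tag with question 2, so it produces no row in `pairs` and is absent from `overlap`.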

To see the matching tags themselves, modify the above slightly:

In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]: 
   Id_x  Id_y                          Tag
0     2     3                         [c#]
1     2     4  [winforms, type-conversion]
2     2     5               [c#, winforms]

To keep only the pairs with more than one overlapping tag, chain a df.query call to the first expression:

In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]: 
   Id_x  Id_y  Tag
1     2     4    2
2     2     5    2 
