Python Pandas-如果某些值为空,则合并行 [英] Python Pandas - merge rows if some values are blank

查看:465
本文介绍了Python Pandas-如果某些值为空,则合并行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个看起来像这样的数据集:

I have a dataset that looks a little like this:

ID   Name            Address      Zip    Cost
1    Bob the Builder 123 Main St  12345  
1    Bob the Builder                     $99,999.99
2    Bob the Builder 123 Sub St   54321  $74,483.01
3    Nigerian Prince Area 51      33333  $999,999.99
3    Pinhead Larry   Las Vegas    31333  $11.00
4    Fox Mulder      Area 51             $0.99

丢失数据是可以的,除非很明显可以合并它们.我的意思是代替上面的数据集,我想合并ID和Name相同的行,而其他功能可以填补彼此的空白.例如,上面的数据集将变为:

where missing data is okay, unless it's obvious that they can be merged. What I mean by that is instead of the dataset above, I want to merge the rows where both the ID and Name are the same, and the other features can fill in each other's blanks. For example, the dataset above would become:

ID   Name            Address      Zip    Cost
1    Bob the Builder 123 Main St  12345  $99,999.99
2    Bob the Builder 123 Sub St   54321  $74,483.01
3    Nigerian Prince Area 51      33333  $999,999.99
3    Pinhead Larry   Las Vegas    31333  $11.00
4    Fox Mulder      Area 51             $0.99

我曾经考虑过使用df.groupby(["ID", "Name"]),然后将这些字符串连接起来,因为缺少的值是空字符串,但是运气不好.

I've thought about using df.groupby(["ID", "Name"]) and then concatenating the strings since the missing values are empty strings, but got no luck with it.

数据已从网站上刮下来,因此他们不得不经过大量清理才能最终在这里出现.我想不出一种优雅的方法来解决这个问题!

The data has been scraped off websites, so they've had to go through a lot of cleaning to end up here. I can't think of an elegant way of figuring this out!

推荐答案

我将描述一种算法:

  1. 放置所有填充了所有字段的行.我们不需要碰这些.
  2. 像输入一样创建一个布尔型DataFrame,其中空字段为False,填充字段为True.这是df.notnull().
  3. 对于df.Name.unique()中的每个名称:
  1. Put aside all the rows where all fields are populated. We don't need to touch these.
  2. Create a boolean DataFrame like the input where empty fields are False and populated fields are True. This is df.notnull().
  3. For each name in df.Name.unique():
  1. df[df.Name == name]作为工作集.
  2. 将每对(或元组)布尔行求和,得出布尔矢量,其宽度与输入列的宽度相同,但始终填充的列除外.在示例中,这表示[True, True, False][False, False, True],所以总和为[1, 1, 1].
  3. 如果各处的总和等于1,则可以合并该对(或元组)行.
  1. Take df[df.Name == name] as the working set.
  2. Sum each pair (or tuple) of boolean rows, resulting in a boolean vector the same width as the input columns except those which are always populated. In the example this means [True, True, False] and [False, False, True], so the sum is [1, 1, 1].
  3. If the sum is equal to 1 everywhere, that pair (or tuple) of rows can be merged.

但是这里有很多可能的边缘情况,例如如果您有三行A,B,C,并且可以合并A + B或A + C,该怎么办.如果您可以在实施合并算法之前缩小数据中存在的约束,将会有所帮助.

But there are a ton of possible edge cases here, such as what to do if you have three rows A,B,C and you could merge either A+B or A+C. It will help if you can narrow down the constraints that exist in the data before implementing the merging algorithm.

这篇关于Python Pandas-如果某些值为空,则合并行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆