Pandas DataFrame-将具有相同索引的一列值组合到列表中 [英] Pandas DataFrame - Combining one column's values with same index into list

查看:1790
本文介绍了Pandas DataFrame-将具有相同索引的一列值组合到列表中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我这个问题已经有一段时间了,但无济于事.这几乎是此处至少还有一个问题的副本,但我不太明白从相关的在线答案中找出要怎么做.

I've been at this issue for awhile to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from related answers online.

我有一个看起来像这样的Pandas DataFrame(我们称它为df):

I have a Pandas DataFrame (we'll call it df) that looks something like:

Name    Value        Value2
'A'     '8.8.8.8'    'x'
'B'     '6.6.6.6'    'y'
'A'     '6.6.6.6'    'x'
'A'     '8.8.8.8'    'x'

其中Name是索引.我想将其转换为类似这样的内容:

Where Name is the index. I want to convert this to something like that looks like:

Name    Value                     Value2
'A'     ['8.8.8.8', '6.6.6.6']    'x'
'B'     ['6.6.6.6']               'y'

因此,基本上,每个与同一索引对应的Value都应组合成一个列表(或集合或元组),并将该列表作为对应索引的Value.而且,如图所示,在索引相似的行之间,Value2是相同的,因此最后应该保持相同.

So, basically, every Value that corresponds to the same index should be combined into a list (or a set, or a tuple) and that list made to be the Value for the corresponding index. And, as shown, Value2 is the same between like-indexed rows, so it should just stay the same in the end.

我已经(成功地)完成了所有工作,找出了如何使用以下方法将Value列中的每个元素放入列表中:

All I've done (successfully) is figure out how to make each element in the Value column into a list with:

df['Value'] = pd.Series([[val] for val in df['Value']])

在我在本文开头链接的问题中,将列与重复索引合并的推荐方法提供了使用df.groupby(df.index).sum()的解决方案.我知道除了df.index之外我还需要其他内容作为groupby的参数,因为Value列被视为特殊字符,而且我不确定用什么代替sum(),因为这与我的含义不完全相同.寻找.

In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum(). I know I need something besides df.index as an argument to groupby since the Value column is treated as special, and I'm not sure what to put in place of sum() since that's not quite what I'm looking for.

希望很清楚,我在寻找什么,让我知道我是否可以详细说明.我还尝试过自己遍历DataFrame,查找具有相同索引的行,将Values组合到列表中,并相应地更新df.在尝试使这种方法工作了一段时间之后,我想我会寻找一种更像熊猫一样的方式来解决此问题.

Hopefully it's clear what I'm looking for, let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values into a list and updating df accordingly. After trying to get this method to work for a bit I thought I'd look for a more Pandas-esque way of handling this problem.

作为对dermen的回答的后续,该解决方案行之有效. Values似乎确实正确地连接到一个列表中.我意识到的一件事是unique函数返回了Series,而不是DataFrame.另外,在实际设置中,我确实有比NameValueValue2更多的列.但是我认为我可以通过以下方法成功解决这两个问题:

As a follow up to dermen's answer, that solution kind of worked. The Values did seem to concatenate correctly into a list. One thing I realized was that the unique function returns a Series, as opposed to a DataFrame. Also, I do have more columns in the actual setup than just Name, Value, and Value2. But I think I was able to get around both of the issues successfully with the following:

gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)

第一行为列列表的groupby减去Value列提供了参数,第二行将unique返回的Series转换为与以下列相同的DataFrame df.

Where the first line gives an argument to groupby of the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.

但是我认为所有这些都准备就绪(除非有人看到了这个问题),几乎所有事情都按预期工作.不过,这里似乎确实有些偏离.当我尝试使用to_csv将其输出到文件时,顶部有重复的标头(但仅某些标头被重复了,据我所知,没有真正的模式).同样,Value列表被截断,这可能是一个更简单的问题. csv输出曲线如下:

But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something that's a bit off here, though. When I try to output this to a file with to_csv, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which as far as I can tell). Also, the Value lists are truncated, which is probably a simpler issue to fix. The csv output currenlty looks like:

Name    Value                   Value2    Name    Value2
'A'     ['8.8.8.8' '7.7.7.7'    'x'                     
'B'     ['6.6.6.6']             'y'

上面看起来很奇怪,但这恰恰是它在输出中的样子.请注意,与本文开头给出的示例相反,对于A,假定存在超过2个Values(以便我可以说明这一点).当我用实际数据执行此操作时,Value列表在前4个元素之后被截断.

The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values for A (so that I can illustrate this point). When I do this with the actual data, the Value lists get cut off after the first 4 elements.

推荐答案

我认为您正在使用pandas.Series.unique.首先,将'Name'索引设为一列

I think you are looking to use pandas.Series.unique. First, make the 'Name' index a column

df
#     Value2  Value
#Name              
#A         x    8.8
#B         y    6.6
#A         x    6.6
#A         x    8.8

df.reset_index(inplace=True)
#  Name Value2  Value
#0    A      x    8.8
#1    B      y    6.6
#2    A      x    6.6
#3    A      x    8.8

下次调用groupby并调用'Value'系列上的unique函数

Next call groupby and call the unique function on the 'Value' series

gb = df.groupby(('Name','Value2'))
result = gb['Value'].unique()
result.reset_index(inplace=True) #lastly, reset the index
#  Name Value2       Value
#0    A      x  [8.8, 6.6]
#1    B      y       [6.6]

最后,如果您想再次将'Name'作为索引,只需执行

Finally, if you want 'Name' as the index again, just do

result.set_index( 'Name', inplace=True)
#     Value2       Value
#Name                   
#A         x  [8.8, 6.6]
#B         y       [6.6]

更新

作为后续措施,请确保在重置索引后重新分配结果

UPDATE

As a follow up, make sure you re-assign result after resetting the index

result = gb['Value'].unique()
type(result)
#pandas.core.series.Series

result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame

另存为CSV(而不是TSV)

您不想在这里使用CSV,因为Value列条目中有逗号.相反,另存为TSV,您仍然使用相同的方法to_csv,只需更改sep arg:

saving as CSV (rather TSV)

You don't want to use CSV here because there are commas in the Value column entries. Rather, save as TSV, you still use the same method to_csv, just change the sep arg:

result.to_csv( 'result.txt', sep='\t')

如果我将result.txt作为TSV加载到EXCEL中,我会得到

If I load result.txt in EXCEL as a TSV, I get

这篇关于Pandas DataFrame-将具有相同索引的一列值组合到列表中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆