Pandas DataFrame-将具有相同索引的一列值组合到列表中 [英] Pandas DataFrame - Combining one column's values with same index into list
问题描述
我这个问题已经有一段时间了,但无济于事.这几乎是此处至少还有一个问题的副本,但我不太明白从相关的在线答案中找出要怎么做.
I've been at this issue for awhile to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from related answers online.
我有一个看起来像这样的Pandas DataFrame(我们称它为df
):
I have a Pandas DataFrame (we'll call it df
) that looks something like:
Name Value Value2
'A' '8.8.8.8' 'x'
'B' '6.6.6.6' 'y'
'A' '6.6.6.6' 'x'
'A' '8.8.8.8' 'x'
其中Name
是索引.我想将其转换为类似这样的内容:
Where Name
is the index. I want to convert this to something like that looks like:
Name Value Value2
'A' ['8.8.8.8', '6.6.6.6'] 'x'
'B' ['6.6.6.6'] 'y'
因此,基本上,每个与同一索引对应的Value
都应组合成一个列表(或集合或元组),并将该列表作为对应索引的Value
.而且,如图所示,在索引相似的行之间,Value2
是相同的,因此最后应该保持相同.
So, basically, every Value
that corresponds to the same index should be combined into a list (or a set, or a tuple) and that list made to be the Value
for the corresponding index. And, as shown, Value2
is the same between like-indexed rows, so it should just stay the same in the end.
我已经(成功地)完成了所有工作,找出了如何使用以下方法将Value
列中的每个元素放入列表中:
All I've done (successfully) is figure out how to make each element in the Value
column into a list with:
df['Value'] = pd.Series([[val] for val in df['Value']])
在我在本文开头链接的问题中,将列与重复索引合并的推荐方法提供了使用df.groupby(df.index).sum()
的解决方案.我知道除了df.index
之外我还需要其他内容作为groupby
的参数,因为Value
列被视为特殊字符,而且我不确定用什么代替sum()
,因为这与我的含义不完全相同.寻找.
In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum()
. I know I need something besides df.index
as an argument to groupby
since the Value
column is treated as special, and I'm not sure what to put in place of sum()
since that's not quite what I'm looking for.
希望很清楚,我在寻找什么,让我知道我是否可以详细说明.我还尝试过自己遍历DataFrame,查找具有相同索引的行,将Values
组合到列表中,并相应地更新df
.在尝试使这种方法工作了一段时间之后,我想我会寻找一种更像熊猫一样的方式来解决此问题.
Hopefully it's clear what I'm looking for, let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values
into a list and updating df
accordingly. After trying to get this method to work for a bit I thought I'd look for a more Pandas-esque way of handling this problem.
作为对dermen的回答的后续,该解决方案行之有效. Values
似乎确实正确地连接到一个列表中.我意识到的一件事是unique
函数返回了Series
,而不是DataFrame
.另外,在实际设置中,我确实有比Name
,Value
和Value2
更多的列.但是我认为我可以通过以下方法成功解决这两个问题:
As a follow up to dermen's answer, that solution kind of worked. The Values
did seem to concatenate correctly into a list. One thing I realized was that the unique
function returns a Series
, as opposed to a DataFrame
. Also, I do have more columns in the actual setup than just Name
, Value
, and Value2
. But I think I was able to get around both of the issues successfully with the following:
gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)
第一行为列列表的groupby
减去Value
列提供了参数,第二行将unique
返回的Series
转换为与以下列相同的DataFrame
df
.
Where the first line gives an argument to groupby
of the list of columns minus the Value
column, and the second line converts the Series
returned by unique
into a DataFrame
with the same columns as df
.
但是我认为所有这些都准备就绪(除非有人看到了这个问题),几乎所有事情都按预期工作.不过,这里似乎确实有些偏离.当我尝试使用to_csv
将其输出到文件时,顶部有重复的标头(但仅某些标头被重复了,据我所知,没有真正的模式).同样,Value
列表被截断,这可能是一个更简单的问题. csv
输出曲线如下:
But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something that's a bit off here, though. When I try to output this to a file with to_csv
, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which as far as I can tell). Also, the Value
lists are truncated, which is probably a simpler issue to fix. The csv
output currenlty looks like:
Name Value Value2 Name Value2
'A' ['8.8.8.8' '7.7.7.7' 'x'
'B' ['6.6.6.6'] 'y'
上面看起来很奇怪,但这恰恰是它在输出中的样子.请注意,与本文开头给出的示例相反,对于A
,假定存在超过2个Values
(以便我可以说明这一点).当我用实际数据执行此操作时,Value
列表在前4个元素之后被截断.
The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values
for A
(so that I can illustrate this point). When I do this with the actual data, the Value
lists get cut off after the first 4 elements.
推荐答案
我认为您正在使用pandas.Series.unique
.首先,将'Name'
索引设为一列
I think you are looking to use pandas.Series.unique
. First, make the 'Name'
index a column
df
# Value2 Value
#Name
#A x 8.8
#B y 6.6
#A x 6.6
#A x 8.8
df.reset_index(inplace=True)
# Name Value2 Value
#0 A x 8.8
#1 B y 6.6
#2 A x 6.6
#3 A x 8.8
下次调用groupby
并调用'Value'
系列上的unique
函数
Next call groupby
and call the unique
function on the 'Value'
series
gb = df.groupby(('Name','Value2'))
result = gb['Value'].unique()
result.reset_index(inplace=True) #lastly, reset the index
# Name Value2 Value
#0 A x [8.8, 6.6]
#1 B y [6.6]
最后,如果您想再次将'Name'
作为索引,只需执行
Finally, if you want 'Name'
as the index again, just do
result.set_index( 'Name', inplace=True)
# Value2 Value
#Name
#A x [8.8, 6.6]
#B y [6.6]
更新
作为后续措施,请确保在重置索引后重新分配结果
UPDATE
As a follow up, make sure you re-assign result after resetting the index
result = gb['Value'].unique()
type(result)
#pandas.core.series.Series
result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame
另存为CSV(而不是TSV)
您不想在这里使用CSV,因为Value
列条目中有逗号.相反,另存为TSV,您仍然使用相同的方法to_csv
,只需更改sep
arg:
saving as CSV (rather TSV)
You don't want to use CSV here because there are commas in the Value
column entries. Rather, save as TSV, you still use the same method to_csv
, just change the sep
arg:
result.to_csv( 'result.txt', sep='\t')
如果我将result.txt作为TSV加载到EXCEL中,我会得到
If I load result.txt in EXCEL as a TSV, I get
这篇关于Pandas DataFrame-将具有相同索引的一列值组合到列表中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!