pandas groupby is returning two groups for the same unique id

Problem description

I have a large pandas DataFrame on which I am running groupby operations.

CHROM    POS    Data01    Data02 ......
1        ....................
1        ...................
2        ..................
2        ............
scaf_9   .............
scaf_9   ............

So, I am doing:

 my_data_grouped = my_data.groupby('CHROM')

 for chr_, data in my_data_grouped:
     # do something with chr_
     # write something from that chr_ data
     ...

Everything works fine with small data and with data that has no string-type CHROM such as scaff_9. But with very large data that includes scaff_9, I get two groups for CHROM 2. There is no error message and the computation itself is not affected. The issue shows up when I write the data to a file group by group: I get two groups of 2 (split unequally).

It is becoming very hard for me to trace back the origin of this problem, since there is no error message and it works well with small data. My only assumptions are:

  • Is there a limit on the number of rows, in the total dataframe or in a grouped dataframe, that pandas can handle? If so, what is the fix?
  • Among all the 2 values, most are treated as integer objects and some (the later part, close to scaff_9) as string objects. Is this possible?

Sorry, I am only guessing here, and it is becoming impossible for me to find the origin of the problem.

Post Edit: I have also tried running sort_by(['CHROM']) before the groupby, but the problem still persists.

Is there any possible fix for this issue?

Thanks,

Solution

In my opinion this is a data problem: some values probably contain whitespace, so pandas treats them as different keys and puts them in separate groups.

The solution should be to strip the surrounding whitespace first:

df.index = df.index.astype(str).str.strip()
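
To illustrate the whitespace explanation, here is a minimal sketch. The data is hypothetical (it is not the asker's file), and it assumes CHROM is a regular column rather than the index. It shows that "2" and "2 " count as different group keys until they are stripped:

```python
import pandas as pd

# Hypothetical data: the third CHROM value carries a trailing space.
df = pd.DataFrame({
    "CHROM": ["1", "2", "2 ", "scaf_9"],
    "POS": [10, 20, 30, 40],
})

# "2" and "2 " are distinct strings, so groupby sees four keys.
groups_before = df.groupby("CHROM").ngroups

# Cast to str and strip leading/trailing whitespace, then group again.
df["CHROM"] = df["CHROM"].astype(str).str.strip()
groups_after = df.groupby("CHROM").ngroups

print(groups_before, groups_after)
```

If the group count drops after stripping, stray whitespace was at least part of the problem.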

You can also check the unique string values of the index:

a = df.index[df.index.map(type) == str].unique().tolist()


If the first column is not the index:

df['CHROM'] = df['CHROM'].astype(str).str.strip()

a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
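
A related check, sketched here on hypothetical data, is to count how many Python types occur in the column before any conversion. More than one type in an object column would be consistent with the asker's second guess (some 2 values stored as integers, others as strings):

```python
import pandas as pd

# Hypothetical object column mixing integers and strings.
df = pd.DataFrame({"CHROM": [1, 2, "2", "scaf_9"]})

# Map every value to its Python type and count the occurrences.
type_counts = df["CHROM"].map(type).value_counts().to_dict()

# Number of distinct Python types present in the column.
n_types = df["CHROM"].map(type).nunique()

print(type_counts)
print(n_types)
```

This check only makes sense before calling astype(str), since afterwards every value is a string.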

EDIT:

In the end the final solution was simpler, just casting to str:

df['CHROM'] = df['CHROM'].astype(str)
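
The following sketch, on hypothetical data, shows why the cast helps: the integer 2 and the string "2" are different group keys, and astype(str) merges them. The second part is an assumption about the cause, not something confirmed in the question: if the file was loaded with pd.read_csv, passing dtype={"CHROM": str} would keep the column as strings from the start.

```python
import io

import pandas as pd

# Hypothetical mixed-type column: integer 2 and string "2" are different keys.
df = pd.DataFrame({
    "CHROM": [1, 2, "2", "scaf_9"],
    "POS": [10, 20, 30, 40],
})
groups_mixed = df.groupby("CHROM").ngroups

# Casting everything to str makes 2 and "2" the same key.
df["CHROM"] = df["CHROM"].astype(str)
groups_fixed = df.groupby("CHROM").ngroups

# Assumed alternative: force the string dtype while reading the file.
csv_text = "CHROM,POS\n1,10\n2,20\n2,30\nscaf_9,40\n"
loaded = pd.read_csv(io.StringIO(csv_text), dtype={"CHROM": str})
groups_loaded = loaded.groupby("CHROM").ngroups

print(groups_mixed, groups_fixed, groups_loaded)
```

Setting the dtype at load time avoids a second pass over a large dataframe, but either approach should leave a single group per CHROM value.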
