Pandas not counting rows properly

Question

I have this DataFrame:

         filename  width  height    class  xmin  ymin  xmax  ymax
0      128782.JPG    640     512    Panel    36   385   119   510
1      128782.JPG    640     512    Panel   124   388   207   510
2      128782.JPG    640     512    Panel   210   390   294   511
3      128782.JPG    640     512    Panel   294   395   380   510
4      128782.JPG    640     512    Panel   379   398   466   511
5      128782.JPG    640     512    Panel   465   402   553   510
6      128782.JPG    640     512     P+SD   552   402   638   510
7      128782.JPG    640     512     P+SD   558   264   638   404
...
...
57170  128782.JPG    640     512     P+SD    36   242   121   383
57171  128782.JPG    640     512  HS+P+SD    36    97   122   242
57172  128782.JPG    640     512     P+SD   214   106   304   250

The column called "class" contains the unique values "Panel", "P+SD" and "HS+P+SD". I want to count how many rows there are with each of these values, so I tried this:

print(len(split_df[split_df["class"].str.contains('Panel')]))
print(len(split_df[split_df["class"].str.contains('HS+P+SD')]))
print(len(split_df[split_df["class"].str.contains('P+SD')]))

This gave me the following output:

56988
0
0

This is incorrect, as you can clearly see from the snippet of the DataFrame above. Why is everything counted properly for "Panel" but nothing counted for the other two "class" names?

Here's the output of split_df.info():

RangeIndex: 57172 entries, 0 to 57171
Data columns (total 8 columns):
filename    57172 non-null object
width       57172 non-null int64
height      57172 non-null int64
class       57172 non-null object
xmin        57172 non-null int64
ymin        57172 non-null int64
xmax        57172 non-null int64
ymax        57172 non-null int64
dtypes: int64(6), object(2)
memory usage: 3.5+ MB

I cannot for the life of me figure out what is wrong. Any help is appreciated.

Recommended answer

pd.Series.str.contains has regex=True by default, so the pattern is interpreted as a regular expression. Since + is a special character in regex (a quantifier meaning "one or more of the preceding element"), either pass regex=False, escape the pattern with re.escape, or escape each + with a backslash:

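import re
import pandas as pd

s = pd.Series(['HS+P+SD', 'AB+CD+EF'])

s.str.contains('HS+P+SD').sum()               # 0: + is read as a regex quantifier
s.str.contains('HS+P+SD', regex=False).sum()  # 1: literal substring match
s.str.contains(re.escape('HS+P+SD')).sum()    # 1: re.escape adds the backslashes
s.str.contains(r'HS\+P\+SD').sum()            # 1: manual escaping in a raw string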
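
To see why the unescaped pattern finds nothing, it may help to test it with Python's re module directly. The following is a small sketch, not part of the original answer; the sample strings 'HSPSD' and 'HSSPPSD' are made up for illustration.

```python
import re

pattern = 'HS+P+SD'  # unescaped: each + is a quantifier, not a literal plus sign

# As a regex this reads: 'H', one or more 'S', one or more 'P', then 'SD'.
matches_no_plus = re.search(pattern, 'HSPSD') is not None      # illustrative input
matches_repeats = re.search(pattern, 'HSSPPSD') is not None    # illustrative input
matches_literal = re.search(pattern, 'HS+P+SD') is not None    # the value in the data

# re.escape backslash-escapes the special characters so + is matched literally.
escaped = re.escape(pattern)
matches_escaped = re.search(escaped, 'HS+P+SD') is not None

print(matches_no_plus, matches_repeats, matches_literal, matches_escaped)
# True True False True
```

So the unescaped pattern matches strings with no plus signs at all, and it never matches the actual class name, which is why the counts for 'HS+P+SD' and 'P+SD' came out as 0.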

"I want to count how many rows there are with these values"

If this is your core problem and you don't want the 'P+SD' count to include 'HS+P+SD' rows, don't use str.contains, because it tests for a substring and 'P+SD' is a substring of 'HS+P+SD'. Check for equality instead, and use value_counts on the values you wish to count:

L = ['Panel', 'HS+P+SD', 'P+SD']
counts = df.loc[df['class'].isin(L), 'class'].value_counts()

Or, for counts of every value in the column, just use df['class'].value_counts().
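
The difference between substring matching and exact matching can be sketched on a tiny frame. The six-row DataFrame below is a hypothetical stand-in for the "class" column in the question, not the asker's real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the "class" column in the question.
df = pd.DataFrame({'class': ['Panel', 'P+SD', 'P+SD', 'HS+P+SD', 'Panel', 'Panel']})

# Literal substring match: 'P+SD' is also found inside 'HS+P+SD'.
substring_count = int(df['class'].str.contains('P+SD', regex=False).sum())

# Exact equality counts only the rows whose value is exactly 'P+SD'.
exact_count = int((df['class'] == 'P+SD').sum())

# value_counts tallies every distinct value in one call.
counts = df['class'].value_counts()

print(substring_count)  # 3
print(exact_count)      # 2
print(counts.to_dict())
```

Here the substring count is one higher than the exact count because the single 'HS+P+SD' row is included, which is the over-counting the equality-based approach avoids.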
