Pandas not counting rows properly
Question
So I have this DataFrame:
filename width height class xmin ymin xmax ymax
0 128782.JPG 640 512 Panel 36 385 119 510
1 128782.JPG 640 512 Panel 124 388 207 510
2 128782.JPG 640 512 Panel 210 390 294 511
3 128782.JPG 640 512 Panel 294 395 380 510
4 128782.JPG 640 512 Panel 379 398 466 511
5 128782.JPG 640 512 Panel 465 402 553 510
6 128782.JPG 640 512 P+SD 552 402 638 510
7 128782.JPG 640 512 P+SD 558 264 638 404
...
...
57170 128782.JPG 640 512 P+SD 36 242 121 383
57171 128782.JPG 640 512 HS+P+SD 36 97 122 242
57172 128782.JPG 640 512 P+SD 214 106 304 250
The column called "class" contains the unique values "Panel", "P+SD" and "HS+P+SD". I want to count how many rows have each of these values, so I tried this:
print(len(split_df[split_df["class"].str.contains('Panel')]))
print(len(split_df[split_df["class"].str.contains('HS+P+SD')]))
print(len(split_df[split_df["class"].str.contains('P+SD')]))
This gave me the following output:
56988
0
0
This is incorrect, as you can clearly see from the snippet of the DataFrame above. Why is everything counted properly for "Panel" while nothing is counted for the other two "class" values?
Here's the output of split_df.info:
RangeIndex: 57172 entries, 0 to 57171
Data columns (total 8 columns):
filename 57172 non-null object
width 57172 non-null int64
height 57172 non-null int64
class 57172 non-null object
xmin 57172 non-null int64
ymin 57172 non-null int64
xmax 57172 non-null int64
ymax 57172 non-null int64
dtypes: int64(6), object(2)
memory usage: 3.5+ MB
I cannot for the life of me figure out what is wrong. Any help is appreciated.
Recommended answer
pd.Series.str.contains has regex=True by default. Since + is a special character in regex, use regex=False, re.escape, or \ escaping:
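import re
import pandas as pd

s = pd.Series(['HS+P+SD', 'AB+CD+EF'])
s.str.contains('HS+P+SD').sum()  # 0: + is treated as a regex quantifier
s.str.contains('HS+P+SD', regex=False).sum()  # 1: plain substring match
s.str.contains(re.escape('HS+P+SD')).sum()  # 1: special characters escaped for you
s.str.contains(r'HS\+P\+SD').sum()  # 1: raw string with manually escaped +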
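To see why the unescaped pattern returns 0, it may help to test it with the standard re module on its own. In a regex, + means "one or more of the preceding character", so 'HS+P+SD' looks for an H, one or more S, one or more P, then SD, and never asks for a literal plus sign. A minimal sketch, where the test string 'HSSPPSD' is made up purely for illustration:

```python
import re

pattern = 'HS+P+SD'  # unescaped: each + is a quantifier, not a literal character

# The literal label is not matched, because the pattern contains no literal '+'.
literal_match = re.search(pattern, 'HS+P+SD')

# A made-up string with no plus signs at all is matched instead.
no_plus_match = re.search(pattern, 'HSSPPSD')

# re.escape backslash-escapes the plus signs so they are matched literally.
escaped_match = re.search(re.escape(pattern), 'HS+P+SD')

print(literal_match is None)      # True
print(no_plus_match is not None)  # True
print(escaped_match is not None)  # True
```

The same reasoning explains the question's output: 'Panel' contains no regex special characters, so it was counted correctly, while both 'P+SD' and 'HS+P+SD' were read as patterns that none of the labels match.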
"I want to count how many rows there are with these values"
If this is your core problem and you don't want the 'P+SD' count to include 'HS+P+SD', don't use str.contains. Check for equality instead and use value_counts on the values you wish to count:
L = ['Panel', 'HS+P+SD', 'P+SD']
counts = df.loc[df['class'].isin(L), 'class'].value_counts()
Or for all counts just use df['class'].value_counts().
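The difference between substring matching and exact matching can be checked on a small example. The sketch below uses a hypothetical five-row stand-in for the "class" column, not the real data from the question:

```python
import pandas as pd

# Hypothetical miniature of the "class" column (not the asker's real data).
df = pd.DataFrame({'class': ['Panel', 'Panel', 'P+SD', 'HS+P+SD', 'P+SD']})

# Substring match: 'HS+P+SD' also contains 'P+SD', so it is counted too.
substring_count = df['class'].str.contains('P+SD', regex=False).sum()

# Exact match: only rows whose label is exactly 'P+SD'.
exact_count = (df['class'] == 'P+SD').sum()

# Exact counts for every label in one call.
counts = df['class'].value_counts()

print(substring_count)    # 3
print(exact_count)        # 2
print(counts['HS+P+SD'])  # 1
```

Even with regex=False, str.contains over-counts 'P+SD' here, which is why equality or value_counts is the safer choice when the labels overlap.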