pandas.read_excel sometimes incorrectly reads Boolean values as 1's/0's

Question

I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.

As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's instead of True and False. It appears to have something to do with the number of empty rows and the type of the other data in the column. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated: Boolean_1.xlsx

The code:

import pandas as pd

# Read both sheets of the sample workbook into separate DataFrames
df1 = pd.read_excel('Boolean_1.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx', sheet_name='Sheet2')
print(df1, '\n' * 2, df2)

Here's the output. Mainly note row ZBA, which has the same values in both sheets but different values in the DataFrames:

  Name stuff  Unnamed: 1 Unnamed: 2 Unnamed: 3
0         AFD          a        dsf        ads
1         DFA          1          2          3
2         DFD      123.3       41.1       13.7
3        IIOP        why        why        why
4         NaN        NaN        NaN        NaN
5         ZBA      False      False       True 

   Name adslfa  Unnamed: 1  Unnamed: 2  Unnamed: 3
0        asdf         6.0         3.0         6.0
1         NaN         NaN         NaN         NaN
2         NaN         NaN         NaN         NaN
3         NaN         NaN         NaN         NaN
4         NaN         NaN         NaN         NaN
5         ZBA         0.0         0.0         1.0

I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.

What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?

Recommended answer

read_excel will determine the dtype for each column based on the first row in the column with a value. If the first row of that column is empty, read_excel will continue to the next row until a value is found.

In Sheet1, the first row with values in columns B, C, and D contains strings. Those columns are therefore read as generic object columns, so each cell keeps its own value. In this case, FALSE = False.

In Sheet2, the first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows are treated as numbers for these columns. In this case, FALSE = 0 (printed as 0.0, because the empty rows become NaN and make the columns float).
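The difference between the two sheets can be imitated without the Excel file. The sketch below is only an illustration: the two Series are hand-typed stand-ins for one column of each sheet (not data read from Boolean_1.xlsx), and pd.to_numeric is used as an assumed approximation of the numeric conversion the reader applies, not as the exact code path inside read_excel.

```python
import pandas as pd

# Hypothetical stand-ins for one column of each sheet (typed by hand).
nan = float("nan")
sheet1_col = pd.Series(["a", 1, 123.3, "why", nan, False], dtype=object)
sheet2_col = pd.Series([6, nan, nan, nan, nan, False], dtype=object)

# A column that contains strings stays object dtype, so each cell keeps
# its own Python value and the Boolean survives.
print(sheet1_col.dtype, repr(sheet1_col.iloc[-1]))

# A column holding only numbers, NaN and Booleans can be converted to a
# numeric dtype; the conversion turns False into 0.0.
# (pd.to_numeric is an assumption standing in for the reader's inference.)
sheet2_numeric = pd.to_numeric(sheet2_col)
print(sheet2_numeric.dtype, repr(sheet2_numeric.iloc[-1]))
```

If the real columns behave the same way, checking df1.dtypes and df2.dtypes after reading should show object for the Sheet1 columns and float64 for the Sheet2 columns.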
