Why is pandas.read_fwf not skipping the blank line as instructed?

Problem description

I'm reading a fixed-width format file (full source file) that is full of missing data, so pandas.read_fwf comes in handy. There is an empty line after the header, so I'm passing skip_blank_lines=True, but this appears to have no effect, as the first entry is still full of NaN/NaT:

import io
import pandas

s="""USAF   WBAN  STATION NAME                  CTRY ST CALL  LAT     LON      ELEV(M) BEGIN    END

007018 99999 WXPOD 7018                                  +00.000 +000.000 +7018.0 20110309 20130730
007026 99999 WXPOD 7026                    AF            +00.000 +000.000 +7026.0 20120713 20170822
007070 99999 WXPOD 7070                    AF            +00.000 +000.000 +7070.0 20140923 20150926
008260 99999 WXPOD8270                                   +00.000 +000.000 +0000.0 20050101 20100920
008268 99999 WXPOD8278                     AF            +32.950 +065.567 +1156.7 20100519 20120323
008307 99999 WXPOD 8318                    AF            +00.000 +000.000 +8318.0 20100421 20100421
008411 99999 XM20                                                                 20160217 20160217
008414 99999 XM18                                                                 20160216 20160217
008415 99999 XM21                                                                 20160217 20160217
008418 99999 XM24                                                                 20160217 20160217
010000 99999 BOGUS NORWAY                  NO      ENRS                           20010927 20041019
010010 99999 JAN MAYEN(NOR-NAVY)           NO      ENJA  +70.933 -008.667 +0009.0 19310101 20200111
010013 99999 ROST                          NO                                     19861120 19880105
010014 99999 SORSTOKKEN                    NO      ENSO  +59.792 +005.341 +0048.8 19861120 20200110
"""

print(pandas.read_fwf(io.StringIO(s), parse_dates=["BEGIN", "END"],
      skip_blank_lines=True))

This results in:

USAF     WBAN         STATION NAME  ... ELEV(M)      BEGIN        END
0       NaN      NaN                  NaN  ...     NaN        NaT        NaT
1    7018.0  99999.0           WXPOD 7018  ...  7018.0 2011-03-09 2013-07-30
2    7026.0  99999.0           WXPOD 7026  ...  7026.0 2012-07-13 2017-08-22
3    7070.0  99999.0           WXPOD 7070  ...  7070.0 2014-09-23 2015-09-26
4    8260.0  99999.0            WXPOD8270  ...     0.0 2005-01-01 2010-09-20
5    8268.0  99999.0            WXPOD8278  ...  1156.7 2010-05-19 2012-03-23
6    8307.0  99999.0           WXPOD 8318  ...  8318.0 2010-04-21 2010-04-21
7    8411.0  99999.0                 XM20  ...     NaN 2016-02-17 2016-02-17
8    8414.0  99999.0                 XM18  ...     NaN 2016-02-16 2016-02-17
9    8415.0  99999.0                 XM21  ...     NaN 2016-02-17 2016-02-17
10   8418.0  99999.0                 XM24  ...     NaN 2016-02-17 2016-02-17
11  10000.0  99999.0         BOGUS NORWAY  ...     NaN 2001-09-27 2004-10-19
12  10010.0  99999.0  JAN MAYEN(NOR-NAVY)  ...     9.0 1931-01-01 2020-01-11
13  10013.0  99999.0                 ROST  ...     NaN 1986-11-20 1988-01-05
14  10014.0  99999.0           SORSTOKKEN  ...    48.8 1986-11-20 2020-01-10

[15 rows x 11 columns]

Row 0 still contains only NaN/NaT values. I was expecting row 0 to be the first non-empty data row, the one starting with 007018. Why does skip_blank_lines=True appear to have no effect? How can I tell pandas to skip the blank line? Am I doing something wrong?

Recommended answer

One detail missing from your code is the widths parameter, which you did not pass.

But that is not all. Unfortunately, read_fwf also has a bug: it ignores the skip_blank_lines parameter.
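When the blank line sits at a known position, as it does here directly below the header, one possible way to sidestep the problem is the skiprows parameter, which accepts a list of 0-indexed line numbers to drop. The following is a minimal sketch under that assumption; the three-column sample is a trimmed-down excerpt of the data in the question, and dtype=str is added only to keep the leading zeros visible.

```python
import io

import pandas

# Trimmed-down excerpt of the question's data: header, blank line, two rows.
sample = (
    "USAF   WBAN  BEGIN\n"
    "\n"
    "007018 99999 20110309\n"
    "007026 99999 20120713\n"
)

# Line 0 is the header and line 1 is the blank line, so skip line 1 explicitly.
df = pandas.read_fwf(io.StringIO(sample), skiprows=[1], dtype=str)
print(df)
```

This only helps when the blank line's position is known up front; blank lines scattered through the file need a filter instead.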

To work around this bug, define the following class, whose readline method skips empty lines:

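class LineFilter(io.TextIOBase):
    """Text stream wrapper whose readline skips blank lines."""

    def __init__(self, iterable):
        self.iterable = iterable

    def readline(self, size=-1):
        # Test the stripped copy but return the line unchanged, so leading
        # spaces (and thus the fixed-width column offsets) are preserved.
        for line in self.iterable:
            if line.strip():
                return line
        return ""  # an empty string signals end of input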

Then run:

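# The question imports the library as `pandas`, so that name is used here.
# The fourth width (8) spans both the CTRY and ST columns.
df = pandas.read_fwf(LineFilter(io.StringIO(s)), widths=[7, 6, 30, 8, 6, 8, 9, 8, 9, 9],
    parse_dates=["BEGIN", "END"], na_filter=False)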

As you can see, I added na_filter=False to prevent empty strings from being converted to NaN values.
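If the input fits comfortably in memory, the same filtering can be done before pandas ever sees the data, which avoids writing a custom stream class. The sketch below is one way to do that; drop_blank_lines is a hypothetical helper name, not part of pandas, and the sample is a trimmed-down excerpt of the question's data with an extra whitespace-only line added for illustration.

```python
import io


def drop_blank_lines(text):
    """Return an in-memory text stream holding only the non-blank lines of text."""
    # keepends=True keeps each line's newline; lines are otherwise unchanged,
    # so fixed-width column offsets are preserved.
    kept = [line for line in text.splitlines(keepends=True) if line.strip()]
    return io.StringIO("".join(kept))


sample = (
    "USAF   WBAN  BEGIN\n"
    "\n"
    "007018 99999 20110309\n"
    "   \n"
    "007026 99999 20120713\n"
)

cleaned = drop_blank_lines(sample)
lines = cleaned.getvalue().splitlines()
print(lines)
```

The returned object is an ordinary io.StringIO, so it can be passed to pandas.read_fwf in place of LineFilter(io.StringIO(s)). The trade-off is that the whole input is held in memory at once, whereas LineFilter processes it line by line.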
