Pandas read_csv dtype leading zeros

Problem description

So I'm reading in a station codes csv file from NOAA which looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"

The first two columns contain codes for weather stations, and sometimes they have leading zeros. When pandas imports them without a specified dtype, they turn into integers. It's not really that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i since they are always six digits, but you know... that's the lazy man's way.
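To make the symptom and the workaround described above concrete, here is a minimal sketch. It assumes a recent pandas on Python 3 rather than the 0.11.0 / Python 2.7 setup in the traceback, and it uses a shortened in-memory copy of the sample rows (only three columns) instead of the downloaded file. The names csv_text, raw_codes and padded_codes are made up for illustration.

```python
import io

import pandas as pd

# Shortened, in-memory stand-in for the NOAA sample rows (assumption: only
# three of the original columns are kept to keep the example small).
csv_text = (
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# With no dtype given, the parser infers integers, so the leading zeros vanish.
df = pd.read_csv(io.StringIO(csv_text))
raw_codes = df["USAF"].tolist()

# The workaround from the question: format each integer back to six digits.
padded_codes = df["USAF"].map(lambda i: "%06d" % i).tolist()

print(raw_codes)
print(padded_codes)
```

This only works because the codes have a known fixed width; it cannot recover values whose original width varied.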

The csv is obtained using this code:

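import urllib  # Python 2; on Python 3 the equivalent call is urllib.request.urlopen

# Download the station list and save it locally.
file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()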

That's all well and good, but when I try to read it using this:

import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': np.str, 'WBAN': np.str})

or

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})

I get a nasty error message:

  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood

It's a pretty big csv (31k rows) so maybe that has something to do with it?

Solution

This problem caused me all sorts of headaches when parsing a file with serial numbers. For unknown reasons, 00794 and 000794 are two distinct serial numbers. I eventually came up with this argument to read_csv:

converters={'serial_number': lambda x: str(x)}
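
As a rough sketch of how that argument fits into a full call, the example below reads a tiny in-memory CSV twice. The column name serial_number comes from the answer; the model column, the sample values and the variable names are invented for illustration. The second call, using dtype with plain str, is an assumption about recent pandas versions and is not what the answer proposes; the traceback in the question suggests it did not work on pandas 0.11.0.

```python
import io

import pandas as pd

# Hypothetical data: two serial numbers that differ only in leading zeros.
csv_text = "serial_number,model\n00794,A\n000794,B\n"

# Approach from the answer: a per-column converter that keeps the raw text.
df_conv = pd.read_csv(
    io.StringIO(csv_text),
    converters={"serial_number": lambda x: str(x)},
)

# Alternative (assumes a recent pandas): ask for strings via dtype.
df_dtype = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"serial_number": str},
)

conv_values = df_conv["serial_number"].tolist()
dtype_values = df_dtype["serial_number"].tolist()
print(conv_values)
print(dtype_values)
```

Either way the column is never parsed as a number, so the two serial numbers stay distinct.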
