Pandas read_csv dtype leading zeros

Views: 155  Published: 2017/2/24 17:09:12  Tags: python string csv pandas


Problem description

So I'm reading in a station codes csv file from NOAA which looks like this:
"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"
The first two columns contain codes for weather stations, and sometimes they have leading zeros. When pandas imports them without a specified dtype, they turn into integers. It's not really that big of a deal, because I can loop through the dataframe index and replace them with something like "%06d" % i since they are always six digits, but you know... that's the lazy man's way.
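To make the symptom concrete, here is a minimal sketch that reproduces it on two rows shaped like the sample above. The in-memory `csv_text` (with the columns trimmed to three) is an assumption for illustration, not the real NOAA file, and `astype(str).str.zfill(6)` is a vectorised stand-in for the "%06d" % i loop, not the code I actually ran.

```python
import io

import pandas as pd

# Two rows shaped like the NOAA sample (columns trimmed; illustrative data only).
csv_text = (
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# With no dtype given, pandas infers integers, so the leading zeros are lost.
df = pd.read_csv(io.StringIO(csv_text))
inferred = df["USAF"].tolist()

# Workaround after loading: convert back to text and left-pad to six digits.
padded = df["USAF"].astype(str).str.zfill(6).tolist()

print(inferred)
print(padded)
```

This only works because the codes have a known fixed width; it cannot recover zeros when the original width varies.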


The csv is obtained using this code:
import urllib

# Python 2: fetch the station list from NOAA and save a local copy
file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv', 'wb')
output.write(file.read())
output.close()
which is all well and good, but when I try to read it using this:
import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv", dtype={'USAF': np.str, 'WBAN': np.str})
or
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv", dtype={'USAF': str, 'WBAN': str})
I get a nasty error message:
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood
It's a pretty big csv (31k rows) so maybe that has something to do with it?
Solution
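This problem caused me all sorts of headaches when parsing a file with serial numbers. In my data, 00794 and 000794 are two distinct serial numbers (for reasons unknown to me), so the leading zeros have to survive the import. I eventually came up with passing this argument to read_csv: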
converters={'serial_number': lambda x: str(x)}
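Applied to the question's columns, the call might look like the sketch below. The in-memory `csv_text` and the use of `USAF`/`WBAN` in place of my `serial_number` column are assumptions for illustration. Passing `str` directly is equivalent to `lambda x: str(x)`. The second call shows that a `dtype` mapping also keeps the zeros on recent pandas releases, though I have not checked which release first accepted it.

```python
import io

import pandas as pd

# Two rows shaped like the NOAA sample (columns trimmed; illustrative data only).
csv_text = (
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# A converter receives each cell as raw text, so returning it unchanged
# bypasses numeric inference and keeps the leading zeros.
df = pd.read_csv(io.StringIO(csv_text), converters={"USAF": str, "WBAN": str})

# Alternative: declare the columns as strings via dtype (recent pandas).
df_dtype = pd.read_csv(io.StringIO(csv_text), dtype={"USAF": str, "WBAN": str})

print(df["USAF"].tolist())
print(df_dtype["USAF"].tolist())
```

Unlike padding after the fact, this keeps whatever width the file actually contains, which matters when 00794 and 000794 must stay distinct.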


                        