Unable to make dataframe because read_csv whitespace separation not constant


Problem description

I am trying to make this text file (philadelphia.txt) into a pandas dataframe:

STATION           STATION_NAME                                       DATE     TAVG     TMAX     TMIN     
----------------- -------------------------------------------------- -------- -------- -------- -------- 
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999    74       47       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970606 -9999    68       50       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970608 -9999    72       50       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970609 -9999    83       47       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970610 -9999    86       55       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970611 -9999    88       61       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970612 -9999    83       70       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970613 -9999    80       66       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970614 -9999    80       64       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970615 -9999    77       55       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970616 -9999    79       49

However, if I use

data = pd.read_csv('philadelphia.txt', sep="\s+", header=0)

it produces the correct header, but then fails when it reaches the station name. I want the whole name to stay under the column "STATION_NAME", but sep="\s+" splits it at every space, and I get this error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11
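The field counts in this message can be reproduced without pandas. The following is only an illustrative sketch: `header` and `row` are copies of the first and third lines of the file, and the variable names are made up for this example. Splitting on runs of whitespace gives 6 fields for the header but 11 for a data row, because every word of the station name becomes its own field.

```python
import re

# First and third lines of philadelphia.txt, copied from the question.
header = "STATION           STATION_NAME                                       DATE     TAVG     TMAX     TMIN"
row = "GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999    74       47"

# Split on runs of whitespace, which is what sep="\s+" asks the parser to do.
header_fields = re.split(r"\s+", header.strip())
row_fields = re.split(r"\s+", row.strip())

print(len(header_fields))  # 6
print(len(row_fields))     # 11
```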

How do I separate the data into 6 columns, without splitting the station name into individual words?

I also want to be able to pass in other text files with different station names, such as yellowknife.txt:

STATION           STATION_NAME                                       DATE     TMAX     TMIN     
----------------- -------------------------------------------------- -------- -------- -------- 
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130117 -21      -35      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130118 -15      -21      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130119 -17      -29      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130120 -18      -28      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130121 -21      -34      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130122 -16      -30      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130123 -17      -28      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130124 -5       -17      

Recommended answer

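Use the read_fwf() method, which reads fixed-width columns instead of splitting on a separator. The row of dashes is read in as the first data row, so .drop(0) removes it: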

In [7]: df = pd.read_fwf(r'/path/to/file.csv').drop(0)

In [8]: df
Out[8]:
              STATION                                STATION_NAME      DATE   TAVG TMAX TMIN
1   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970605  -9999   74   47
2   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970606  -9999   68   50
3   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970608  -9999   72   50
4   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970609  -9999   83   47
5   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970610  -9999   86   55
6   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970611  -9999   88   61
7   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970612  -9999   83   70
8   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970613  -9999   80   66
9   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970614  -9999   80   64
10  GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970615  -9999   77   55
11  GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970616  -9999   79   49

Columns:

In [9]: df.columns.tolist()
Out[9]: ['STATION', 'STATION_NAME', 'DATE', 'TAVG', 'TMAX', 'TMIN']
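One possible refinement of the approach above: because the dashed row is parsed as data before it is dropped, the numeric columns may come back as strings rather than integers. The sketch below takes the column boundaries from the dashed line itself and then skips that line. It assumes the second line of every file is a ruler of dashes, one run per column. `read_station_text` and the rebuilt `sample` text are hypothetical names for this example, not part of pandas or of the original answer.

```python
import io
import re

import pandas as pd


def read_station_text(text):
    """Parse a fixed-width table whose second line is a ruler of dashes."""
    ruler = text.splitlines()[1]
    # Each run of dashes spans exactly one column: (start, end), end exclusive.
    colspecs = [match.span() for match in re.finditer(r"-+", ruler)]
    # Skip the ruler (file line index 1) so it never becomes a data row.
    return pd.read_fwf(io.StringIO(text), colspecs=colspecs, skiprows=[1])


# Hypothetical sample, rebuilt with the column widths used in the question.
widths = [17, 50, 8, 8, 8, 8]
names = ["STATION", "STATION_NAME", "DATE", "TAVG", "TMAX", "TMIN"]
station = "GHCND:USW00094732"
name = "PHILADELPHIA NE PHILADELPHIA AIRPORT PA US"
records = [
    [station, name, "19970605", "-9999", "74", "47"],
    [station, name, "19970606", "-9999", "68", "50"],
]

header_line = " ".join(n.ljust(w) for n, w in zip(names, widths))
ruler_line = " ".join("-" * w for w in widths)
# The station name is right-aligned in the source; other fields are left-aligned.
data_lines = [
    " ".join(c.rjust(w) if i == 1 else c.ljust(w)
             for i, (c, w) in enumerate(zip(rec, widths)))
    for rec in records
]
sample = "\n".join([header_line, ruler_line] + data_lines)

df = read_station_text(sample)
print(df)
print(df.dtypes)
```

For a file on disk, the text could be loaded first, for example with `pathlib.Path("philadelphia.txt").read_text()`. Since the boundaries come from each file's own ruler, a file with five columns such as yellowknife.txt should need no changes.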
