Unable to make dataframe because read_csv whitespace separation not constant
Question
I am trying to turn this text file (philadelphia.txt) into a pandas DataFrame:
STATION           STATION_NAME                                       DATE     TAVG     TMAX     TMIN
----------------- -------------------------------------------------- -------- -------- -------- --------
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970605 -9999    74       47
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970606 -9999    68       50
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970608 -9999    72       50
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970609 -9999    83       47
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970610 -9999    86       55
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970611 -9999    88       61
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970612 -9999    83       70
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970613 -9999    80       66
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970614 -9999    80       64
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970615 -9999    77       55
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US         19970616 -9999    79       49
However, if I use
data = pd.read_csv('philadelphia.txt', sep="\s+", header=0)
it parses the header correctly, but then runs into trouble with the station name. I want the whole name to sit under the "STATION_NAME" column, but sep="\s+" splits it at every space, and I get this error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11
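The field counts in this error can be reproduced without pandas. The following is a minimal sketch that splits the header and the first data row (line 3 of the file) on runs of whitespace, using the text exactly as shown in the question; the variable names are only for illustration.

```python
import re

# Header and first data row (line 3 of the file), copied from the question.
header = "STATION STATION_NAME DATE TAVG TMAX TMIN"
row = "GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999 74 47"

# Split on runs of whitespace, the same pattern that was passed as sep.
header_fields = re.split(r"\s+", header.strip())
row_fields = re.split(r"\s+", row.strip())

print(len(header_fields), len(row_fields))  # 6 11
```

The six words of the station name each become their own field, which is where the five extra fields come from.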
How do I separate the data into 6 columns without splitting the station name into individual words?
I also want to be able to pass in other text files with different station names and columns, such as yellowknife.txt:
STATION           STATION_NAME                                       DATE     TMAX     TMIN
----------------- -------------------------------------------------- -------- -------- --------
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130117 -21      -35
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130118 -15      -21
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130119 -17      -29
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130120 -18      -28
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130121 -21      -34
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130122 -16      -30
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130123 -17      -28
GHCND:CA002204101 YELLOWKNIFE A CA                                   20130124 -5       -17
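Recommended answer
The file is fixed-width rather than delimited, so use the read_fwf() method. The row of dashes is read as the first data row (index 0), and .drop(0) removes it: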
In [7]: df = pd.read_fwf(r'/path/to/file.csv').drop(0)
In [8]: df
Out[8]:
              STATION                                STATION_NAME      DATE   TAVG  TMAX  TMIN
1   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970605  -9999    74    47
2   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970606  -9999    68    50
3   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970608  -9999    72    50
4   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970609  -9999    83    47
5   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970610  -9999    86    55
6   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970611  -9999    88    61
7   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970612  -9999    83    70
8   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970613  -9999    80    66
9   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970614  -9999    80    64
10  GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970615  -9999    77    55
11  GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970616  -9999    79    49
Columns:
In [9]: df.columns.tolist()
Out[9]: ['STATION', 'STATION_NAME', 'DATE', 'TAVG', 'TMAX', 'TMIN']
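A possible variation, sketched under a few assumptions: instead of dropping the dashed row after parsing, skip it at read time so the numeric columns are not mixed with a row of dashes, and take the column boundaries from that dashed row so files with a different number of columns (such as yellowknife.txt) go through the same code. The helper names colspecs_from_rule and load_station_file are hypothetical, treating -9999 as a missing-value marker is an assumption about the data, and the two-row sample is rebuilt from the question with the widths of its dashed line. Boundaries are passed explicitly because, once the dashed row is skipped, the spaces inside the station name could be mistaken for column gaps during inference.

```python
import io
import re

import pandas as pd


def colspecs_from_rule(rule_line):
    """Return a (start, end) span for each run of dashes in the rule line."""
    return [m.span() for m in re.finditer(r"-+", rule_line)]


def load_station_file(text):
    """Parse text laid out as: header line, dashed rule, fixed-width rows."""
    rule_line = text.splitlines()[1]
    return pd.read_fwf(
        io.StringIO(text),
        colspecs=colspecs_from_rule(rule_line),
        skiprows=[1],          # skip the dashed rule so it never becomes a data row
        na_values=["-9999"],   # assumption: -9999 marks a missing reading
    )


# Rebuild a small fixed-width sample using the widths of the dashed line.
widths = [17, 50, 8, 8, 8, 8]


def fixed(fields):
    return " ".join(str(f).ljust(w) for f, w in zip(fields, widths))


name = "PHILADELPHIA NE PHILADELPHIA AIRPORT PA US"
sample = "\n".join([
    fixed(["STATION", "STATION_NAME", "DATE", "TAVG", "TMAX", "TMIN"]),
    " ".join("-" * w for w in widths),
    fixed(["GHCND:USW00094732", name, 19970605, -9999, 74, 47]),
    fixed(["GHCND:USW00094732", name, 19970606, -9999, 68, 50]),
])

df = load_station_file(sample)
print(df.columns.tolist())
print(df["STATION_NAME"].iloc[0])
print(df.dtypes)
```

For a file on disk, the same function could be called as load_station_file(open('philadelphia.txt').read()).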