Read .csv file to pandas data frame and identify data sections from line breaks


Question

I have a .csv file in which two or more blank lines mark the start of a new data section, but I don't know a priori how many lines each section contains. Is there a way to read it directly into a pandas data frame and stop at the first run of 2+ blank lines?

The data are as follows (a .csv file of Google Trends results, truncated here).

Web Search interest: zts
Worldwide; 2004 - present

Interest over time
Week,zts
2004-01-04 - 2004-01-10,0
2004-01-11 - 2004-01-17,80


Top regions for zts
Region,zts
Slovakia,100
Slovenia,23


Top cities for zts
City,zts
Bratislava (Slovakia),100
Wroclaw (Poland),39



Top searches for zts
focus zts,100
ford zts,90



Rising searches for zts
2002 focus zts,Breakout
battery tester,Breakout

Right now I use csv.reader() to loop over all the rows, keeping those that have two columns and match a date regex in the first column. But this seems hackish.

If I use something like pandas.read_csv(input_file, header=4) and then apply a date regex afterwards to find the correct section, it fails when the last section has three columns (here it doesn't, but it can).

Is there a way to stop pandas.read_csv() after the first block without knowing the number of rows a priori? Ideally I would like to parse this .csv into five data frames (one for each data section), but at this point I'm happy just grabbing the first section.

Recommended answer

You can also use regular expressions. They work quite well for situations like this.

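import re
from io import StringIO

import pandas as pd
from numpy.random import randint

csv1 = """right,top,bottom
4,5,6
6,7,8
"""

csv2 = """up,down,left
1,2,3
7,6,5
"""

csv3 = """a,b,c
1,2,3
4,5,6
"""

# Random number of extra newlines (2 to 5) to append after csv1 and csv2.
join_n = randint(2, 6, size=2)
raw = [csv1, csv2, csv3]
csvs = []

# zip stops at the shorter input, so only csv1 and csv2 are padded here.
for csv, n in zip(raw, join_n):
    csvs.append(csv + '\n' * n)

csvs.append(csv3)
csvs = ''.join(csvs)

# Split the text wherever two or more consecutive newlines occur.
splitsville = re.split(r'\n{2,}', csvs)

dfs = []

# Parse each section as its own DataFrame.
for sp in splitsville:
    dfs.append(pd.read_csv(StringIO(sp)))

final_df = pd.concat(dfs, axis=1)

print(final_df)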



yields:

   right  top  bottom  up  down  left  a  b  c
0      4    5       6   1     2     3  1  2  3
1      6    7       8   7     6     5  4  5  6

NOTE: You don't necessarily have to concat the list of DataFrames, but that is often a useful next step so that you don't have to keep operating on a list of DataFrames.
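The same idea might be applied to data shaped like the export in the question. What follows is a minimal sketch, not a definitive implementation. The helper name split_sections is hypothetical, and it assumes that the file uses plain \n line endings, that each section starts with a one-line title followed by a header row, and that any preamble is separated from its section by a single blank line. The pattern \n{3,} is used because two or more blank lines correspond to three or more consecutive newlines; \n{2,} would also split at a single blank line. Sections without a header row (such as "Top searches" in the question) would need header=None and are left out of the sample here.

```python
import re
from io import StringIO

import pandas as pd

# Sample text modelled on the first three sections of the question's file.
text = """Web Search interest: zts
Worldwide; 2004 - present

Interest over time
Week,zts
2004-01-04 - 2004-01-10,0
2004-01-11 - 2004-01-17,80


Top regions for zts
Region,zts
Slovakia,100
Slovenia,23


Top cities for zts
City,zts
Bratislava (Slovakia),100
Wroclaw (Poland),39
"""


def split_sections(text):
    """Return {title: DataFrame} for blocks separated by 2+ blank lines.

    Hypothetical helper: assumes every block is a title line followed by
    a header row and data rows.
    """
    sections = {}
    # Three or more consecutive newlines means at least two blank lines.
    for block in re.split(r'\n{3,}', text.strip()):
        # Drop any preamble separated from the data by a single blank line.
        block = block.split('\n\n')[-1]
        # The first line is the section title; the rest is ordinary CSV.
        title, _, body = block.partition('\n')
        sections[title] = pd.read_csv(StringIO(body))
    return sections


sections = split_sections(text)
first = sections['Interest over time']
print(first)
```

Keeping the sections in a dict keyed by title, rather than concatenating them, suits this data because the sections have different row counts and meanings. For a real file, the text could be obtained with open(input_file).read() before calling the helper.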
