Skipping an unknown number of lines to read the header with Python pandas

Problem description

I have a data file that I read in with Python pandas:

import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t' )

The mock data looks like this:

unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

In this case the data contains 3 junk lines (lines I don't want to read in) before the header, and sometimes it contains 4 or more such junk lines. So in this case I read in the data with:

data = pd.read_csv('..../file.txt', sep='\t', skiprows = 3 )

The data then looks like this:

 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

But the number of unwanted lines is different each time. Is there a way to read a table file with pandas without hard-coding 'skiprows=', using instead something that matches the header so that reading starts from the header line? That way I would not have to open the file, count how many unwanted lines it contains, and manually change the 'skiprows=' option every time.

Recommended answer

If you know what the header starts with:

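import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    """Read fle with pandas, starting at the first line that begins with `line`."""
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while not cur_line.startswith(line):
            if not cur_line:  # readline() returns "" at end of file
                raise ValueError("Header line not found")
            pos = f.tell()  # offset where the next line starts
            cur_line = f.readline()
        f.seek(pos)  # rewind to the start of the header line
        return pd.read_csv(f, **kwargs)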

Demo:

In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo
In [19]: df = skip_to("test.txt","The,header", sep=",")

In [20]: df
Out[20]: 
      The header
0     foo    bar
1  foobar    foo

Calling .tell before each read records the offset at which the line about to be read starts, so when we hit the header we can seek back to the start of that line and pass the file object to pandas.
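The tell/seek bookkeeping can be seen in isolation with a minimal sketch that uses only the standard library. The helper name `seek_to_header` and the temporary file are illustrative assumptions, not part of the original answer. The file contents mirror the demo above.

```python
import os
import tempfile

def seek_to_header(f, header_prefix):
    """Leave f positioned at the start of the first line beginning with header_prefix.

    Hypothetical helper for illustration; raises ValueError if no line matches.
    """
    pos = f.tell()          # offset where the line about to be read starts
    line = f.readline()
    while line and not line.startswith(header_prefix):
        pos = f.tell()      # remember the start of the next line
        line = f.readline()
    if not line:            # readline() returned "" -> end of file reached
        raise ValueError("header not found")
    f.seek(pos)             # rewind to the start of the header line
    return pos

# Recreate the demo file from above in a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1,2\n3,4\nThe,header\nfoo,bar\nfoobar,foo\n")
    path = tmp.name

with open(path) as f:
    seek_to_header(f, "The,header")
    remaining = f.read()    # everything from the header line onwards

os.remove(path)
print(remaining)
```

Whatever is read after the seek starts at the header line, which is why an open file object positioned this way can be handed straight to pd.read_csv.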

Or, if the junk lines all start with something in common, match on the junk instead:

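import os
import pandas as pd

def skip_to(fle, junk, **kwargs):
    """Read fle with pandas, skipping leading lines that begin with `junk`."""
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while cur_line.startswith(junk):
            pos = f.tell()  # offset where the next line starts
            cur_line = f.readline()
        f.seek(pos)  # rewind to the first non-junk line
        return pd.read_csv(f, **kwargs)

df = skip_to("test.txt", "junk", sep="\t")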
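A related variation, not from the original answer, is to count the lines that precede the header and pass that number to 'skiprows=', so the count never has to be worked out by hand. The sketch below uses a hypothetical helper, `count_lines_before`, and a temporary file that mimics the mock data from the question. It strips leading whitespace before matching, on the assumption that the header may be indented as in the question's sample.

```python
import os
import tempfile

def count_lines_before(path, header_prefix):
    """Return how many lines precede the first line starting with header_prefix.

    Hypothetical helper for illustration; leading whitespace is ignored
    when matching. Raises ValueError if no line matches.
    """
    with open(path) as f:
        for n, line in enumerate(f):
            if line.lstrip().startswith(header_prefix):
                return n
    raise ValueError("header not found")

# Temporary file shaped like the mock data in the question (tab-separated).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("unwantedjunkline1\nunwantedjunkline2\nunwantedjunkline3\n")
    tmp.write(" ID\tColumnA\tColumnB\tColumnC\n")
    tmp.write(" 1\tA\tB\tC\n")
    path = tmp.name

n_skip = count_lines_before(path, "ID")
os.remove(path)
print(n_skip)
```

The result could then be used as pd.read_csv(path, sep='\t', skiprows=n_skip). This reads the top of the file twice, which is usually negligible when only a few junk lines precede the header.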
