Check if header exists with Python pandas

Question

I have a question. Is there a way to check whether a column header exists in a file, or to skip rows until one is found? Say I have a group of files: one with a header on the first row, another with the header on the second row after some useless text on the first row, and another with no header at all. I want to skip all rows before the column header, or detect whether a header even exists, without specifying "skiprows" in the code. There are a number of hard-coded ways to do this. I have used regexes, replaces, etc., but I am looking for a more universal idea that covers all bases. I even made a raw-input prompt that lets you enter the number of rows to skip. That method worked, but I want something that does not rely on user input and just detects column headers on its own. I am just looking for a few ideas, if any. I am working mainly with CSV-type files and would like to do this with Python.

Recommended answer

csv.Sniffer has a has_header() function that should return True if the first row appears to be a header. A procedure for using it would be to first remove all empty rows from the top of the file, up to the first non-empty row, and then run csv.Sniffer.has_header(). My experience is that the header must be on the first line for has_header() to return True, and that it will return False if the number of header fields does not match the number of data fields for at least one row in its scan range. The scan range is the size of the sample passed to has_header() and must be set by the user; 1024 or 2048 are typical values. I tried setting it much higher, so that the entire file would be read, but it still failed to recognize the header if it was not on the first line. All my testing was done using Python 2.7.10.
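
The procedure described above (drop the leading blank lines, then ask the sniffer) can also be done in memory, without touching the file. The following is only a minimal sketch written for Python 3, not the code that was tested for this answer; the function name has_header_after_blanks, the sample_size parameter, and the SAMPLE text (which assumes plain "\n" line endings) are all made up for illustration.

```python
import csv
import itertools


def has_header_after_blanks(text, sample_size=2048):
    """Return True if the first non-blank line of `text` looks like a header."""
    lines = text.splitlines(keepends=True)
    # Drop leading lines that are empty or contain only whitespace.
    kept = itertools.dropwhile(lambda line: not line.strip(), lines)
    sample = "".join(kept)[:sample_size]
    if not sample:
        return False
    try:
        return csv.Sniffer().has_header(sample)
    except csv.Error:
        # Raised when the sniffer cannot determine a delimiter.
        return False


# Hypothetical sample: ten blank lines followed by a header row and data.
SAMPLE = "\n" * 10 + (
    "header,better,leader,fodder,blather,super\n"
    "1,2,3,,,\n"
    "4,5,6,7,8,9\n"
    "3,4,5,6,7,\n"
    "2,,,,,\n"
)

print(has_header_after_blanks(SAMPLE))
```

When reading from a real file in Python 3, the text would typically come from open(path, newline='').read(), since csv.Sniffer works on str rather than bytes.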

Here is an example of using csv.Sniffer in a script. It first determines whether a file has a recognizable header. If not, it renames the file, creates a new, empty file with the original name, and copies the renamed file's contents into the new file, excluding leading blank lines. Finally, it retests the new file for a header to determine whether removing the blank lines made a difference.

# Written for Python 2 (note the print statements)
import csv
from datetime import datetime
import os
import re
import shutil
import sys
import time

# delimiters treated as "common": space, tab, comma
common_delimiters = set([' ', '\t', ','])

def sniff(filepath):
    with open(filepath, 'rb') as csvfile:
        # guess the dialect from the first 2048 bytes
        dialect = csv.Sniffer().sniff(csvfile.read(2048))
        delimiter = repr(dialect.delimiter)
        if dialect.delimiter not in common_delimiters:
            print filepath,'has uncommon delimiter',delimiter
        else:
            print filepath,'has common delimiter',delimiter
        # rewind and test the same sample for a header row
        csvfile.seek(0)
        if csv.Sniffer().has_header(csvfile.read(2048)):
            print filepath, 'has a header'
            return True
        else:
            print filepath, 'does not have a header'
            return False

def remove_leading_blanks(filepath):
    # test filepath  for header and delimiter
    print 'testing',filepath,'with sniffer'
    has_header = sniff(filepath)
    if has_header:
        print 'no need to remove leading blank lines if any in',filepath
        return True
    # make copy of filepath appending current date-time to its name
    if os.path.isfile(filepath):
        now = datetime.now().strftime('%Y%d%m%H%M%S')
        m = re.search(r'(\.[A-Za-z0-9_]+)\Z',filepath)
        bakpath = ''
        if m is not None:
            # insert the timestamp before the file extension
            bakpath = filepath[:m.start(1)] + '.' + now + m.group(1)
        else:
            bakpath = filepath + '.' + now
        try:
            print 'renaming', filepath,'to', bakpath
            os.rename(filepath, bakpath)
        except OSError:
            print 'renaming operation failed:', sys.exc_info()[0]
            return False
        print 'creating a new',filepath,'from',bakpath,'minus leading blank lines'
        # now open renamed file and copy it to original filename
        # except for leading blank lines
        time.sleep(2)
        try:
            with open(bakpath) as o, open(filepath, 'w') as n:
                p = False  # becomes True once the first non-blank line is written
                for line in o:
                    if p or line.rstrip():
                        n.write(line)
                        p = True
        except IOError as e:
            print 'file copy operation failed: %s' % e.strerror   
            return False
        print 'testing new',filepath,'with sniffer'       
        has_header = sniff(filepath)
        if has_header:
            print 'the header problem with',filepath,'has been fixed'
            return True
        else:
            print 'the header problem with',filepath,'has not been fixed'
            return False

Given this CSV file, where the header is actually on line 11 (the ten leading blank lines are not shown here):

header,better,leader,fodder,blather,super
1,2,3,,,
4,5,6,7,8,9
3,4,5,6,7,
2,,,,,

remove_leading_blanks() determined that the file did not have a header, then removed the leading blank lines and determined that it did have one. Here is the trace of its console output:

testing test1.csv with sniffer...
test1.csv has uncommon delimiter '\r'
test1.csv does not have a header
renaming test1.csv to test1.20153108142923.csv
creating a new test1.csv from test1.20153108142923.csv minus leading blank lines
testing new test1.csv with sniffer
test1.csv has common delimiter ','
test1.csv has a header
the header problem with test1.csv has been fixed
done ok

While this may work much of the time, it does not appear reliable in general, because headers and their placement vary too much. However, maybe it's better than nothing.
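
For files where the header is preceded by non-blank junk (the "useless text on the first row" case from the question), one idea is to retry the sniffer from each of the first few lines and take the first offset it accepts. This is a rough Python 3 sketch of that idea, not something tested as part of this answer; find_header_row, max_skip, sample_size, and the EXAMPLE data are hypothetical, and the extra field-count check is an assumption added because has_header() alone can accept a lone text line.

```python
import csv
import io
import itertools


def find_header_row(lines, max_skip=20, sample_size=2048):
    """Return the 0-based index of the first line accepted as a header, or None.

    `lines` is a list of text lines that still carry their line endings.
    """
    sniffer = csv.Sniffer()
    for start in range(min(max_skip + 1, len(lines))):
        if not lines[start].strip():
            continue  # a blank line cannot be the header
        sample = "".join(lines[start:])[:sample_size]
        try:
            dialect = sniffer.sniff(sample)
            accepted = sniffer.has_header(sample)
        except csv.Error:
            continue  # no delimiter could be determined from this offset
        # Extra guard (an assumption of this sketch): the candidate row must
        # be followed by a row with the same number of fields.
        rows = list(itertools.islice(csv.reader(io.StringIO(sample), dialect), 2))
        if accepted and len(rows) == 2 and len(rows[0]) == len(rows[1]):
            return start
    return None


# Hypothetical file contents: one junk line, then the header and data rows.
EXAMPLE = (
    "some useless text\n"
    "header,better,leader,fodder,blather,super\n"
    "1,2,3,,,\n"
    "4,5,6,7,8,9\n"
    "3,4,5,6,7,\n"
    "2,,,,,\n"
).splitlines(keepends=True)

print(find_header_row(EXAMPLE))
```

The returned index could then be passed as skiprows to pandas.read_csv, with header=None used instead when the result is None. It inherits all the unreliability of the sniffer noted above, so it is best treated as a first guess.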

See csv.Sniffer, csv.py and _csv.c for more info. PyMOTW's csv – Comma-separated value files has a good tutorial review of the csv module with details on Dialects.
