Python DictReader on a CSV file with NUL bytes in the data


Question




I have a CSV file with NUL bytes embedded in some of the data.

That is, given columns A, B, C and D, one of the fields in column C contains data like this:

,<quote character>Some Data<NUL>More Data<NUL>End of data<quote character>,

When I open the file with LibreOffice Calc, the NUL characters do not appear in the display, and if I save it by hand they go away. I can see the NUL characters in vi and could remove or replace them with tr or by hand in vi, but I want the Python program to handle them automatically.

The DictReader loop is

for row in infile:

and it is the for statement itself that throws the exception. The except is therefore outside the loop, so execution cannot go back to fetch the next line (or let me change the NUL character to a space or an embedded comma and process that line).

Luckily, this data appears to be invalid in other ways too, so I would probably skip it in any event. The question, though, is how to tell Python to move on to the next line.

Solution

So this is a bit ugly, but it seems to work. You can read a line as normal, clean the offending bytes, then use a StringIO object to pass it to DictReader. Here's the code, assuming your CSV has a header record (it should be simpler if it doesn't):

#!/usr/bin/env python

# Python 2 only (StringIO module, reader.next(), print statement)
import StringIO
import csv

fin = open('somefilewithnulls', 'rb')
fout = StringIO.StringIO()
reader = csv.DictReader(fout)

while True:
    # for the first record prep StringIO with the first
    # two lines so DictReader can create header
    line = fin.readline() if fin.tell() else fin.readline() + fin.readline()
    if not len(line):
        break

    # clean the line before passing it to DictReader
    line = line.replace('\x00', '') 

    fout.write(line)
    fout.seek(-len(line), 1)

    rec = reader.next()
    print rec
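
For readers on Python 3, the same idea (clean each line before DictReader sees it) can be sketched more compactly. This is not the code from the answer above, only a minimal sketch under a few assumptions: the file can be opened in text mode, and no quoted field contains an embedded newline. It relies on csv.DictReader accepting any iterable of strings, so a generator can strip the NUL characters on the fly. The helper names strip_nuls and read_rows are hypothetical, chosen here for illustration.

```python
import csv
import io


def strip_nuls(lines):
    """Yield each line with NUL characters removed (hypothetical helper)."""
    for line in lines:
        yield line.replace("\x00", "")


def read_rows(text_stream):
    """Return the CSV rows as a list of dicts, ignoring NUL characters."""
    return list(csv.DictReader(strip_nuls(text_stream)))


# Demo with in-memory data. For a real file, pass
# open("somefilewithnulls", newline="") instead of the StringIO object.
sample = io.StringIO(
    'A,B,C,D\n'
    '1,2,"Some Data\x00More Data\x00End of data",4\n'
)
rows = read_rows(sample)
print(rows)
```

Because the cleaning happens lazily inside the generator, there is no need to juggle a shared StringIO buffer or to treat the header line specially. To replace each NUL with a space rather than drop it, change the second argument of replace accordingly.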
