Remove unwanted non-printable characters from large CSV files with millions of records in Python 3 or 2.7


Problem description

I receive large CSV files (delimited with a comma, | or ^) containing millions of records.
Some of the fields contain non-printable characters such as CR|LF, which are then interpreted as the end of a line. This is on Windows 10.

I need to write Python code that goes through the file and removes the CR|LF inside fields. However, I can't simply remove every CR|LF, because then all the lines would be merged into one.

I have gone through several posts here on how to remove non-printable characters. My idea was to load the file into a pandas DataFrame, then check every field for CR|LF and remove it. That seems a bit complicated. If you have quick code for this, it would be a great help.

Thanks.

Sample file:

record1, 111. texta, textb CR|LF
record2, 111. teCR|LF
xta, textb CR|LF
record3, 111. texta, textb CR|LF

The sample output file should be:

record1, 111. texta, textb CR|LF
record2, 111. texta, textb CR|LF
record3, 111. texta, textb CR|LF

CR = carriage return = 0x0D, LF = line feed = 0x0A
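
Before repairing anything, it can help to confirm which physical lines are actually broken. The following is a minimal sketch, assuming the delimiter never appears inside a field; the helper name `field_count_histogram` is hypothetical. It counts how many fields each physical line has. In a real file the most frequent count is presumably the true number of fields, and lines with fewer fields are candidates for records split by a stray CR|LF.

```python
from collections import Counter
import io


def field_count_histogram(lines, delim=','):
    """Map 'number of fields on a line' to 'number of physical lines'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip('\r\n')  # drop only the line ending
        if line:                    # ignore completely empty lines
            counts[len(line.split(delim))] += 1
    return counts


# The sample file from the question, held in memory for illustration.
sample = io.StringIO(
    "record1, 111. texta, textb\n"
    "record2, 111. te\n"
    "xta, textb\n"
    "record3, 111. texta, textb\n"
)
hist = field_count_histogram(sample)
for n_fields, n_lines in sorted(hist.items()):
    print(n_fields, "fields:", n_lines, "lines")
```

For a real file, pass the open file object instead of the `io.StringIO` sample; the function reads line by line, so it should not need to hold millions of records in memory.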

Recommended answer

Run this script (e.g. name it fix_csv.py) on your file to sanitize it:

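#!/usr/bin/env python3

import sys

if len(sys.argv) < 3:
    sys.stderr.write('Please give the input filename and an output filename.\n')
    sys.exit(1)

# set the correct number of fields per record
nf = 3
# set the delimiter
delim = ','

inpf = sys.argv[1]
outf = sys.argv[2]

# In text mode Python translates '\n' to the platform line ending on write,
# so writing os.linesep here would produce '\r\r\n' on Windows.
newline = '\n'

with open(inpf, 'r') as inf, open(outf, 'w') as of:
    cache = []  # fields of a record that is still incomplete
    for line in inf:
        # drop only the line ending; keep spaces that belong to the fields
        line = line.rstrip('\r\n')
        ls = line.split(delim)
        if len(ls) < nf or cache:
            if not cache:
                # first fragment of a broken record
                cache = ls
            else:
                # continuation: glue the first piece onto the last cached field
                cache[-1] += ls[0]
                cache = cache + ls[1:]
            # '>=' so that an over-long record is flushed instead of growing forever
            if len(cache) >= nf:
                of.write(delim.join(cache) + newline)
                cache = []
        else:
            of.write(line + newline)
    if cache:
        # incomplete record at end of file: write it out rather than drop it
        of.write(delim.join(cache) + newline)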

Call it like this (on Windows, python fix_csv.py input.dat output.dat should work as well):

./fix_csv.py input.dat output.dat

Output:

record1, 111. texta, textb
record2, 111. texta, textb
record3, 111. texta, textb
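
The same merge-by-field-count idea can also be packaged as a generator, which makes it easy to try on a few lines in memory before running it over a multi-million-record file. This is only a sketch under the same assumptions as the script (a fixed number of fields, and a delimiter that never occurs inside a field); the name `repair_records` is hypothetical and not part of the original answer.

```python
import io


def repair_records(lines, nf, delim=','):
    """Yield records with nf fields, gluing together lines split by a stray CR/LF."""
    cache = []  # fields of the record currently being assembled
    for line in lines:
        line = line.rstrip('\r\n')  # drop only the line ending
        if not line and not cache:
            continue  # skip blank lines between records
        fields = line.split(delim)
        if cache:
            # continuation: the first piece belongs to the last cached field
            cache[-1] += fields[0]
            cache.extend(fields[1:])
        else:
            cache = fields
        if len(cache) >= nf:
            yield delim.join(cache)
            cache = []
    if cache:
        # incomplete record at end of input, emitted as-is
        yield delim.join(cache)


# The sample file from the question, held in memory for illustration.
sample = io.StringIO(
    "record1, 111. texta, textb\n"
    "record2, 111. te\n"
    "xta, textb\n"
    "record3, 111. texta, textb\n"
)
repaired = list(repair_records(sample, nf=3))
for record in repaired:
    print(record)
```

Because it only keeps one record in memory at a time, the generator can be fed an open file object directly. Note that this approach cannot detect a line break inside the last field of a record: by then the record already has nf fields, so the leftover fragment is treated as the start of the next record.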
