Remove newlines from a CSV file's string column

Question

I have a CSV file with multiple fields. A few of the string fields contain data that spans multiple lines. I want to collapse each of those multi-line values into a single line.

Input data:

1, "asdsdsdsds", "John"
2, "dfdhifdkinf
dfjdfgkdnjgknkdjgndkng
dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"

Expected output:

1, "asdsdsdsds", "John"
2, "dfdhifdkinf dfjdfgkdnjgknkdjgndkng dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"

The same question was asked on SO earlier, but that solution used PowerShell. Is it possible to achieve the same result using Python, pandas, or PySpark?

Whenever the data spans multiple lines, it is guaranteed to be enclosed in double quotes.

What I have tried

I am able to read the data without any issues using pandas and PySpark, even though some fields span multiple lines.

pandas:

import pandas as pd
pandas_df = pd.read_csv("file.csv")

PySpark

df = (sqlContext.read.format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .option("delimiter", ",").option("escape", '\\').option("escape", ':')
      .option("parserLib", "univocity").option("multiLine", "true")
      .load("file.csv"))

The CSV file can have any number of fields, and the multi-line data can occur in any of them.

Recommended answer

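def weird_gen(s):
    # Assumes exactly three fields per record, no commas inside quoted
    # values, and no newline inside the last field.
    s = [s]
    while s and s[0].strip():  # stop on the empty remainder after the last record
        # Peel off the first two fields; `a` is the third field plus the rest of the file.
        *x, a = s[0].split(',', 2)
        # The third field ends at the next newline; `s` keeps what follows, if anything.
        y, *s = a.split('\n', 1)
        # Replace the newlines embedded in each field with spaces.
        yield ', '.join(z.strip().replace('\n', ' ') for z in x + [y])

with open('bad.csv') as f:
    print('\n'.join(weird_gen(f.read())))

Output:
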
1, "asdsdsdsds", "John"
2, "dfdhifdkinf dfjdfgkdnjgknkdjgndkng dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"
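
The generator above splits each record on its first two commas, so it is tied to three-field rows, while the question mentions any number of fields. As a possible alternative, here is a minimal sketch that relies only on the guarantee from the question that multi-line values are always wrapped in double quotes. It scans the text once and replaces line breaks only while inside a quoted field. The function name `join_quoted_newlines` and its `replacement` parameter are made up for illustration, and the sketch assumes that `"` is used only as the quote character (with `""` as the escaped form), not backslash-escaped.

```python
def join_quoted_newlines(text, replacement=" "):
    """Replace line breaks that occur inside double-quoted fields."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # A doubled quote ("") toggles the state twice, so it stays correct.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch == "\n":
            # Line break inside a quoted field: substitute the replacement.
            out.append(replacement)
        elif in_quotes and ch == "\r":
            # Drop carriage returns inside a quoted field (e.g. the CR of CRLF).
            continue
        else:
            # Everything else, including record-ending newlines, is kept as-is.
            out.append(ch)
    return "".join(out)


sample = (
    '1, "asdsdsdsds", "John"\n'
    '2, "dfdhifdkinf\n'
    'dfjdfgkdnjgknkdjgndkng\n'
    'dkfdkjfnjdnf", "Roy"\n'
    '3, "dfjfdkgjfgn", "Rahul"\n'
)
print(join_quoted_newlines(sample), end="")
```

For a real file, the text could be read with `open("file.csv", newline="")` so that Python does not translate line endings before the scan. Because the rest of the text is copied through unchanged, the original spacing and quoting of every field are preserved.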
