How to make the separator in pandas read_csv more flexible with respect to whitespace, for irregular separators?


Problem description

I need to create a data frame by reading data from a file with the read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), others by spaces. Moreover, some columns can be separated by 2, 3, or more spaces, or even by a combination of spaces and tabs (for example 3 spaces, two tabs, and then 1 space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem in plain Python. I use:

with open(file_name) as f:
    for line in f:
        fld = line.split()  # splits on any run of whitespace

And it works perfectly. It does not care whether there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?

Recommended answer

According to the documentation, you can use either a regular expression as the delimiter or delim_whitespace=True (note that delim_whitespace has been deprecated since pandas 2.2 in favor of sep=r"\s+"):

>>> import pandas as pd
>>> for line in open("whitespace.csv"):
...     print(repr(line))
...     
'a\t  b\tc 1 2\n'
'd\t  e\tf 3 4\n'
>>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
>>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
   0  1  2  3  4
0  a  b  c  1  2
1  d  e  f  3  4
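
For anyone who wants to try this without creating a file, here is a minimal self-contained sketch. The sample string `raw` is made up for illustration (it is not the asker's data), and `io.StringIO` stands in for the file. It parses the text with `sep=r"\s+"` and compares the result with the plain `str.split()` approach from the question.

```python
import io

import pandas as pd

# Hypothetical sample data: fields separated by irregular mixes of tabs and spaces.
raw = "a\t  b\tc 1 2\nd\t  e\tf   3 \t 4\n"

# sep=r"\s+" treats any run of whitespace (spaces and/or tabs) as one delimiter.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")

# The plain-Python approach from the question: str.split() with no arguments.
rows = [line.split() for line in raw.splitlines()]

print(df)
print(rows)
```

One difference worth keeping in mind: `str.split()` leaves every field as a string, while `read_csv` infers column types, so the numeric columns here come back as integers.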
