How to use square brackets as a quote character in Pandas.read_csv

Problem description

Let's say I have a text file that looks like this:

Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]

What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:

import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

I've tried to set quotechar to "[", but that obviously just eats up the lines until the next open bracket, and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
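
For reference, the attempted call looked roughly like this (a sketch, reusing the file path from the question); read_csv only accepts a single-character quotechar, which is why passing "[]" raises the length error:

import pandas as pd

#sketch of the attempted workaround: only the opening bracket can be passed,
#so everything up to the next "[" gets swallowed as one quoted field
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str, quotechar="[")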

There were three primary solutions that were offered: 1) give a long range of names to the data frame so all of the data can be read in, then post-process it, 2) find the values in square brackets and put quotes around them (sketched below), or 3) replace the first n commas with semicolons.

Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data) because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds, compared to solution 2, which ran in 3.02 seconds. The tests were run on a text file containing 18 columns and more than 208,000 rows.
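
Solution 2 is not spelled out in the answer below, so here is a minimal sketch of it (assuming the same file path as in the question): quote every bracketed group with a regular expression before handing the text to read_csv, so its default quote handling keeps the embedded commas inside one field.

import io
import re

import pandas as pd

#sketch of solution 2: wrap each [...] group in double quotes so that
#read_csv treats it as a single quoted field
with open("path/to/file.txt", "r") as f:
    text = re.sub(r"(\[[^\]]*\])", r'"\1"', f.read())

df = pd.read_csv(io.StringIO(text), sep=",", dtype=str)
print(df)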

Recommended answer

I think you can replace the first 3 occurrences of , in each line of the file with ; and then use the parameter sep=";" in read_csv (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html):

import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()

#replace only the first 3 commas of each line, leaving the commas
#inside the bracketed Location values untouched
fo = io.StringIO()
fo.writelines(line.replace(',', ';', 3) for line in lines)
fo.seek(0)

df = pd.read_csv(fo, sep=';')
print(df)
   Item        Date   Time                            Location
0     1  01/01/2016  13:41                 [45.2344:-78.25453]
1     2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
2     3  01/10/2016  01:27                 [51.2344:-86.24432]

Or you can try this more complicated approach, because the main problem is that the separator , between values in the lists is the same as the separator between the other column values.

So you need post-processing:

import pandas as pd
import io

temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
#after testing, replace io.StringIO(temp) with the filename
#names=range(10) is an estimated maximum number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print(df)
      0           1      2                    3               4  \
0  Item        Date   Time             Location             NaN   
1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN   
2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242   
3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN   

                 5   6   7   8   9  
0              NaN NaN NaN NaN NaN  
1              NaN NaN NaN NaN NaN  
2  41.2342:-81242] NaN NaN NaN NaN  
3              NaN NaN NaN NaN NaN  

#remove columns that contain only NaN
df = df.dropna(how='all', axis=1)
#use the first row as the column names
df.columns = df.iloc[0,:]
#remove the first row
df = df[1:]
#remove the columns index name
df.columns.name = None

#get position of column Location
print(df.columns.get_loc('Location'))
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print(df1)
              Location             NaN              NaN
1  [45.2344:-78.25453]             NaN              NaN
2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
3  [51.2344:-86.24432]             NaN              NaN

#combine values to one column
df['Location'] = df1.apply(lambda x: ', '.join([e for e in x if isinstance(e, str)]), axis=1)

#subset of desired columns
print(df[['Item','Date','Time','Location']])
  Item        Date   Time                                           Location
1    1  01/01/2016  13:41                                [45.2344:-78.25453]
2    2  01/03/2016  19:11  [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3    3  01/10/2016  01:27                                [51.2344:-86.24432]
