How to use square brackets as a quote character in Pandas.read_csv
Question
Let's say I have a text file that looks like this:
Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
What I'd like to do is read that in with pandas.read_csv, but the second row throws an error. Here is the code I'm currently using:
import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)
I've tried setting quotechar to "[", but that just swallows the lines up to the next opening bracket, and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
Three main solutions were offered: 1) give the data frame a long range of column names so that all the data can be read in, then post-process it; 2) find the values in square brackets and put quotes around them; or 3) replace the first n commas with semicolons.
Overall, I don't think option 3 is a viable solution in general (although it works fine for my data), because a) a quoted value in one of the columns might itself contain commas, and b) the column with square brackets might not be the last column. That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient: it ran in 1.38 seconds, compared with 3.02 seconds for solution 2. The tests were run on a text file containing 18 columns and more than 208,000 rows.
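Solution 2 is not shown in the answer below, so here is a minimal sketch of what it might look like. It assumes the bracketed values are never nested and never contain a double quote; the regular expression and the variable names (raw, quoted) are my own illustration, not the code that was timed above.

```python
import io
import re

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
raw = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

# Wrap every [...] group in double quotes so that the default quotechar
# protects the commas inside the brackets.
quoted = re.sub(r"(\[[^\]]*\])", r'"\1"', raw)

df = pd.read_csv(io.StringIO(quoted), sep=",", dtype=str)
print(df)
```

Because the brackets stay inside the quoted field, the Location values come back unchanged, and the approach does not depend on which column holds the bracketed values.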
Recommended answer
I think you can replace the first 3 occurrences of , in each line of the file with ; and then use the parameter sep=";" in read_csv:
import pandas as pd
import io
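# read the raw lines of the file
with open('file2.csv', 'r') as f:
    lines = f.readlines()
# replace only the first 3 commas in each line; commas inside the brackets are left alone
fo = io.StringIO()
fo.writelines(line.replace(',', ';', 3) for line in lines)
fo.seek(0)
df = pd.read_csv(fo, sep=';')
print(df)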
Item Date Time Location
0 1 01/01/2016 13:41 [45.2344:-78.25453]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242]
2 3 01/10/2016 01:27 [51.2344:-86.24432]
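The 3 above is hard-coded for this file. As a rough sketch, the number of separators could instead be counted from the header line, assuming (as in the question's data) that the bracketed column is the last one and that no earlier field contains a comma. The names raw, n_seps and converted are illustrative only.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
raw = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
"""

lines = raw.splitlines(keepends=True)

# The header has one comma fewer than it has columns; use that count
# instead of a hard-coded 3.
n_seps = lines[0].count(",")

# Replace only the first n_seps commas in each line with semicolons.
converted = "".join(line.replace(",", ";", n_seps) for line in lines)

df = pd.read_csv(io.StringIO(converted), sep=";", dtype=str)
print(df)
```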
Or you can try this more complicated approach. The main problem is that the separator , between the values in the lists is the same as the separator between the other columns, so you need post-processing:
import pandas as pd
import io
temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
# after testing, replace io.StringIO(temp) with the filename
# names: estimated maximum number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print(df)
0 1 2 3 4 \
0 Item Date Time Location NaN
1 1 01/01/2016 13:41 [45.2344:-78.25453] NaN
2 2 01/03/2016 19:11 [43.3423:-79.23423 41.2342:-81242
3 3 01/10/2016 01:27 [51.2344:-86.24432] NaN
5 6 7 8 9
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 41.2342:-81242] NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
# remove the columns that contain only NaN
df = df.dropna(how='all', axis=1)
# use the first row as the column names
df.columns = df.iloc[0,:]
# remove the first row
df = df[1:]
# remove the name of the columns index
df.columns.name = None
# get the position of the Location column
print(df.columns.get_loc('Location'))
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print(df1)
Location NaN NaN
1 [45.2344:-78.25453] NaN NaN
2 [43.3423:-79.23423 41.2342:-81242 41.2342:-81242]
3 [51.2344:-86.24432] NaN NaN
# combine the values into one column, skipping the NaN padding
df['Location'] = df1.apply(lambda x: ', '.join([e for e in x if isinstance(e, str)]), axis=1)
# subset of the desired columns
print(df[['Item','Date','Time','Location']])
Item Date Time Location
1 1 01/01/2016 13:41 [45.2344:-78.25453]
2 2 01/03/2016 19:11 [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3 3 01/10/2016 01:27 [51.2344:-86.24432]
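For reference, the steps above could be condensed into a single helper along these lines. This is only a sketch under a few assumptions: the function name read_with_bracket_column and its parameters are hypothetical, max_cols has to be at least as large as the widest row, and every column from list_col to the end is treated as part of the bracketed list. It joins the pieces with a plain "," rather than ", " so that the original text is restored exactly.

```python
import io

import pandas as pd

# Sample data from the answer above, inlined so the sketch is self-contained.
raw = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""


def read_with_bracket_column(buf, max_cols=10, list_col="Location"):
    """Hypothetical helper: read wide, then glue the split list back together."""
    # Read everything as strings into max_cols generic columns.
    wide = pd.read_csv(buf, header=None, names=list(range(max_cols)), dtype=str)
    # The first row holds the real column names; the rest of it is NaN padding.
    header = wide.iloc[0].dropna().tolist()
    body = wide.iloc[1:].reset_index(drop=True)
    start = header.index(list_col)
    # Join the non-missing pieces of the list column back with commas.
    merged = body.iloc[:, start:].apply(lambda row: ",".join(row.dropna()), axis=1)
    out = body.iloc[:, :start].copy()
    out.columns = header[:start]
    out[list_col] = merged
    return out


df = read_with_bracket_column(io.StringIO(raw))
print(df)
```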