Fast way to transpose and concat CSV files in Python?
Question
I am trying to transpose multiple files of the same format and concatenate them into one big CSV file. I wanted to use numpy for the transposing because it is really fast, but it somehow skips all my headers, which I need. These are my files:
testfile1.csv
time,topic1,topic2,country
2015-10-01,20,30,usa
2015-10-02,25,35,usa
testfile2.csv
time,topic3,topic4,country
2015-10-01,40,50,uk
2015-10-02,45,55,uk
This is my code to transpose and merge all CSV files into one big file:
from numpy import genfromtxt
import csv

file_list=['testfile1.csv','testfile2.csv']

def transpose_append(csv_file):
    my_data = genfromtxt(item, delimiter=',',skip_header=0)
    print my_data, "my_data, not transposed"
    if i == 0:
        transposed_data = my_data.T
        print transposed_data, "transposed_data"
        for row in transposed_data:
            print row, "row from first file"
            csv_writer.writerow([row])
    else:
        transposed_data = my_data.T
        for row in transposed_data:
            print row, "row from second file"
            csv_writer.writerow([row][:1])

with open("combined_transposed_file.csv", 'wb') as outputfile:
    csv_writer = csv.writer(outputfile)
    for i,item in enumerate(file_list):
        transpose_append(item)
    outputfile.close()
This is the output of the print statements. It shows that the transposing somewhat works, but my headers are missing:
[[ nan nan nan nan]
[ nan 20. 30. nan]
[ nan 25. 35. nan]] my_data, not transposed
[[ nan nan nan]
[ nan 20. 25.]
[ nan 30. 35.]
[ nan nan nan]] transposed_data
This is my expected output:
,2015-10-01,2015-10-02,country
topic1,20,25,usa
topic2,30,35,usa
topic3,40,45,uk
topic4,50,55,uk
Recommended answer
There are various ways of handling headers in genfromtxt. The default is to treat them as part of the data:
In [6]: txt="""time,topic1,topic2,country
...: 2015-10-01,20,30,usa
...: 2015-10-02,25,35,usa"""
In [7]: data=np.genfromtxt(txt.splitlines(),delimiter=',',skip_header=0)
In [8]: data
Out[8]:
array([[ nan, nan, nan, nan],
[ nan, 20., 30., nan],
[ nan, 25., 35., nan]])
But since the default dtype is float, the strings all appear as nan.
You can treat them as headers with names=True - the result is a structured array. The headers now appear as the field names in data.dtype.names.
In [9]: data=np.genfromtxt(txt.splitlines(),delimiter=',',names=True)
In [10]: data
Out[10]:
array([(nan, 20.0, 30.0, nan), (nan, 25.0, 35.0, nan)],
dtype=[('time', '<f8'), ('topic1', '<f8'), ('topic2', '<f8'), ('country', '<f8')])
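As a minimal sketch of how the structured array above might be used (assuming NumPy is installed; the variable names headers and topic1 are illustrative, not part of the original answer), the header names can be read from data.dtype.names and each column can be selected by its field name:

```python
import numpy as np

txt = """time,topic1,topic2,country
2015-10-01,20,30,usa
2015-10-02,25,35,usa"""

# names=True consumes the first line as field names instead of data.
data = np.genfromtxt(txt.splitlines(), delimiter=",", names=True)

# The header strings survive as the field names of the structured dtype.
headers = data.dtype.names

# Columns are selected by field name; numeric columns load as floats.
topic1 = data["topic1"]

# Text columns are still parsed as float here, so they come back as nan.
time_is_nan = bool(np.isnan(data["time"]).all())

print(headers)
print(topic1)
```

Note that this keeps the headers but still loses the date and country values, because the dtype is still float.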
With dtype=None, you let genfromtxt choose the dtype. Because the header line is read as data here, every column contains at least one string, so it loads everything as S10.
In [11]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=None)
In [12]: data
Out[12]:
array([['time', 'topic1', 'topic2', 'country'],
['2015-10-01', '20', '30', 'usa'],
['2015-10-02', '25', '35', 'usa']],
dtype='|S10')
This matrix can be transposed, and printed or written to a csv file:
In [13]: data.T
Out[13]:
array([['time', '2015-10-01', '2015-10-02'],
['topic1', '20', '25'],
['topic2', '30', '35'],
['country', 'usa', 'usa']],
dtype='|S10')
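The transposed string array can be handed to the csv module row by row. The following is only a sketch under a few assumptions: Python 3, dtype=str instead of dtype=None so that the values load as text strings rather than byte strings, and an in-memory io.StringIO buffer standing in for a real output file.

```python
import csv
import io

import numpy as np

txt = """time,topic1,topic2,country
2015-10-01,20,30,usa
2015-10-02,25,35,usa"""

# dtype=str loads every field, including the header line, as text.
data = np.genfromtxt(txt.splitlines(), delimiter=",", dtype=str)

# An in-memory buffer stands in for open("some_file.csv", "w", newline="").
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")

# .T swaps rows and columns; tolist() yields plain Python strings.
writer.writerows(data.T.tolist())

output = buf.getvalue()
print(output)
```

Each original column, header included, becomes one output row.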
Since I'm using genfromtxt to load, I could use savetxt to save:
In [26]: with open('test.txt','w') as f:
   ....:     np.savetxt(f, data.T, delimiter=',', fmt='%12s')
   ....:     np.savetxt(f, data.T, delimiter=';', fmt='%10s')   # simulate a 2nd array
   ....:
In [27]: cat test.txt
time, 2015-10-01, 2015-10-02
topic1, 20, 25
topic2, 30, 35
country, usa, usa
time;2015-10-01;2015-10-02
topic1; 20; 25
topic2; 30; 35
country; usa; usa
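The plain transpose above keeps time and country as rows of their own, whereas the expected output in the question uses the dates as the header row and repeats the country at the end of each topic row. One possible way to get that layout is sketched below using only the standard csv module rather than numpy. The helper names transpose_topics and combine are hypothetical, io.StringIO objects stand in for the real files, and the sketch assumes every file has the same dates and a single country value.

```python
import csv
import io

def transpose_topics(rows):
    """Turn [time, topic..., country] rows into one output row per topic."""
    header, body = rows[0], rows[1:]
    dates = [r[0] for r in body]
    country = body[0][-1]  # assumption: one country per file
    out_header = [""] + dates + ["country"]
    out_rows = [
        [name] + [r[col] for r in body] + [country]
        for col, name in enumerate(header[1:-1], start=1)
    ]
    return out_header, out_rows

def combine(sources):
    """Transpose each source and stack the topic rows under one header."""
    combined = []
    for i, src in enumerate(sources):
        out_header, out_rows = transpose_topics(list(csv.reader(src)))
        if i == 0:
            combined.append(out_header)  # write the date header only once
        combined.extend(out_rows)
    return combined

# In-memory stand-ins for testfile1.csv and testfile2.csv.
file1 = io.StringIO("time,topic1,topic2,country\n2015-10-01,20,30,usa\n2015-10-02,25,35,usa\n")
file2 = io.StringIO("time,topic3,topic4,country\n2015-10-01,40,50,uk\n2015-10-02,45,55,uk\n")

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(combine([file1, file2]))
result = buf.getvalue()
print(result)
```

With real files, the StringIO objects would be replaced by files opened with open(name, newline="") and the buffer by the combined output file.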