Reading selected column only from CSV file, when all other columns are guaranteed to be identical
Question
I have a bunch of CSV files that I'm trying to concatenate into a single CSV file. The fields in each file are separated by a single space and look like this:
'initial', 'pos', 'orientation', 'ratio'
'chr', '106681', '+', '0.06'
'chr', '106681', '+', '0.88'
'chr', '106681', '+', '0.01'
'chr', '106681', '+', '0.02'
As you can see, all the values are the same except for ratio. The concatenated file I am creating will look like this:
'filename','initial', 'pos', 'orientation', 'ratio1','ratio2','ratio3'
'jon' , 'chr', '106681', '+', '0.06' , '0.88' ,'0.01'
So basically, I'll be iterating through each file, storing only one value of initial, pos, and orientation but all the values of ratio, and updating the table in the concatenated file. This is proving much more confusing than I thought it would be. I have the following piece of code to read the CSV files:
import csv

# Open in text mode with newline='' as the csv module recommends
with open('josh.csv', newline='') as concatenated_file:
    reader = csv.reader(concatenated_file)
    for row in reader:
        print(row)
which gives:
['chrom', 'pos', 'strand', 'meth_ratio']
['chr2', '106681786', '+', '0.06']
['chr2', '106681796', '+', '0.88']
['chr2', '106681830', '+', '0.01']
['chr2', '106681842', '+', '0.02']
It would be really helpful if someone could show me how to store only one value of initial, pos, and orientation (because they remain the same) but all the values of ratio.
Solution
This is a one-liner with pandas.read_csv(), and we can even drop the quoting too:
import pandas as pd
csva = pd.read_csv('a.csv', header=0, quotechar="'", delim_whitespace=True)
csva['ratio']
0 0.06
1 0.88
2 0.01
3 0.02
Name: ratio, dtype: float64
A couple of points:
- Actually, your separator is comma + whitespace. In that sense it's not plain-vanilla CSV. See "How to make separator in read_csv more flexible?"
- Note that we dropped the quoting on numeric fields by setting quotechar="'".
- If you really insist on saving memory (don't), you can drop all columns of csva other than 'ratio' after you do the read_csv. See the pandas doc.
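The points above, together with the wide row the question asks for, might be sketched as follows. This is a minimal illustration, not a definitive implementation. It assumes the files really are comma + space separated with single-quoted fields, and that the first three columns are identical in every row. The helper names read_quoted_csv and wide_row, the inline SAMPLE data, and the 'jon' label are hypothetical and used only for the example. It uses sep="," with skipinitialspace=True as one way to handle the comma + whitespace separator, and usecols to load only the ratio column.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the layout in the question:
# comma + space separated, every field wrapped in single quotes.
SAMPLE = """'initial', 'pos', 'orientation', 'ratio'
'chr', '106681', '+', '0.06'
'chr', '106681', '+', '0.88'
'chr', '106681', '+', '0.01'
"""


def read_quoted_csv(source, **kwargs):
    """Parse the layout above; extra keyword arguments go to read_csv."""
    return pd.read_csv(
        source,
        sep=",",
        skipinitialspace=True,  # ignore the space that follows each comma
        quotechar="'",          # strip the single quotes around each field
        **kwargs,
    )


def wide_row(name, source):
    """Build one row: the shared columns once, then ratio1..ratioN."""
    df = read_quoted_csv(source)
    # Assumption: these three columns are identical in every row,
    # so the first row is representative.
    shared = df.loc[0, ["initial", "pos", "orientation"]].to_dict()
    ratios = {f"ratio{i}": r for i, r in enumerate(df["ratio"], start=1)}
    return {"filename": name, **shared, **ratios}


# Load only the ratio column (the memory-saving variant).
ratio_only = read_quoted_csv(io.StringIO(SAMPLE), usecols=["ratio"])

# One wide row per input file; with real files, pass a path instead of StringIO.
row = wide_row("jon", io.StringIO(SAMPLE))
combined = pd.DataFrame([row])
```

With several input files, calling wide_row once per file and passing the resulting list of dicts to pd.DataFrame gives one row per file. Files with different numbers of ratio values would leave missing entries (NaN) in the shorter rows.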