Count lines with the same value in a column in Python
Question
I'm trying to reproduce the R aggregate() function in Python, but without concatenating. For each line, I just want to count the number of lines that have the same value in a given column.
I'm trying to work it out from a piece of code taken from here: http://timotheepoisot.fr/2011/12/01/the-aggregate-function-in-python/
The modifications I made are marked with ###. The problem I am currently having is that the first column [0] contains character strings, and the code seems to work only with floats.
import numpy as np
import scipy as sp

def MSD(vec):
    return [np.mean(vec),np.std(vec)]

def aggregate(df,by=0,to=1,func=np.sum):
    Dat = []
    # ColBy = df.T[by]
    ColBy = int(df.T[by][3:]) ### my attempt to read only the numbers in the first column's character strings
    ColTo = df.T[to]
    UniqueBy = np.sort(np.unique(ColBy))
    for ub in UniqueBy:
        uTo = ColTo[ColBy==ub]
        Out = func(uTo)
        # Dat.append(np.concatenate(([ub],Out)))
        Dat.append([ub],Out) ### because I do not want to concatenate
    return Dat

test_df = np.loadtxt('in_test.txt')
Agr = aggregate(test_df,0,3,MSD)
sp.savetxt("out_test.txt", Agr)
Here is the error message:
Traceback (most recent call last):
File "count_same_reads.py", line 30, in <module>
test_df = np.loadtxt('in_test.txt')
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 796, in loadtxt
items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: could not convert string to float: Tag19184
My data is tab-delimited and contains mostly strings, except for column 3, into which I want to write the number of occurrences of lines.
Here is the test data:
Tag19184 CTAAC hffef 1 a 36 - chr1 10006 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10012 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10018 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10024 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10030 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10036 0 36M 36
Tag19184 CTAAC hffef 1 a 36 - chr1 10042 0 36M 36
Tag20198 CTAAC hffef 1 a 36 - chr1 10048 0 36M 36
Tag20198 CTAAC hffef 1 a 36 - chr1 10054 0 36M 36
Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36
The result should look like this:
Tag19184 CTAAC hffef 7 a 36 - chr1 10006 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10012 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10018 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10024 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10030 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10036 0 36M 36
Tag19184 CTAAC hffef 7 a 36 - chr1 10042 0 36M 36
Tag20198 CTAAC hffef 2 a 36 - chr1 10048 0 36M 36
Tag20198 CTAAC hffef 2 a 36 - chr1 10054 0 36M 36
Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36
As you can probably tell, I'm not so good at Python yet. Any advice would be welcome.
PS. The data is already sorted by column [0].
Recommended answer
I would suggest pandas, especially since with genomic data like yours the data may be quite large:
In [44]:
#you can read your data with pandas.read_csv()
import pandas as pd
print df
v0 v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
0 Tag19184 CTAAC hffef 1 a 36 - chr1 10006 0 36M 36
1 Tag19184 CTAAC hffef 1 a 36 - chr1 10012 0 36M 36
2 Tag19184 CTAAC hffef 1 a 36 - chr1 10018 0 36M 36
3 Tag19184 CTAAC hffef 1 a 36 - chr1 10024 0 36M 36
4 Tag19184 CTAAC hffef 1 a 36 - chr1 10030 0 36M 36
5 Tag19184 CTAAC hffef 1 a 36 - chr1 10036 0 36M 36
6 Tag19184 CTAAC hffef 1 a 36 - chr1 10042 0 36M 36
7 Tag20198 CTAAC hffef 1 a 36 - chr1 10048 0 36M 36
8 Tag20198 CTAAC hffef 1 a 36 - chr1 10054 0 36M 36
9 Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36
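The session above does not show how df was loaded. A minimal sketch of one way to do it, written for Python 3 and assuming the file is tab-delimited with no header row, as the question describes; the column names v0..v11 simply mirror the printout above, and the in-memory buffer stands in for the file in_test.txt:

```python
import io
import pandas as pd

# Two sample rows from the question, tab-delimited; in practice pass the
# path 'in_test.txt' to read_csv instead of this in-memory buffer.
sample = (
    "Tag19184\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10006\t0\t36M\t36\n"
    "Tag20198\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10048\t0\t36M\t36\n"
)

# The file has no header row, so name the columns v0..v11 explicitly.
names = ['v%d' % i for i in range(12)]
df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=names)

# String columns stay as objects; numeric columns such as v3 are parsed as integers.
print(df.dtypes)
```

Unlike np.loadtxt with its default settings, read_csv infers a type per column, so the string tags in the first column do not trigger the float conversion error shown in the question.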
In [45]:
#if we want to group by the first 3 fields
df.groupby(['v0','v1','v2'])['v3'].transform('sum')
Out[45]:
0 7
1 7
2 7
3 7
4 7
5 7
6 7
7 2
8 2
9 1
Name: v3, dtype: int64
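Note that the sum equals the number of rows per group only because every value in v3 is 1 in the sample data. If that is not guaranteed, counting the rows directly may be safer. A minimal sketch for Python 3, assuming grouping by the first column only (as the question asks) and using a small made-up frame whose v3 values are deliberately not 1:

```python
import pandas as pd

# Small stand-in for the question's data: only the tag column (v0) and the
# count column (v3) are kept; the starting values of v3 are deliberately not 1.
df = pd.DataFrame({
    'v0': ['Tag19184', 'Tag19184', 'Tag19184', 'Tag20198', 'Tag20198', 'Tag45093'],
    'v3': [5, 0, 2, 9, 9, 4],
})

# transform('size') returns, for every row, the number of rows in its group,
# regardless of what v3 held before.
df['v3'] = df.groupby('v0')['v0'].transform('size')
print(df)
```

Because transform returns a result aligned with the original index, it can be assigned straight back to the column without any merge step.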
In [46]:
#all it takes is just one line
df['v3']=df.groupby(['v0','v1','v2'])['v3'].transform('sum')
print df
v0 v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
0 Tag19184 CTAAC hffef 7 a 36 - chr1 10006 0 36M 36
1 Tag19184 CTAAC hffef 7 a 36 - chr1 10012 0 36M 36
2 Tag19184 CTAAC hffef 7 a 36 - chr1 10018 0 36M 36
3 Tag19184 CTAAC hffef 7 a 36 - chr1 10024 0 36M 36
4 Tag19184 CTAAC hffef 7 a 36 - chr1 10030 0 36M 36
5 Tag19184 CTAAC hffef 7 a 36 - chr1 10036 0 36M 36
6 Tag19184 CTAAC hffef 7 a 36 - chr1 10042 0 36M 36
7 Tag20198 CTAAC hffef 2 a 36 - chr1 10048 0 36M 36
8 Tag20198 CTAAC hffef 2 a 36 - chr1 10054 0 36M 36
9 Tag45093 CTAAC hffef 1 a 36 - chr1 10060 0 36M 36
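If pandas is not available, the same per-row count can probably be done with the standard library alone. This is only a sketch for Python 3, not part of the original answer: it assumes the file fits in memory, counts by column 0 only, and uses an in-memory buffer in place of in_test.txt and out_test.txt:

```python
import csv
import io
from collections import Counter

# A few sample rows from the question; in practice open 'in_test.txt' instead.
sample = (
    "Tag19184\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10006\t0\t36M\t36\n"
    "Tag19184\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10012\t0\t36M\t36\n"
    "Tag20198\tCTAAC\thffef\t1\ta\t36\t-\tchr1\t10048\t0\t36M\t36\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# First pass: count how many rows share each value of column 0.
counts = Counter(row[0] for row in rows)

# Second pass: overwrite column 3 with the count for that row's tag.
for row in rows:
    row[3] = str(counts[row[0]])

# Write the rows back out as tab-delimited text (to a buffer here).
out = io.StringIO()
csv.writer(out, delimiter='\t', lineterminator='\n').writerows(rows)
print(out.getvalue(), end='')
```

Two passes are needed because the count for a tag is not known until all of its rows have been seen.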