How to make the results of a list of counted values become one-hot-like features of a data frame

Problem description

I have the following data frame:

v1        v2       v3
+         S10      tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt
+        AMPC      tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa
+        AROH      gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg

I apply a transformation to v3 that counts the occurrences of each pair of adjacent letters in every string, like this:

lapply(df$v3, function(x) oligonucleotideFrequency(DNAString(x), width = 2))

This is the output of the transformation for the first string in v3:

AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4 
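
To make clear where these numbers come from, here is a minimal base-R sketch of the same kind of count. The helper name `count_pairs` is hypothetical and is not part of Biostrings; it assumes the input contains only the letters a, c, g and t. It slides a two-letter window along the string one position at a time, so a string of length n yields n - 1 overlapping pairs.

```r
# Hypothetical base-R stand-in for the dinucleotide count (not a Biostrings function).
count_pairs <- function(s) {
  s <- toupper(s)
  n <- nchar(s)
  # Overlapping two-letter windows: positions 1-2, 2-3, ..., (n-1)-n
  pairs <- substring(s, 1:(n - 1), 2:n)
  bases <- c("A", "C", "G", "T")
  # All 16 possible pairs in the order AA, AC, AG, AT, CA, ...
  lv <- paste0(rep(bases, each = 4), rep(bases, times = 4))
  # Using a factor keeps pairs that never occur, with a count of 0
  counts <- table(factor(pairs, levels = lv))
  setNames(as.integer(counts), lv)
}

v3_first <- "tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt"
counts <- count_pairs(v3_first)
print(counts)
```

The first string has 57 letters, so the 16 counts add up to 56, which agrees with the output shown above.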

Now I have the counts for every pair of letters in each string of v3, but each set of counts is a separate result, so I do not get a single combined view. What I would like to do is make each pair of letters a feature (column) of the data frame, where the value of each feature is the number of occurrences of that pair in the corresponding string.

It should look like this:

v1        v2     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT 
+         S10     3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4                        
+        AMPC     3  4  1  4  5  2  4  4  2  4  1  5  3  5  6  3 
+        AROH     2  4  4  4  3  3  2  4  2  4  1  3  7  1  3  9

How can I get this result?

Thanks in advance.

Recommended answer

library(Biostrings)

dat <- read.table(text = "v1        v2       v3
'+'         'S10'      'tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt'
'+'        'AMPC'      'tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa'
'+'        'AROH'      'gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg'",
stringsAsFactors = FALSE, header = TRUE)

# Count the dinucleotide (two-letter) frequencies for each sequence
lst1 <- lapply(dat$v3, function(x) oligonucleotideFrequency(DNAString(x), width = 2))
# Transpose each named count vector into a one-row data frame
lst2 <- lapply(lst1, function(x) as.data.frame(t(x)))
# Combine the one-row data frames into a single data frame, row-wise
dat2 <- do.call(rbind, lst2)
# Bind the counts to the original data frame, column-wise
dat3 <- cbind(dat, dat2)
# Remove the v3 column
dat3$v3 <- NULL
dat3
#   v1   v2 AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
# 1  +  S10  3  2  2  4  1  0  6  3  0  6  4  7  7  2  5  4
# 2  + AMPC  3  4  1  4  5  2  4  4  2  4  1  5  3  5  6  3
# 3  + AROH  2  4  4  4  3  3  2  4  2  4  1  3  7  1  3  9
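
As a possible variation on the steps above, the per-string data frames can be skipped: `rbind` on named vectors of equal length already produces a matrix with one row per string. The sketch below is an illustration under assumptions, not the answer's own code. It uses a hypothetical base-R helper, `count_pairs`, in place of `oligonucleotideFrequency` so that it runs without Biostrings. With Biostrings itself, passing a `DNAStringSet` built from `dat$v3` to `oligonucleotideFrequency` should, as far as I know, return such a matrix directly, but check the package documentation before relying on that.

```r
# Hypothetical base-R stand-in for oligonucleotideFrequency(..., width = 2);
# assumes the strings contain only a, c, g and t.
count_pairs <- function(s) {
  s <- toupper(s)
  n <- nchar(s)
  pairs <- substring(s, 1:(n - 1), 2:n)
  bases <- c("A", "C", "G", "T")
  lv <- paste0(rep(bases, each = 4), rep(bases, times = 4))
  setNames(as.integer(table(factor(pairs, levels = lv))), lv)
}

dat <- data.frame(
  v1 = c("+", "+", "+"),
  v2 = c("S10", "AMPC", "AROH"),
  v3 = c("tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt",
         "tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa",
         "gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg"),
  stringsAsFactors = FALSE
)

# One named vector per string, stacked row-wise into a 3 x 16 integer matrix
count_mat <- do.call(rbind, lapply(dat$v3, count_pairs))
# Keep v1 and v2 and append the 16 count columns
dat3 <- cbind(dat[c("v1", "v2")], count_mat)
print(dat3)
```

Stacking the vectors into a matrix first avoids creating one small data frame per row, which may matter when the data frame has many sequences.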
