数据帧处理 [英] Dataframe processing

查看:152
本文介绍了数据帧处理的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框,我通过匹配< - read.table(Match.txt,sep =,fill = T,stringsAsFactors = FALSE,quote =标题= F),如下所示:

I have a dataframe, which I read by Match <- read.table("Match.txt", sep="", fill =T, stringsAsFactors = FALSE, quote = "", header = F) and looks like this:

> ab
           V1       V2  V3                       V4 V5    V6 V7    V8 V9               V10
1  Inspecting sequence  ID chr1:173244300-173244500       NA       NA                     
2   V$ATF3_Q6        |  19                      (-)  | 0.877  | 0.622  |    aagtccCATCAggg
3   V$ATF3_Q6        |  34                      (-)  | 0.788  | 0.655  |    agggaaCGACAcag
4   V$ATF3_Q6        | 102                      (+)  | 0.738  | 0.685  |    cccTGAGCttagga
5  V$CEBPB_01        |  24                      (+)  | 0.950  | 0.882  |    ccatcagGGAAGgg
72   V$YY1_01        | 117                      (+)  | 0.996  | 0.984  | acttCCCATcttttaag
73 Inspecting sequence  ID chr1:173244350-173244550       NA       NA                     
74  V$ATF3_Q6        |  52                      (+)  | 0.738  | 0.685  |    cccTGAGCttagga
75  V$ATF3_Q6        | 160                      (+)  | 0.862  | 0.687  |    gtcTGACCtggaga
76 V$CEBPB_01        |  57                      (+)  | 0.966  | 0.958  |    agcttagGAAACtt

它包含百万次这样的重复,第一行是:检查顺序ID chr1:173244300-173244500 然后一些值可以在上面看到。我想处理它,记住以下事项:

It contains million of such repetition, where first line is: Inspecting sequence ID chr1:173244300-173244500 and then some value as can be seen above. I want to process it keeping following things in mind:


  1. 提取第一行,将其分解为 - 所以我会得到三列,如: chr1 173244300 173244500

  2. 第4列应包含V1 $ Row2第1个元素,分为 $ _ 采取第二个索引,这将是 ATF3 ,像这样我有30个确定(让我们称之为名称)的情况下,有些将被观察,而其他的不在每种情况下(1个案例从行1到行72,第73行的第二个开始)。

  3. 如果该名称出现在1个案例中,那么值 B 将会分配给该列,如果不是值 U 将被分配

  1. Extract the first line, break it on : and - so I will get three columns like: chr1 173244300 173244500
  2. The 4th column should contain the V1$Row2 1st element, splitted on $ and _ and just take the 2nd index which will be ATF3, like this I have 30 definite (lets call them names) cases, some will be observed while others not in each case (1 case is from Row 1 to row 72, second start from row 73).
  3. If that name appears in 1 case then value B will be assigned to that column, if not value U will be assigned

所以基于在我的输入上,我想得到以下输出:

So based on my input, I want to get the following output:

chr     start       stop        ATF3  CEBPB  YY1    ..(All which appear e.g from row 1 to 72, ignoring duplicates)
chr1    173244300   173244500   B     B      B  
chr1    173244350   173244550   B     B      U

我需要在标题中修改no.of列我知道他们是32个这样的名字),所以如果它们出现在一个案例中 B 将被分配,否则 U 将会分配。

I want a fix no.of column in the header (I know they are 32 such names) so if they appear in one case B will be assigned, otherwise U will be assigned.

如果有人可以帮助我这样做,这将是一个很大的帮助。

If anybody can help me in doing this, it will be a great help.

这里是此示例数据框的dput:

Here is the dput of this sample dataframe:

> ab <- dput(Match[c(1:5,72:76), ])
structure(list(V1 = c("Inspecting", "V$ATF3_Q6", "V$ATF3_Q6", 
"V$ATF3_Q6", "V$CEBPB_01", "V$YY1_01", "Inspecting", "V$ATF3_Q6", 
"V$ATF3_Q6", "V$CEBPB_01"), V2 = c("sequence", "|", "|", "|", 
"|", "|", "sequence", "|", "|", "|"), V3 = c("ID", "19", "34", 
"102", "24", "117", "ID", "52", "160", "57"), V4 = c("chr1:173244300-173244500", 
"(-)", "(-)", "(+)", "(+)", "(+)", "chr1:173244350-173244550", 
"(+)", "(+)", "(+)"), V5 = c("", "|", "|", "|", "|", "|", "", 
"|", "|", "|"), V6 = c(NA, 0.877, 0.788, 0.738, 0.95, 0.996, 
NA, 0.738, 0.862, 0.966), V7 = c("", "|", "|", "|", "|", "|", 
"", "|", "|", "|"), V8 = c(NA, 0.622, 0.655, 0.685, 0.882, 0.984, 
NA, 0.685, 0.687, 0.958), V9 = c("", "|", "|", "|", "|", "|", 
"", "|", "|", "|"), V10 = c("", "aagtccCATCAggg", "agggaaCGACAcag", 
"cccTGAGCttagga", "ccatcagGGAAGgg", "acttCCCATcttttaag", "", 
"cccTGAGCttagga", "gtcTGACCtggaga", "agcttagGAAACtt")), .Names = c("V1", 
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10"), row.names = c(1L, 
2L, 3L, 4L, 5L, 72L, 73L, 74L, 75L, 76L), class = "data.frame")


推荐答案

给您的输入文件此问题 /c/tmp.txt

这个awk脚本保存为 SO-38563400.awk

And this awk script saved as SO-38563400.awk:

BEGIN {
 OFS="\t" # Set the output separator
 i=0 # Just to init the counter and be sure to start at 1 later
}
 {
 #print $0
 }
/Inspecting sequence ID/ { # Changing sequence, initialize new entry with start and end
  split($4,arr,"[:-]") # split the string in fields, split on : and -
  seq[i++,"chr"]=arr[1] # Save the chr part and increase the sequence beforehand
  seq[i,"start"]=arr[2] # save the start date
  seq[i,"end"]=arr[3] # Save the end date
}

/V[$][^_]+_.*/ { # V line type,
  split($1,arr,"[$_]") # Split on $ and underscore
  seq[i,arr[2]]="B" # This has been seen, setting to B
  seq[i,"print"]=1
  names[arr[2]]++ # Save the name for output
  # (and count occurences, just for fun, well mainly because an int is cheaper to store)
  # Main reason is it allow a quicker access toa rray keys ant END block
}

END {
  head=sprintf("char%sstart%sstop",OFS,OFS,OFS)
  for (h in names) {
    head=sprintf("%s%s%s",head,OFS,h)
  }
  print(head)
  for (l=1; l<i; l++) { # loop over each line/sequence
    line=sprintf("%s%s%s%s%s",seq[l,"chr"],OFS,seq[l,"start"],OFS,seq[l,"end"])
    for (h in names) {
      if (seq[l,h]=="B") line=sprintf("%s%s%s",line,OFS,"B")
      else line=sprintf("%s%s%s",line,OFS,"U")
    }
    if (seq[l,"print"]) print line
  }
}

传递这个命令:

awk -f SO-38563400.awk /c/tmp.txt > /c/Rtable.txt

提供:

$ cat /c/Rtable.txt
char    start   stop    STAT3   ATF3    TEAD4   GATA3   JUND    HNF4A   FOXA2   MAX     CEBPB   SPI1    GABPA   CMYC    P300    E2F1    CTCF    ATF2
chr22   16049850        16050050        B       B       U       B       U       B       B       U       U       U       U       U       B       B       U       B
chr22   16049900        16050100        B       B       B       B       B       B       B       B       B       B       B       B       B       B       B       B

然后阅读r:

> x <- read.table("/c/Rtable.txt", sep="\t",  stringsAsFactors = FALSE, header=T)
> x
char    start     stop STAT3 ATF3 TEAD4 GATA3 JUND HNF4A FOXA2 MAX CEBPB SPI1 GABPA CMYC P300 E2F1 CTCF ATF2
1 chr22 16049850 16050050     B    B     U     B    U     B     B   U     U    U     U    U    B    B    U    B
2 chr22 16049900 16050100     B    B     B     B    B     B     B   B     B    B     B    B    B    B    B    B

请忽略使用 / c / 路径的设置,这可能在Windows或linux,在Windows下有一个 awk 的端口,我建议使用linux作为大文件,这是由于文件流中的操作系统容量。

Please disregard the setup with /c/ paths, this could work on windows or linux, there's port of awk under windows, I suggest using linux for large files due to the operating system capacities on file streaming.

我们可以在打印结果之前不读取整个文件来节省更多的内存,但这需要一套固定的名称,但是你们已经太懒了,只能自己提取名字,只是给我发了一堆条目,练习留给您适应,将列表放在BEGIN块中,将其用作每个seq的条目,并在每个新的seq上打印以前的结果在处理之前。

We can save far more memory by not reading the whole file before printing results, but this need a fixed set of "names" but you've been too lazy to extract the names by yourself and just sent me a bunch of entries, exercice is left to you to adapt, make the list in BEGIN block, use it as entries for each seq, and on each new seq print the previous result before processing.

我希望下次你要花一些时间来烘烤一个正确的问题,你会明白你必须为别人做一些努力来帮助你特别是在一些意见提示之后,要求您改进问题。

I hope next time you'll take some time to bake a proper question and that you'll understand you have to make some efforts for others to help you, specially after a flow of comments asking you to improve your question.

这篇关于数据帧处理的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆