Reshaping from wide to long format

Problem description

I am trying to use unix to transform a tab delimited file from a short/wide format to long format, in a similar way as the reshape function in R. I hope to create three rows for each row in the starting file. Column 4 currently contains 3 values separated by commas. I hope to keep columns 1, 2, and 3 the same for each starting row, but have column 4 be one of the values from the initial column 4. This example probably makes it more clear than I can describe verbally:

current file:  
A1  A2  A3  A4,A5,A6  
B1  B2  B3  B4,B5,B6  
C1  C2  C3  C4,C5,C6  

goal:  
A1  A2  A3  A4  
A1  A2  A3  A5  
A1  A2  A3  A6  
B1  B2  B3  B4  
B1  B2  B3  B5  
B1  B2  B3  B6  
C1  C2  C3  C4  
C1  C2  C3  C5  
C1  C2  C3  C6  

As someone just becoming familiar with this language, my initial thought was to use sed to find the commas and replace them with a hard return:

sed 's/,/&\n/' data.frame

I am really not sure how to include the values for columns 1-3. I had low hopes of this working, but the only thing I could think of was to try inserting the column values with {print $1, $2, $3}.

sed 's/,/&\n{print $1, $2, $3}/' data.frame

Not to my surprise, the output looked like this:

A1  A2  A3  A4  
{print $1, $2, $3}  A5  
{print $1, $2, $3}  A6  
B1  B2  B3  B4  
{print $1, $2, $3}  B5  
{print $1, $2, $3}  B6  
C1  C2  C3  C4  
{print $1, $2, $3}  C5  
{print $1, $2, $3}  C6  

It seems like one approach might be to store the values of columns 1-3 and then insert them. I am not really sure how to store the values. I think it may involve adapting the following script, but I am having a hard time understanding all of its components.

NR==FNR{a[$1, $2, $3]=1}

Thanks in advance for your thoughts on this.

Recommended answer

You can write a simple read loop for this and use parameter expansion (the ${c1//,/ } substitution, which replaces each comma with a space) to split the comma-delimited field:

#!/bin/bash

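while read -r f1 f2 f3 c1; do
  # split the comma-delimited field 'c1' into its constituents
  for c in ${c1//,/ }; do
     # pass the values as arguments so a literal % or \ in the data is not read as format syntax
     printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$c"
  done
done < input.txt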

Output:

A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
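
Since the question mentions awk-style syntax ({print $1, $2, $3}), the same reshaping can probably also be done in awk alone. The following is only a sketch under a few assumptions: the input is strictly tab-delimited, the comma-separated list is always in column 4, and the output should stay tab-delimited. The function name wide_to_long is hypothetical and chosen here for illustration; it is not part of the original answer.

```shell
#!/bin/bash

# Hypothetical helper (name chosen for illustration only).
# Reads tab-delimited rows from the given files, or from stdin if none are given,
# and prints one row per comma-separated value found in column 4.
wide_to_long() {
  awk 'BEGIN { FS = OFS = "\t" }        # assumption: tabs on input and output
       {
         n = split($4, vals, ",")       # break column 4 apart on commas
         for (i = 1; i <= n; i++)
           print $1, $2, $3, vals[i]    # repeat columns 1-3 for each value
       }' "$@"
}

# Example call on a single row piped through stdin
printf 'A1\tA2\tA3\tA4,A5,A6\n' | wide_to_long
```

Unlike the read loop, this version keeps tabs as the output separator and does not depend on a fixed count of three values per row; a row whose fourth column holds a single value is passed through as one row.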
