Reshaping from wide to long format
Question
I am trying to use Unix tools to transform a tab-delimited file from a short/wide format to a long format, similar to what the reshape function does in R. I hope to create three rows for each row in the starting file. Column 4 currently contains 3 values separated by commas. I want columns 1, 2, and 3 to stay the same for each starting row, but column 4 should hold just one of the values from the initial column 4. This example probably makes it clearer than I can describe verbally:
current file:
A1 A2 A3 A4,A5,A6
B1 B2 B3 B4,B5,B6
C1 C2 C3 C4,C5,C6
goal:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
As someone just becoming familiar with this language, my initial thought was to use sed to find the commas and replace them with a hard return:
sed 's/,/&\n/' data.frame
I am really not sure how to include the values for columns 1-3. I had low hopes of this working, but the only thing I could think of was to try inserting the column values with {print $1, $2, $3}:
sed 's/,/&\n{print $1, $2, $3}/' data.frame
Unsurprisingly, the output looked like this:
A1 A2 A3 A4
{print $1, $2, $3} A5
{print $1, $2, $3} A6
B1 B2 B3 B4
{print $1, $2, $3} B5
{print $1, $2, $3} B6
C1 C2 C3 C4
{print $1, $2, $3} C5
{print $1, $2, $3} C6
It seems like one approach might be to store the values of columns 1-3 and then insert them. I am not really sure how to store the values. I think it may involve adapting the following script, but I am having a hard time understanding all of its components:
NR==FNR{a[$1, $2, $3]=1}
Thanks in advance for your thoughts on this.
Recommended answer
You can write a simple read loop for this and use parameter expansion (replacing each comma with a space, then relying on word splitting) to parse the comma-delimited field:
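#!/bin/bash
# read the first three columns into f1..f3 and the rest of the line into c1
while read -r f1 f2 f3 c1; do
    # replace each comma in 'c1' with a space; word splitting then separates the values
    for c in ${c1//,/ }; do
        # pass the data as arguments so a '%' or backslash in it is not read as format syntax
        printf '%s %s %s %s\n' "$f1" "$f2" "$f3" "$c"
    done
done < input.txt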
Output:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
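Since the question leans toward awk-style syntax ({print $1, $2, $3}), the same reshaping can also be sketched in awk. This is only an alternative sketch, not part of the original answer: the function name wide_to_long is hypothetical, and it assumes tab-delimited input on stdin with exactly four columns, the fourth being comma-separated. Unlike the loop above, it writes tab-separated output.

```shell
#!/bin/bash
# Hypothetical helper (name chosen for illustration): expand a comma-separated
# 4th column into one output row per value.
# Assumption: tab-delimited input on stdin with exactly four columns.
wide_to_long() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # split() stores the pieces in parts[1..n] and returns n
        n = split($4, parts, ",")
        # repeat columns 1-3 once for each value found in column 4
        for (i = 1; i <= n; i++)
            print $1, $2, $3, parts[i]
    }'
}

# Example: feed one row of the sample data through the helper
printf 'A1\tA2\tA3\tA4,A5,A6\n' | wide_to_long
```

For a whole file, the call would be wide_to_long < input.txt. A row whose fourth column is empty produces no output rows in this sketch, because split() then returns 0.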