Split huge file into n files keeping first 7 columns + next 3 columns until column n
Question
I have a huge data frame with the following column names:
A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,...,GT_n,N_n_,E_n
Using unix/bash or python, I want to produce n individual files with the following columns:
A,B,C,D,F,G,H,GT_a,N_a_,E_a
A,B,C,D,F,G,H,GT_b,N_b_,E_b
A,B,C,D,F,G,H,GT_c,N_c_,E_c
....
A,B,C,D,F,G,H,GT_n,N_n_,E_n
The files should be named a.txt, b.txt, c.txt, ..., n.txt.
Recommended answer
Here are a couple of solutions using bash tools.
1. bash
Use cut inside a bash loop. This spawns n processes and parses the file n times.
Update: this covers the case where the ids in the column names are not just a sequence of letters but arbitrary strings, with the same id repeated across each group of 3 columns after the first 7 columns. The ids must first be read from the file's header. A quick way is to use awk to take the id from the first column of each group (columns 8, 11, and so on) and store the results in a bash array.
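#!/bin/bash
first=7   # number of leading columns shared by every output file
# For plain letter ids, they could be hard-coded instead:
#ids=( {a..n} )
# Read the ids from the header: one record per column name (RS=","),
# then print the part after the first "_" for the first column of each group of 3.
ids=( $( head -1 "$1" | awk -F"_" -v RS="," -v f="$first" 'NR>f && (NR-f)%3==1{print $2}' ) )
for i in "${!ids[@]}"; do
    cols="1-$first,$((first+1+3*i)),$((first+2+3*i)),$((first+3+3*i))"
    cut -d, -f"$cols" "$1" > "${ids[i]}.txt"
done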
Usage: bash test.sh file
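To make the column arithmetic in the loop concrete, here is a minimal Python sketch of the same bookkeeping. It assumes the header layout from the question (7 shared columns followed by groups of 3 named GT_<id>, N_<id>_, E_<id>). The function names `extract_ids` and `cut_fields` are hypothetical, chosen only for illustration.

```python
def extract_ids(header_line, first=7, width=3):
    """Return the id taken from the first column name of each group."""
    names = header_line.rstrip("\n").split(",")
    # Column names after the shared block come in groups of `width`;
    # the id is the part after the first underscore, e.g. "GT_a" -> "a".
    return [names[i].split("_")[1] for i in range(first, len(names), width)]


def cut_fields(group_index, first=7, width=3):
    """Build the 1-based field list passed to `cut -f` for one group."""
    start = first + 1 + width * group_index
    group = ",".join(str(start + k) for k in range(width))
    return f"1-{first},{group}"


# Assumed sample header with three groups (a, b, c).
header = "A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c"
ids = extract_ids(header)
specs = [cut_fields(i) for i in range(len(ids))]
for sample_id, spec in zip(ids, specs):
    print(f"{sample_id}.txt <- cut -d, -f{spec}")
```

Each printed line pairs an output file name with the field list that the bash loop would hand to cut for that group.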
2. awk
Alternatively, you can use awk. Here only the number of outputs is configurable; the other settings can be parameterized in the same way as in the first solution.
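# Split on commas; times = number of output files (14 covers a..n)
BEGIN { FS=OFS=","; times=14 }
{
    for (i=1; i<=times; i++) {
        # group i occupies columns 5+3*i .. 7+3*i; i+96 is the ASCII code of a, b, c, ...
        print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) > (sprintf("%c.txt", i+96))
    }
}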
Usage: awk -f test.awk file
This solution should be fast, as it parses the file only once. However, it should not be used as is for a large number of output files: every output file stays open until awk exits, so it could fail with a "too many open files" error. For the range of letters used here, it should be fine.
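Since the question also allows Python, one way to work around the open-file limit is to write the outputs in batches. The sketch below is not the answer's awk method but a different technique: it keeps at most `max_open` files open at a time and re-reads the input once per batch. The function name `split_by_group` and the parameters `out_dir` and `max_open` are assumptions made for illustration, and ids are taken from the text after the first underscore of each group's first column name, as in the bash solution.

```python
import csv
import os
from contextlib import ExitStack


def split_by_group(path, first=7, width=3, max_open=100, out_dir="."):
    """Write one <id>.txt per column group, keeping at most max_open files open."""
    with open(path, newline="") as src:
        header = next(csv.reader(src))
    # One (id, starting column index) pair per group of `width` columns.
    groups = [(header[i].split("_")[1], i)
              for i in range(first, len(header), width)]

    # Each batch costs one pass over the input, so the file is read
    # ceil(len(groups) / max_open) times instead of len(groups) times.
    for b in range(0, len(groups), max_open):
        batch = groups[b:b + max_open]
        with ExitStack() as stack:
            writers = [
                csv.writer(
                    stack.enter_context(
                        open(os.path.join(out_dir, f"{gid}.txt"), "w", newline="")
                    ),
                    lineterminator="\n",
                )
                for gid, _ in batch
            ]
            with open(path, newline="") as src:
                for row in csv.reader(src):
                    for writer, (_, start) in zip(writers, batch):
                        # Shared leading columns plus this group's columns.
                        writer.writerow(row[:first] + row[start:start + width])
    return [gid for gid, _ in groups]
```

A call such as `split_by_group("file", max_open=100)` would write the output files to the current directory. Choosing `max_open` below the system's per-process file limit trades a few extra passes over the input for not hitting that limit.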