Remove strings by a specific delimiter


Problem description

I have a file with a few columns, in which the second column uses ":" as a delimiter. I would like to remove the first, third and fourth strings in the second column and keep only the second string. However, the columns themselves are separated by the usual space delimiter, so I do not know how to do this.

input:

--- 22:16050075:A:G 16050075 A G
--- 22:16050115:G:A 16050115 G A
--- 22:16050213:C:T 16050213 C T
--- 22:16050319:C:T 16050319 C T
--- 22:16050527:C:A 16050527 C A

desired output:

--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A

Wrong:
cat df.txt | awk -F: '{print $1, $3, $6, $7, $8}'

--- 22 A
--- 22 G
--- 22 C
--- 22 C
--- 22 C

But I cannot get it right. Can awk or sed do this?

Thanks.

Recommended answer

Just use the POSIX-compatible split() function on $2:

awk '{split($2,temp,":"); $2=temp[2];}1' file
--- 16050075 16050075 A G
--- 16050115 16050115 G A
--- 16050213 16050213 C T
--- 16050319 16050319 C T
--- 16050527 16050527 C A

Split column 2 on the delimiter :, update $2 to the required element (temp[2]), and print the record. Assigning to $2 makes awk rebuild the record from its fields, joined by OFS (a single space by default), and the trailing 1 is an always-true pattern whose default action prints that record.
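
To see the rebuild step in isolation, here is a small sketch that runs the same command on one sample line with the output separator changed. The tab value for OFS and the shell variable names line and out are illustrative assumptions, not part of the original answer.

```shell
# One sample record from the question.
line='--- 22:16050075:A:G 16050075 A G'

# Assigning to $2 makes awk rebuild the record, joining the fields with OFS.
# OFS is set to a tab here only to make the rebuild visible.
out=$(printf '%s\n' "$line" | awk -v OFS='\t' '{split($2, temp, ":"); $2 = temp[2]} 1')

printf '%s\n' "$out"
```

With the default OFS the same command prints the fields separated by single spaces, as in the output shown above.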

I recommend this over setting multiple delimiters in FS, because doing that changes the absolute positions of the individual fields, while split() keeps the positions intact and lets you extract just the required value.
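
For comparison, a sketch of the multiple-delimiter approach. With FS set to the bracket expression [ :], the pieces of column 2 become separate fields and everything after them shifts by three positions, so the column that was $3 is now $6. The field numbers below are worked out for this sample line only, and the variable names are illustrative.

```shell
line='--- 22:16050075:A:G 16050075 A G'

# Splitting on both space and colon yields 8 fields instead of 5:
# $1=--- $2=22 $3=16050075 $4=A $5=G $6=16050075 $7=A $8=G
out=$(printf '%s\n' "$line" | awk -F'[ :]' '{print $1, $2, $3, $6, $7, $8}')

printf '%s\n' "$out"
```

This works for the sample, but every later column has to be renumbered by hand. Unlike the default FS, a regex FS such as [ :] also does not collapse runs of spaces, so extra padding in the input would produce empty fields.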

For your updated requirement to add a new column, just do

awk '{split($2,temp,":"); $2=temp[1] FS temp[2];}1' file
--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A
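
One thing to keep in mind with this approach: the new $2 contains a space, but awk does not re-split the record after a field assignment, so NF still reports 5 even though the printed line has six columns. If later code in the same script needs to address the new columns by number, assigning $0 to itself forces a re-split. A minimal sketch (the names before, line and out are illustrative):

```shell
line='--- 22:16050075:A:G 16050075 A G'

out=$(printf '%s\n' "$line" | awk '{
    split($2, temp, ":")
    $2 = temp[1] FS temp[2]   # $2 is now "22 16050075"; NF is unchanged
    before = NF
    $0 = $0                   # re-split the rebuilt record on FS
    print before, NF, $3
}')

printf '%s\n' "$out"
```

For simply printing the line, as in the answer, the re-split is not needed.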

Alternatively, if you have GNU awk (gawk), you can use its gensub() function for a regex-based extraction (using the POSIX character class [[:digit:]]):

awk '{$2=gensub(/^([[:digit:]]+):([[:digit:]]+).*$/,"\\1 \\2","g",$2);}1' file
--- 22 16050075 16050075 A G
--- 22 16050115 16050115 G A
--- 22 16050213 16050213 C T
--- 22 16050319 16050319 C T
--- 22 16050527 16050527 C A

The gensub(/^([[:digit:]]+):([[:digit:]]+).*$/,"\\1 \\2","g",$2) call matches the first two values in $2 delimited by : with capture groups, which the replacement refers to as \\1 and \\2, and drops the rest of $2. The remaining fields are printed unchanged.
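
Since the question also asks about sed, here is a hedged sketch of an equivalent substitution. It is not part of the original answer. It assumes a sed that supports -E for extended regular expressions (GNU and BSD sed both do), that the first column contains no spaces, and that columns are separated by single spaces as in the sample input.

```shell
line='--- 22:16050075:A:G 16050075 A G'

# \1 = first column plus its trailing space, \2 and \3 = the first two
# colon-separated numbers; the rest of column 2 is matched and dropped.
out=$(printf '%s\n' "$line" | sed -E 's/^([^ ]+ )([0-9]+):([0-9]+):[^ ]*/\1\2 \3/')

printf '%s\n' "$out"
```

To run it on the whole file, pass the file name to sed instead of piping a single line.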
