Matching text using grep or awk

Problem description

I am having problems with grep and awk. I think it's because my input file contains text that looks like code.

The input file contains ID names and looks like this:

SNORD115-40
MIR432
RNU6-2

The reference file looks like this:

Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2

I want to match the ID names from my source file against my reference file and print the corresponding ENSG IDs, so that the output file looks like this:

ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

I have tried this loop:

exec < source.file
while read line
do
grep -w $line reference.file > outputfile
done

I've also tried playing around with the reference file using awk:

awk 'NF == 2 {print $0}' reference.file
awk 'NF > 2 {print $0}' reference.file

but I only get one of the grep'd IDs.

Any suggestions or easier ways of doing this would be great.

Solution

$ fgrep -f source.file reference.file 
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

fgrep is equivalent to grep -F:

   -F, --fixed-strings
          Interpret  PATTERN  as  a  list  of  fixed strings, separated by
          newlines, any of which is to be matched.  (-F  is  specified  by
          POSIX.)

The -f option is for taking PATTERN from a file:

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
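
To try this without touching real data, the following sketch recreates the two files from the question in a temporary directory and runs the same search. The scratch directory and the `matches` variable are illustrative and not part of the original answer.

```shell
#!/bin/sh
# Recreate the question's sample files in a scratch directory (illustrative location).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file

cat > reference.file <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
EOF

# -F: treat each pattern as a fixed string; -f: read the patterns from source.file.
matches=$(grep -F -f source.file reference.file)
printf '%s\n' "$matches"
```

`grep -F` is used here instead of `fgrep` because recent GNU grep releases warn that `fgrep` is obsolescent.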

As noted in the comments, this can produce false positives if an ID in reference.file contains an ID in source.file as a substring. You can construct a more definitive pattern for grep on the fly with sed:

grep -f <(sed 's/.*/ &$/' source.file) reference.file

But this way the patterns are interpreted as regular expressions rather than fixed strings, which is potentially fragile (although it may be fine if the IDs contain only alphanumeric characters). The better way, though (thanks to @sidharthcnadhan), is to use the -w option:

   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.
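
As a small illustration of what -w changes, the sketch below adds a made-up reference line whose symbol, MIR4321, merely starts with a requested ID. The ENSG identifier on that line and the variable names `loose` and `strict` are invented for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'MIR432' > source.file

# The second line is invented to provoke a substring false positive.
cat > reference.file <<'EOF'
ENSG00000207793 MIR432
ENSG00000000000 MIR4321
EOF

# Fixed-string substring matching: both lines are selected.
loose=$(grep -F -f source.file reference.file)

# Whole-word matching: only the line with the exact symbol is selected.
strict=$(grep -F -w -f source.file reference.file)

printf 'without -w:\n%s\n\nwith -w:\n%s\n' "$loose" "$strict"
```

One caveat: the hyphen is not a word-constituent character, so even with -w a pattern such as RNU6 would still match a line containing RNU6-2.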

So the final answer to your question is:

grep -Fwf source.file reference.file
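
Since the question also mentions awk, here is one possible alternative, offered as a sketch rather than as part of the original answer. It assumes the HGNC symbol is always the second whitespace-separated field of reference.file and compares that field exactly against the IDs read from source.file, which rules out partial matches, including around hyphens (which -w treats as word boundaries). The loop after it is a corrected version of the attempt in the question: the original kept only one ID because `> outputfile` inside the loop truncated the file on every pass. The variable names `awk_out` and `loop_out` are illustrative.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > source.file
cat > reference.file <<'EOF'
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
EOF

# First file (NR == FNR): remember every ID as an array key.
# Second file: print lines whose second field is one of those keys.
awk_out=$(awk 'NR == FNR { ids[$1]; next } $2 in ids' source.file reference.file)

# Loop variant: quote "$line" and redirect once, after "done",
# so matches from earlier iterations are not overwritten.
while IFS= read -r line
do
    grep -F -w -- "$line" reference.file
done < source.file > outputfile
loop_out=$(cat outputfile)

printf '%s\n' "$awk_out"
```

The awk version reads each file once, whereas the loop starts a new grep process for every ID, so the loop is mainly useful for understanding what went wrong in the original attempt.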
