Matching text using grep or awk


Question

I am having problems with grep and awk. I think it's because my input file contains text that looks like code.

The input file contains ID names and looks like this:

SNORD115-40
MIR432
RNU6-2

The reference file looks like this:

Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2

I want to match the ID names from my source file against my reference file and print the corresponding ENSG IDs, so that the output file looks like this:

ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

I tried this loop:

exec < source.file
while read line
do
grep -w $line reference.file > outputfile
done

I've also tried playing around with the reference file using awk:

awk 'NF == 2 {print $0}' reference.file
awk 'NF > 2 {print $0}' reference.file

but I only get one of the grep'd IDs.

Any suggestions or easier ways of doing this would be great.

Recommended answer

$ fgrep -f source.file reference.file 
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2

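fgrep is equivalent to grep -F (recent GNU grep releases warn that fgrep is obsolescent, so grep -F is the safer spelling):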

   -F, --fixed-strings
          Interpret  PATTERN  as  a  list  of  fixed strings, separated by
          newlines, any of which is to be matched.  (-F  is  specified  by
          POSIX.)

The -f option is for taking PATTERN from a file:

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.   The  empty file
          contains zero patterns, and therefore matches nothing.   (-f  is
          specified by POSIX.)
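
In effect, -f replaces the read loop from the question. That loop produced a single ID because the > outputfile redirection sits inside the loop and truncates the file on every iteration, so only the last match survives. A minimal sketch of a corrected loop follows, using the sample data from the question; the scratch directory and the result variable are introduced here only for illustration:

```shell
# Recreate the question's sample files in a scratch directory (illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > "$tmpdir/source.file"
printf '%s\n' \
  'Ensembl Gene ID HGNC symbol' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000266661' \
  'ENSG00000243133' \
  'ENSG00000207447 RNU6-2' > "$tmpdir/reference.file"

# Redirect once, after `done`, so all iterations write into the same file
# instead of truncating it each time. -F treats the ID as a fixed string,
# and `--` guards against IDs that begin with a dash.
while IFS= read -r line; do
  grep -Fw -- "$line" "$tmpdir/reference.file"
done < "$tmpdir/source.file" > "$tmpdir/outputfile"

result=$(cat "$tmpdir/outputfile")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The loop starts one grep process per ID, so for long ID lists the single -f invocation is usually the simpler and faster choice.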

As noted in the comments, this can produce false positives if an ID in reference.file contains an ID from source.file as a substring. You can construct a more specific pattern for grep on the fly with sed:

grep -f <( sed 's/.*/ &$/' source.file) reference.file

But this way the patterns are interpreted as regular expressions rather than fixed strings, so any regex metacharacter in an ID (a dot, for example) would act as an operator instead of a literal character; this is probably fine if the IDs contain only alphanumeric characters. The better way, though (thanks to @sidharthcnadhan), is to use the -w option:

   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.  Similarly, it must be either at the  end
          of  the  line  or  followed by a non-word constituent character.
          Word-constituent  characters  are  letters,  digits,   and   the
          underscore.
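
To make the difference concrete, here is a small sketch with made-up data. The MIR43 entry and the ENSG0000000000x accessions are hypothetical, chosen only so that one ID is a prefix of another:

```shell
# Hypothetical data: MIR43 is a prefix of MIR432.
tmpdir=$(mktemp -d)
printf '%s\n' 'MIR43' > "$tmpdir/source.file"
printf '%s\n' \
  'ENSG00000000001 MIR43' \
  'ENSG00000000002 MIR432' > "$tmpdir/reference.file"

# Fixed-string match only: MIR43 is also a substring of MIR432,
# so both reference lines are selected.
loose=$(grep -Ff "$tmpdir/source.file" "$tmpdir/reference.file")

# With -w the match must be bounded by non-word characters or the
# line edges, so only the exact ID is selected.
strict=$(grep -Fwf "$tmpdir/source.file" "$tmpdir/reference.file")

printf 'without -w:\n%s\n\nwith -w:\n%s\n' "$loose" "$strict"
rm -rf "$tmpdir"
```

One caveat: a hyphen is not a word-constituent character, so with -w an ID such as MIR432 would still match a longer hyphenated symbol like MIR432-X (a made-up name). If the reference file contains such symbols, comparing the whole field is the stricter test.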

So the final answer to your question is:

grep -Fwf source.file reference.file
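
Since the question also asks about awk, a rough awk equivalent is sketched here. It is not part of the original answer. It assumes the columns are separated by whitespace and that the symbol is always the second field, and it compares that field exactly instead of relying on word boundaries; the scratch directory and the matched variable are illustrative only:

```shell
# Recreate the question's sample files in a scratch directory (illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > "$tmpdir/source.file"
printf '%s\n' \
  'Ensembl Gene ID HGNC symbol' \
  'ENSG00000199537 SNORD115-40' \
  'ENSG00000207793 MIR432' \
  'ENSG00000266661' \
  'ENSG00000243133' \
  'ENSG00000207447 RNU6-2' > "$tmpdir/reference.file"

# While reading the first file (NR == FNR), store each ID as an array key.
# While reading the second file, print lines whose second field is a stored key.
matched=$(awk 'NR == FNR { ids[$1]; next } $2 in ids' \
  "$tmpdir/source.file" "$tmpdir/reference.file")

printf '%s\n' "$matched"
rm -rf "$tmpdir"
```

Lines with only an Ensembl ID have an empty second field and are skipped, as is the header line, because neither value appears in the ID list.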
