Matching multiple regexes on multiple lines passed as awk args


Question

I'm trying to iterate through a large directory and run different regexes against each file to pull out the following data:

  1. File name
  2. Pattern matched
  3. Line(s) the match was found on
  4. Number of occurrences

Thanks to @anubhava I was able to get a script that searches for one regex across multiple lines and returns the data I need.

I've since tried to adapt (and butchered) the script to match more than one regex in the file and return the data for all of them. I could potentially be looking for up to 8 regex patterns in one file. For now I'm trying to get it to work with the regexes hardcoded in the script, but eventually I would like to pass the regex patterns in as args to the script and run the match command against each pattern.

This is the awk script at present, but it throws the following error:

fatal: match: third argument is not an array

Script:

#!/usr/bin/awk -f

BEGIN {print ARGV[1], "(Filename)"}
{
    RS = "\r?\n" 
    filemsg= "new File() Found on line "
    fismmsg= "FileInputStream Found on line "
   while(match($0, /new[[:blank:]]+File\(/, /FileInputStream/)) {
      nf = match($0, /new[[:blank:]]+File\(/)
      fis = match($0, /FileInputStream/)
      if (nf != ""){
        print filemsg NR
        ++n
      }
      else if (fis != "") {
        print fismmsg NR
        ++m
      }
      $0 = substr($0, RSTART+RLENGTH)
   }
}
/new[[:blank:]]*$/ {
   p = NR
   next
}
/FileInputStream/ {
  l = NR
  next
}
p && NF {
   if (/^[[:blank:]]*File\(/) {
      print filemsg p, "&", NR
      ++n
   }
   p = 0
}
l && NF {
   if (/FileInputStream/) {
      print fismmsg p, "&", NR
      ++m
    }
}
END {
   if (n > 0) {
     print n, "(number of occurrences of new File() pattern)\n"
   }
   else if (m > 0) {
     print m, "(number of occurrences of FileInputStream pattern)\n"
   }
   else {
     print "No occurrences of new File() or FileInputStream\n"
   }
}

I've no doubt I'm doing something really dumb.

Ideally I would pass each regex in as a variable and iterate over ARGV to use them where the hardcoded values currently are. That also raises the question of how to split such an arg so it can be matched across multiple lines, since we add the likes of ^[[:blank:]] to check for blank space on a line before the rest of the pattern.

Update

Sample input:

awk -v regex1="new[[:blank:]]+File\(" -v regex2="FileInputStream" -v regex3="org\\.apache\\.commons\\.net\\.ftp\\." -f parameterisedRegexAWKScript.awk "$file" >> "output.txt"

Sample output:

./modules/configuration/config/rules/somerule.gr (Filename)
No occurrences of new File() 

./modules/configuration/upgrade/contact/somecontact.gs (Filename)

No occurrences of new File() 

./modules/configuration/entity/someentity.gsx (Filename)
No occurrences of new File() 

./modules/configuration/FTP/newFileTest.txt (Filename)
new File() Found on line 15
new File() Found on line 18
new File() Found on line 28
new File() Found on line 37
new File() Found on line 53
5 (number of occurrences of new File() pattern)

./modules/configuration/FTP/test.txt (Filename)
new File() Found on line 3
new File() Found on line 4 & 8
new File() Found on line 10
new File() Found on line 10
4 (number of occurrences of new File() pattern)

./modules/configuration/personaldata/someperson.gs (Filename)
No occurrences of new File() 

./modules/configuration/processes/someprocess.gs (Filename)
No occurrences of new File() 

./originalAwkScript.txt (Filename)
new File() Found on line 6
new File() Found on line 29
new File() Found on line 32
3 (number of occurrences of new File() pattern)

Update 2

Contents of test.txt:

new
File()
new File()
new



File()
File() new
new File() test new File(Test)
FileInputStream

Contents of another sample file in the same folder:

    protected function buildDocumentsPath(documentRootDir : String, documentTmpDir : String) {
    if (DocumentsPathParameter.HasContent) {
      DemoDocumentsPath = getAbsolutePath(DocumentsPathParameter, documentRootDir)
      if (!new test 
      File(DemoDocumentsPath).equals(new File(DocumentsPathParameter))) {
          Logger.DOCUMENT.warn((typeof this).RelativeName)
          DocumentsPath = getAbsolutePath(DocumentsPathParameter, documentTmpDir)
          var file = new File(DocumentsPath)
          if (!file.exists() && file.isDirectory()) {
              file.mkdirs()
          }
      } 
    }

  }

But the input files could be any Java class; there is nothing special about them.

Summary of the requirement: essentially, I'm trying to parse a large directory using a bash command that runs an awk script to search for different regexes. Those regexes can match across multiple lines in the classes, and I need to capture all the data listed at the top of the question. I have different categories of searches. For example, in FTP I'm looking for occurrences of 'new File(', 'FileInputStream', 'org.apache.commons.net.ftp' and 'java.nio.file', so there is a regex for each, but there are other categories such as print (which has a different regex), etc. Ideally, I want to be able to pass whichever regexes I'm searching for into the awk script as parameters and store the retrieved data in a file.

Recommended answer

The error message "match: third argument is not an array" means that you are calling the match() function with three arguments, and that the third one is not an array as expected.

This is the only call to match() with three arguments:

match($0, /new[[:blank:]]+File\(/, /FileInputStream/)

Judging by the lines that follow, you want to match either of the two regexes. Your line should then be:

match($0, /new[[:blank:]]+File\(|FileInputStream/)
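
As a rough sketch of how the corrected call might sit inside a loop, the two shell functions below wrap a small awk program that reports every match on each input line together with a final count. The function names scan_fixed and scan_with, the SCAN_RE environment variable and the output wording are all made up for illustration and are not part of the original script. The second function assumes the regex is handed over through the environment rather than with -v, since -v applies backslash-escape processing to the value before awk uses it as a regex.

```shell
# Hypothetical helpers for illustration; names and output format are assumptions.

# scan_fixed: reads text on stdin, reports each match of either hard-coded pattern.
scan_fixed() {
  awk '
  {
    line = $0                                  # work on a copy of the record
    # One regex with alternation (|) replaces the invalid three-argument call.
    while (match(line, /new[[:blank:]]+File\(|FileInputStream/)) {
      print "line " NR ": " substr(line, RSTART, RLENGTH)
      n++
      line = substr(line, RSTART + RLENGTH)    # resume after this match
    }
  }
  END { print n + 0, "occurrence(s)" }
  '
}

# scan_with: same loop, but the regex is taken from the first argument.
# It travels through the environment (SCAN_RE is an assumed name) so that
# backslashes such as \( reach the regex engine unchanged.
scan_with() {
  SCAN_RE="$1" awk '
  BEGIN { re = ENVIRON["SCAN_RE"] }
  {
    line = $0
    # RLENGTH > 0 guards against looping forever on an empty match.
    while (match(line, re) && RLENGTH > 0) {
      print "line " NR ": " substr(line, RSTART, RLENGTH)
      n++
      line = substr(line, RSTART + RLENGTH)
    }
  }
  END { print n + 0, "occurrence(s)" }
  '
}
```

A call could look like scan_with 'new[[:blank:]]+File\(|FileInputStream' < SomeClass.java, where the file name is only a placeholder. Both functions look at one line at a time, so a pattern split across lines (new at the end of one line, File( on a later one) still needs the separate p/NR bookkeeping from the original script.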
