How to match multiple regex patterns in multiple files and create a log file?


Problem description

I want to search for some regex patterns in the .txt files inside a folder whose path I've entered in a text box. The folder contains sub-folders with txt files at paths of the form 12345-2031-30201\2031\30201\txt\110.txt. If a pattern matches in even one file, a string is written to a log file created inside that same folder, and the search then moves on to the next regex, and so on.

The problem I'm having is that the log file only shows the first matched pattern, and not whether the other patterns matched in any file. Basically, the program opens, say, the first file, searches for the first pattern, and finds a match, so it writes the text associated with that pattern, "Check figure link", and that's all that ends up in the log file. But the same file (and some other files) do match the second and third patterns, and the texts associated with those patterns, such as "Check table link" and "Check section link", are never written.

Can anyone help?

What I have tried:

Dim patterns = New List(Of String()) From {
    ({"Check figure link", "(?<!>)(figure|fig\.|figs\.|figures) (\d+)"}),
    ({"Check table link", "(?<!>)(table|tab\.|tabs\.|tables) (\d+)"}),
    ({"Check section link", "(?<!>)(section|sec\.|sect\.|section) (\d+)"}),
    ({"Check space", "</inline>w+"}),}

Dim output = From pattern In patterns.AsParallel
             Let regEx = New Regex(pattern(1), RegexOptions.Compiled)
             From tFile In Directory.EnumerateFiles(TextBox1.Text, "*.txt", SearchOption.AllDirectories)
             Where tFile Like "*\#*\#*\#*\txt\#*.txt" AndAlso regEx.IsMatch(File.ReadAllText(tFile))
             Take 1
             Select pattern(0)

File.WriteAllLines(TextBox1.Text.TrimEnd("\"c) & "\Checklist.log", output)
MsgBox("Process Complete")

Solution

So it looks like you want to find which files have errors according to the patterns.
What you have will give you up to 4 lines in the log indicating which type of errors exist somewhere in the file hierarchy.
This seems very uninformative!
Wouldn't it be better to know (at least) which files have errors?
Wouldn't it be better to know which errors are in each of those files?
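
A likely reason the log stops at a single line (this is my reading of the query, not something stated in the thread): the clauses of a VB query are applied in order, so Take 1 cuts off the whole flattened sequence of pattern/file pairs instead of taking one hit per pattern. The sketch below reproduces that with hypothetical in-memory texts standing in for the files; fileTexts, rx and the module name are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module TakeScopeDemo
    Sub Main()
        ' Hypothetical in-memory stand-ins for the .txt files on disk.
        Dim fileTexts = New Dictionary(Of String, String) From {
            {"110.txt", "see figure 1 and table 2"},
            {"111.txt", "see section 3"}}

        Dim patterns = New List(Of String()) From {
            ({"Check figure link", "(?<!>)(?:figures?|figs?\.) \d+"}),
            ({"Check table link", "(?<!>)(?:tables?|tabs?\.) \d+"}),
            ({"Check section link", "(?<!>)(?:sections?|sect?\.) \d+"})}

        ' Take 1 is applied to the flattened pattern/file pairs,
        ' so the whole query yields at most one message.
        Dim flattened = From pattern In patterns
                        Let rx = New Regex(pattern(1))
                        From entry In fileTexts
                        Where rx.IsMatch(entry.Value)
                        Take 1
                        Select pattern(0)

        ' Any is evaluated once per pattern,
        ' so the query yields at most one message per pattern.
        Dim perPattern = From pattern In patterns
                         Let rx = New Regex(pattern(1))
                         Where fileTexts.Values.Any(Function(content) rx.IsMatch(content))
                         Select pattern(0)

        Console.WriteLine(flattened.Count())
        Console.WriteLine(String.Join(", ", perPattern))
    End Sub
End Module
```

With these sample texts the first query yields a single message, while the second yields all three.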

Here's what I suggest.
(This may be totally off from your requirements, but what you have just feels like a lot of work for nearly zero information content.)

The way you have structured the parallelization is very inefficient: it reads each file once per pattern, so each file is potentially scanned four times with the patterns shown!
Parallelize across the filenames instead.
This means that the Regex instances need to be created and compiled before the loops run.
You also read the whole file each time. The patterns that you've shown do not span lines, so match against one line at a time; the check can then stop reading a file as soon as every pattern has matched, which avoids potentially lots of IO.
I've extracted the per-file checking into a function so it can eliminate redundant IO and Regex matching.
I also simplified the Regex patterns.
Something like (my VB is rusty):

'Edited MTH: based on your comment...
Dim patterns = New List(Of String()) From {
     ({"Check figure link", "(?<!>)(?:figures?|figs?\.) \d+"}),
     ({"Check table link", "(?<!>)(?:tables?|tabs?\.) \d+"}),
     ({"Check section link", "(?<!>)(?:sections?|sect?\.) \d+"}),
     ({"Check space", "</inline>\w+"})}

Dim compiledPatterns = New Dictionary(Of Regex, String)
For Each pat As String() In patterns
    compiledPatterns.Add(New Regex(pat(1), RegexOptions.Compiled), pat(0))
Next

'Edit: Setup to "transpose" the information collected below.
'This would be tricky to do combined with parallelization.
'So it is done as a sequential pass through the collected info.
Dim pathsByMessage = New Dictionary(Of String, List(Of String))
For Each pat As String() In patterns
    pathsByMessage.Add(pat(0), New List(Of String))
Next

Dim filteredFilenames = From tFile In Directory.EnumerateFiles(TextBox1.Text, "*.txt", SearchOption.AllDirectories)
             Where tFile Like "*\#*\#*\#*\txt\#*.txt"

Dim output = From tFile In filteredFilenames.AsParallel
               Let checks = CheckFile(tFile, compiledPatterns)
               Where checks.Any
               Select Path = tFile, Messages = checks

'Edit: Now "transpose" this
'It's OK to process this *sequentially*, even if the query is still running parallelized!
For Each pm In output
    For Each msg In pm.Messages
        pathsByMessage(msg).Add(pm.Path)
    Next
Next

File.WriteAllLines(TextBox1.Text.TrimEnd("\"c) & "\Checklist.log",
                   From pbm In pathsByMessage
                     Where pbm.Value.Any
                     Select String.Format("""{0}""=={1}", pbm.Key, vbNewLine & String.Join(vbNewLine, pbm.Value)))
MsgBox("Process Complete")
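
To make the resulting log layout concrete, here is a small sketch of the transpose step and the format string, run on hypothetical results instead of real files. The paths, the results array and the module name are assumptions for illustration, and the dictionary is filled on demand rather than pre-seeded from the pattern list.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports Microsoft.VisualBasic

Module LogFormatDemo
    Sub Main()
        ' Hypothetical per-file results, shaped like the Path/Messages pairs of the output query.
        Dim results = {
            New With {.Path = "A\txt\110.txt", .Messages = New List(Of String) From {"Check figure link", "Check table link"}},
            New With {.Path = "A\txt\111.txt", .Messages = New List(Of String) From {"Check table link"}}}

        ' Transpose: message -> paths of the files in which it was raised.
        Dim pathsByMessage = New Dictionary(Of String, List(Of String))
        For Each pm In results
            For Each msg In pm.Messages
                If Not pathsByMessage.ContainsKey(msg) Then
                    pathsByMessage.Add(msg, New List(Of String))
                End If
                pathsByMessage(msg).Add(pm.Path)
            Next
        Next

        ' Same format string as the File.WriteAllLines call above:
        ' the quoted message and "==", then one path per line.
        For Each pbm In pathsByMessage
            Console.WriteLine(String.Format("""{0}""=={1}", pbm.Key, vbNewLine & String.Join(vbNewLine, pbm.Value)))
        Next
    End Sub
End Module
```

Each log entry is therefore a quoted message followed by "==" and then the matching file paths, one per line.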


And the per-file check:

Function CheckFile(tFile As String, compiledPatterns As Dictionary(Of Regex, String)) As List(Of String)
    'structure this to eliminate redundant checking
    Dim messages As New List(Of String)
    Dim checks = New HashSet(Of Regex)(compiledPatterns.Keys)

    For Each line In File.ReadLines(tFile)
        If Not checks.Any Then
            Exit For
        End If
        For Each re In checks.ToList
            If re.IsMatch(line) Then
                messages.Add(compiledPatterns(re))
                checks.Remove(re)
            End If
        Next
    Next

    Return messages
End Function
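
A quick way to try CheckFile in isolation is a console sketch along these lines. The function body is repeated so the example compiles on its own; the temp-file name and the sample lines are made up for the demo.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Text.RegularExpressions

Module CheckFileDemo
    ' Same logic as CheckFile above, repeated so this sketch is self-contained.
    Function CheckFile(tFile As String, compiledPatterns As Dictionary(Of Regex, String)) As List(Of String)
        Dim messages As New List(Of String)
        Dim checks = New HashSet(Of Regex)(compiledPatterns.Keys)
        For Each line In File.ReadLines(tFile)
            ' Stop reading once every pattern has been seen.
            If Not checks.Any Then Exit For
            For Each re In checks.ToList
                If re.IsMatch(line) Then
                    messages.Add(compiledPatterns(re))
                    checks.Remove(re)
                End If
            Next
        Next
        Return messages
    End Function

    Sub Main()
        Dim compiledPatterns = New Dictionary(Of Regex, String) From {
            {New Regex("(?<!>)(?:figures?|figs?\.) \d+", RegexOptions.Compiled), "Check figure link"},
            {New Regex("(?<!>)(?:tables?|tabs?\.) \d+", RegexOptions.Compiled), "Check table link"}}

        ' Hypothetical sample file; the name and contents are invented for this demo.
        Dim samplePath = Path.Combine(Path.GetTempPath(), "checkfile-demo.txt")
        Dim sampleLines As String() = {"see fig. 3", "see fig. 4", "no links here", "see table 2"}
        File.WriteAllLines(samplePath, sampleLines)

        ' Each message is reported once per file, however many lines match it.
        Console.WriteLine(String.Join(", ", CheckFile(samplePath, compiledPatterns)))

        File.Delete(samplePath)
    End Sub
End Module
```

Although two lines match the figure pattern, its message should appear only once, because a pattern is dropped from checks after its first match.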

