How to match multiple regex patterns in multiple files and create a log file?
Problem description
I want to search for several regex patterns in text files (*.txt) under a folder whose path I enter in a text box. The folder contains sub-folders, and the txt files sit at paths of the form 12345-2031-30201\2031\30201\txt\110.txt. If a pattern matches in even one file, a string is written to a log file created in that same folder; the program then moves on to the next regex, and so on.
The problem is that the log file shows only the first matched pattern and says nothing about whether the other patterns matched in any file. What seems to happen is that the program opens, say, the first file, searches for the first pattern, finds a match, and writes the text associated with that pattern, "Check figure link". That is all that ends up in the log file. The same file (and some other files) also match the second and third patterns, but the texts associated with those patterns, such as "Check table link" and "Check section link", are never written.
Can anyone help?
What I have tried:
Dim patterns = New List(Of String()) From {
    ({"Check figure link", "(?<!>)(figure|fig\.|figs\.|figures) (\d+)"}),
    ({"Check table link", "(?<!>)(table|tab\.|tabs\.|tables) (\d+)"}),
    ({"Check section link", "(?<!>)(section|sec\.|sect\.|section) (\d+)"}),
    ({"Check space", "</inline>w+"}),}
Dim output = From pattern In patterns.AsParallel
             Let regEx = New Regex(pattern(1), RegexOptions.Compiled)
             From tFile In Directory.EnumerateFiles(TextBox1.Text, "*.txt", SearchOption.AllDirectories)
             Where tFile Like "*\#*\#*\#*\txt\#*.txt" AndAlso regEx.IsMatch(File.ReadAllText(tFile))
             Take 1
             Select pattern(0)
File.WriteAllLines(TextBox1.Text.TrimEnd("\"c) & "\Checklist.log", output)
MsgBox("Process Complete")
So it looks like you want to find which files have errors according to the patterns.
What you have will give you up to 4 lines in the log, indicating which types of error exist somewhere in the file hierarchy.
This seems very uninformative!
Wouldn't it be better to know (at least) which files have errors?
Wouldn't it be better to know which errors are in each of those files?
Here's what I suggest.
(This may be totally off from your requirements, but what you have just feels like a lot of work for nearly zero information content.)
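On the question of why only one line reaches the log: as far as I can tell, Take 1 in the original query applies to the whole flattened sequence of (pattern, file) pairs, not once per pattern, so at most one message survives. Below is a minimal sketch of a per-pattern check. MatchedMessages and the module name are hypothetical names introduced for illustration, and the sketch is sequential rather than parallel.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Text.RegularExpressions

Module PerPatternCheck
    ' Hypothetical helper (not in the original post): returns the message of
    ' every pattern that matches in at least one of the given files.
    Function MatchedMessages(files As IEnumerable(Of String),
                             patterns As List(Of String())) As List(Of String)
        ' Materialize the file list once so the directory is not re-enumerated per pattern.
        Dim fileList = files.ToList()
        Dim result As New List(Of String)
        For Each pattern In patterns
            Dim regEx As New Regex(pattern(1), RegexOptions.Compiled)
            ' Any() stops at the first matching file for THIS pattern only,
            ' which is the per-pattern equivalent of "Take 1".
            If fileList.Any(Function(f) regEx.IsMatch(File.ReadAllText(f))) Then
                result.Add(pattern(0))
            End If
        Next
        Return result
    End Function

    Sub Main(args As String())
        ' args(0) is assumed to be the root folder (TextBox1.Text in the original).
        Dim patterns = New List(Of String()) From {
            ({"Check figure link", "(?<!>)(?:figures?|figs?\.) \d+"}),
            ({"Check table link", "(?<!>)(?:tables?|tabs?\.) \d+"})}
        Dim files = Directory.EnumerateFiles(args(0), "*.txt", SearchOption.AllDirectories).
            Where(Function(f) f Like "*\#*\#*\#*\txt\#*.txt")
        For Each msg In MatchedMessages(files, patterns)
            Console.WriteLine(msg)
        Next
    End Sub
End Module
```

This keeps the original log format (one message per matched pattern) but still reads each file once per pattern, which is the inefficiency discussed next.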
The way you have structured the parallelization is very inefficient: it reads every file once per pattern, so with the four patterns shown each file is potentially scanned four times!
Parallelize across the filenames.
This then means that the Regex instances need to be created and compiled ahead of the execution loops.
You also read the whole file each time. The patterns that you've shown do not span lines, so match against each line one at a time to avoid potentially lots of IO.
I've extracted the per-file checking to a function so it can eliminate redundant IO and Regex matching.
I also simplified the Regex patterns.
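To sanity-check a simplification like this, the original and simplified patterns can be compared on a few sample strings. The sketch below does that for the figure pattern. AgreeOnSamples and the sample strings are made up for illustration. One caveat: the original section pattern lists "section" twice and has no plural, so the simplified sections? version also matches "sections 3", which the original does not (presumably the plural was intended).

```vb
Imports System
Imports System.Text.RegularExpressions

Module PatternComparison
    ' Hypothetical helper: True when both patterns agree (match or no match)
    ' on every sample string.
    Function AgreeOnSamples(original As String, simplified As String,
                            samples As String()) As Boolean
        For Each s In samples
            If Regex.IsMatch(s, original) <> Regex.IsMatch(s, simplified) Then
                Return False
            End If
        Next
        Return True
    End Function

    Sub Main()
        ' Sample strings are invented for this check.
        Dim samples = {"see figure 3", "figures 12", "fig. 4", "figs. 5",
                       "<b>figure 6", "figure", "configure 7"}
        Dim original = "(?<!>)(figure|fig\.|figs\.|figures) (\d+)"
        Dim simplified = "(?<!>)(?:figures?|figs?\.) \d+"
        ' Prints whether the two patterns agree on all samples.
        Console.WriteLine(AgreeOnSamples(original, simplified, samples))
    End Sub
End Module
```

The non-capturing groups (?:...) are enough here because the code only calls IsMatch and never reads the captured text.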
Something like (my VB is rusty):
'Edited MTH: based on your comment...
'Needs Imports for System.IO, System.Linq and System.Text.RegularExpressions.
Dim patterns = New List(Of String()) From {
    ({"Check figure link", "(?<!>)(?:figures?|figs?\.) \d+"}),
    ({"Check table link", "(?<!>)(?:tables?|tabs?\.) \d+"}),
    ({"Check section link", "(?<!>)(?:sections?|sect?\.) \d+"}),
    ({"Check space", "</inline>\w+"})}
Dim compiledPatterns = New Dictionary(Of Regex, String)
For Each pat As String() In patterns
    compiledPatterns.Add(New Regex(pat(1), RegexOptions.Compiled), pat(0))
Next
'Edit: Setup to "transpose" the information collected below.
'This would be tricky to do combined with parallelization.
'So it is done as a sequential pass through the collected info.
Dim pathsByMessage = New Dictionary(Of String, List(Of String))
For Each pat As String() In patterns
    pathsByMessage.Add(pat(0), New List(Of String))
Next
Dim filteredFilenames = From tFile In Directory.EnumerateFiles(TextBox1.Text, "*.txt", SearchOption.AllDirectories)
                        Where tFile Like "*\#*\#*\#*\txt\#*.txt"
Dim output = From tFile In filteredFilenames.AsParallel
             Let checks = CheckFile(tFile, compiledPatterns)
             Where checks.Any
             Select Path = tFile, Messages = checks
'Edit: Now "transpose" this
'It's OK to process this *sequentially*, even if the query is still running parallelized!
For Each pm In output
    For Each msg In pm.Messages
        pathsByMessage(msg).Add(pm.Path)
    Next
Next
File.WriteAllLines(TextBox1.Text.TrimEnd("\"c) & "\Checklist.log",
                   From pbm In pathsByMessage
                   Where pbm.Value.Any
                   Select String.Format("""{0}""=={1}", pbm.Key, vbNewLine & String.Join(vbNewLine, pbm.Value)))
MsgBox("Process Complete")
And
Function CheckFile(tFile As String, compiledPatterns As Dictionary(Of Regex, String)) As List(Of String)
    'structure this to eliminate redundant checking
    Dim messages As New List(Of String)
    Dim checks = New HashSet(Of Regex)(compiledPatterns.Keys)
    For Each line In File.ReadLines(tFile)
        If Not checks.Any Then
            Exit For
        End If
        For Each re In checks.ToList
            If re.IsMatch(line) Then
                messages.Add(compiledPatterns(re))
                checks.Remove(re)
            End If
        Next
    Next
    Return messages
End Function