Optimize Word document keyword search

Problem description

I'm trying to search for keywords across a large number of MS Word documents and write the results to a file. I've got a working script, but I wasn't aware of the scale, and what I've got isn't nearly efficient enough: it would take days to plod through everything.

The script as it stands takes keywords from CompareData.txt, runs each one against all the files in a specific folder, then appends the result to a file.

So when I'm done, I will know how many files contain each specific keyword.

[cmdletBinding()] 
Param( 
$Path = "C:\willscratch" 
) #end param 
$findTexts = (Get-Content c:\scratch\CompareData.txt)
Foreach ($Findtext in $FindTexts)
{
$matchCase = $false 
$matchWholeWord = $true 
$matchWildCards = $false 
$matchSoundsLike = $false 
$matchAllWordForms = $false 
$forward = $true 
$wrap = 1 
$application = New-Object -comobject word.application 
$application.visible = $False 
$docs = Get-childitem -path $Path -Recurse -Include *.docx  
$i = 1 
$totaldocs = 0 
Foreach ($doc in $docs) 
{ 
Write-Progress -Activity "Processing files" -status "Processing $($doc.FullName)" -PercentComplete ($i /$docs.Count * 100) 
$document = $application.documents.open($doc.FullName) 
$range = $document.content 
$null = $range.movestart() 
$wordFound = $range.find.execute($findText,$matchCase, 
  $matchWholeWord,$matchWildCards,$matchSoundsLike, 
  $matchAllWordForms,$forward,$wrap) 
  if($wordFound) 
    { 
     $doc.fullname 
     $document.Words.count 
     $totaldocs ++ 
  } #end if $wordFound 
$document.close() 
$i++ 
} #end foreach $doc 
$application.quit() 
"There are $totaldocs total files with $findText"  | Out-File -Append C:scratchoutput.txt

#clean up stuff 
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null 
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null 
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null 
Remove-Variable -Name application 
[gc]::collect() 
[gc]::WaitForPendingFinalizers() 
}

What I'd like to do is figure out a way to search each file once for everything in CompareData.txt, rather than iterating through the files over and over. If I were dealing with a small set of data, the approach I've got would get the job done, but I've come to find out that both the data in CompareData.txt and the source directory of Word files will be very large.

Any ideas on how to optimize this?

Recommended answer

Right now you're doing this (pseudocode):

foreach $Keyword {
    create Word Application
    foreach $File {
        load Word Document from $File
        find $Keyword
    }
}

That means that if you have 100 keywords and 10 documents, you're opening and closing 100 instances of Word and loading a thousand Word documents before you're done.

Do this instead:

create Word Application
foreach $File {
    load Word Document from $File
    foreach $Keyword {
        find $Keyword
    }
}

That way you launch only one instance of Word and load each document only once.
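
In PowerShell, the restructured loop might look roughly like the sketch below. It assumes Word is installed; the function name `Search-WordKeyword` and its parameters are made up for illustration, and the `Find.Execute` arguments are simply copied from the question's script.

```powershell
# Hypothetical helper: one Word instance, each document opened once,
# every keyword tested against the open document.
function Search-WordKeyword {
    param(
        [string]$Path,        # folder holding the .docx files
        [string[]]$Keywords   # keywords, e.g. from Get-Content
    )

    # Ignore blank lines from the keyword file
    $Keywords = @($Keywords | Where-Object { $_ })
    $Docs     = Get-ChildItem -Path $Path -Recurse -Include *.docx

    # keyword -> number of documents that contain it
    $Counts = @{}
    foreach ($Keyword in $Keywords) { $Counts[$Keyword] = 0 }

    $Word = New-Object -ComObject Word.Application
    $Word.Visible = $false
    try {
        foreach ($Doc in $Docs) {
            # Open(FileName, ConfirmConversions, ReadOnly)
            $Document = $Word.Documents.Open($Doc.FullName, $false, $true)
            try {
                foreach ($Keyword in $Keywords) {
                    # Content returns a range over the whole document body
                    $Range = $Document.Content
                    # text, matchCase, wholeWord, wildcards, soundsLike,
                    # allWordForms, forward, wrap (same values as the question)
                    $Found = $Range.Find.Execute($Keyword, $false, $true,
                        $false, $false, $false, $true, 1)
                    if ($Found) { $Counts[$Keyword]++ }
                }
            }
            finally { $Document.Close() }
        }
    }
    finally {
        $Word.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($Word)
    }
    return $Counts
}
```

Calling `Search-WordKeyword -Path $Path -Keywords (Get-Content C:\scratch\CompareData.txt)` would then return a hashtable of counts that can be written out in one pass at the end, instead of appending to the output file once per keyword.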

As noted in the comments, you can optimize the whole process further by using the OpenXML SDK rather than launching Word:

(assuming you've installed the OpenXML SDK in its default location)

# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'

# Grab the keywords and file names    
$Keywords  = Get-Content C:\scratch\CompareData.txt
$Documents = Get-childitem -path $Path -Recurse -Include *.docx  

# hashtable to store results per document
$KeywordMatches = @{}

# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

foreach($Docx in $Documents)
{
    # create array to hold matched keywords
    $KeywordMatches[$Docx.FullName] = @()

    # open document, wrap content stream in streamreader 
    $Document       = $WordDoc::Open($Docx.FullName, $false)
    $DocumentStream = $Document.MainDocumentPart.GetStream()
    $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

    # read entire document
    $DocumentContent = $DocumentReader.ReadToEnd()

    # test for each keyword
    foreach($Keyword in $Keywords)
    {
        $Pattern   = [regex]::Escape($KeyWord)
        $WordFound = $DocumentContent -match $Pattern
        if($WordFound)
        {
            $KeywordMatches[$Docx.FullName] += $Keyword
        }
    }

    $DocumentReader.Dispose()
    $Document.Dispose()
}
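
One caveat with the loop above: `$DocumentContent` is the raw XML of the main document part, so a keyword such as `body` would also match tag names like `<w:body>`, and the plain regex match is not a whole-word match the way the question's `$matchWholeWord = $true` was. A possible refinement, sketched here with two hypothetical helper functions (`ConvertTo-DocumentText` and `Test-WholeWord` are not part of any SDK), is to strip the markup once per document and then test each keyword with `\b` anchors. It assumes keywords begin and end with a letter or digit, since `\b` only recognizes boundaries next to word characters.

```powershell
# Hypothetical helper: reduce WordprocessingML to plain text (rough, regex-based).
function ConvertTo-DocumentText {
    param([string]$Xml)
    # Paragraph ends, tabs and line breaks become spaces so words stay separate
    $Text = $Xml -replace '</w:p>|<w:(tab|br|cr)\b[^>]*>', ' '
    # Remove every remaining tag; text split across runs is joined back together
    $Text = $Text -replace '<[^>]+>', ''
    # Decode entities such as &amp; and &lt;
    return [System.Net.WebUtility]::HtmlDecode($Text)
}

# Hypothetical helper: case-insensitive whole-word test.
function Test-WholeWord {
    param([string]$Text, [string]$Keyword)
    $Pattern = '\b' + [regex]::Escape($Keyword) + '\b'
    return $Text -match $Pattern
}
```

Inside the loop, `ConvertTo-DocumentText $DocumentContent` would be called once right after `ReadToEnd()`, and `Test-WholeWord -Text $Text -Keyword $Keyword` would take the place of the `-match` line.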

Now you can show the number of matched keywords for each document:

$KeywordMatches.GetEnumerator() | Select @{n="File";E={$_.Key}},@{n="Count";E={$_.Value.Count}}
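
The question asked for the reverse view, namely how many files contain each keyword. Since `$KeywordMatches` maps each file to its matched keywords, that can be derived by inverting the hashtable. A minimal sketch, where the function name `Get-KeywordFileCount` is hypothetical:

```powershell
# Hypothetical helper: turn "file -> matched keywords" into "keyword -> file count".
function Get-KeywordFileCount {
    param([hashtable]$KeywordMatches)

    $Counts = @{}
    foreach ($File in $KeywordMatches.Keys) {
        # -Unique so a keyword listed twice still counts the file only once
        foreach ($Keyword in ($KeywordMatches[$File] | Select-Object -Unique)) {
            if (-not $Counts.ContainsKey($Keyword)) { $Counts[$Keyword] = 0 }
            $Counts[$Keyword]++
        }
    }
    return $Counts
}
```

Piping `(Get-KeywordFileCount $KeywordMatches).GetEnumerator()` through `ForEach-Object { "There are $($_.Value) total files with $($_.Key)" }` and on to `Out-File` would reproduce the output format of the original script. Keywords that matched no file do not appear in the result.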
