Optimize Word document keyword search

Question

I'm trying to search for keywords across a large number of MS Word documents and write the results to a file. I have a working script, but I wasn't aware of the scale when I wrote it, and it isn't nearly efficient enough: it would take days to plod through everything.

The script as it stands takes keywords from CompareData.txt, runs each one against all the files in a specific folder, and appends the result to a file.

So when I'm done I will know how many files contain each keyword.

[cmdletBinding()] 
Param( 
$Path = "C:\willscratch\" 
) #end param 
$findTexts = (Get-Content c:\scratch\CompareData.txt)
Foreach ($Findtext in $FindTexts)
{
$matchCase = $false 
$matchWholeWord = $true 
$matchWildCards = $false 
$matchSoundsLike = $false 
$matchAllWordForms = $false 
$forward = $true 
$wrap = 1 
$application = New-Object -comobject word.application 
$application.visible = $False 
$docs = Get-childitem -path $Path -Recurse -Include *.docx  
$i = 1 
$totaldocs = 0 
Foreach ($doc in $docs) 
{ 
Write-Progress -Activity "Processing files" -status "Processing $($doc.FullName)" -PercentComplete ($i /$docs.Count * 100) 
$document = $application.documents.open($doc.FullName) 
$range = $document.content 
$null = $range.movestart() 
$wordFound = $range.find.execute($findText,$matchCase, 
  $matchWholeWord,$matchWildCards,$matchSoundsLike, 
  $matchAllWordForms,$forward,$wrap) 
  if($wordFound) 
    { 
     $doc.fullname 
     $document.Words.count 
     $totaldocs ++ 
  } #end if $wordFound 
$document.close() 
$i++ 
} #end foreach $doc 
$application.quit() 
"There are $totaldocs total files with $findText"  | Out-File -Append C:\scratch\output.txt

#clean up stuff 
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null 
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null 
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null 
Remove-Variable -Name application 
[gc]::collect() 
[gc]::WaitForPendingFinalizers() 
}

What I'd like to do is figure out a way to search each file once for everything in CompareData.txt, rather than iterate over the files again for every keyword. If I were dealing with a small set of data, the approach I've got would get the job done, but I've come to find out that both CompareData.txt and the source Word file directory will be very large.

Any ideas on how to optimize this?

Recommended answer

Right now you're doing this (pseudocode):

foreach $Keyword {
    create Word Application
    foreach $File {
        load Word Document from $File
        find $Keyword
    }
}

That means that with 100 keywords and 10 documents, you open and close 100 instances of Word and load a thousand Word documents before you're done.

Do this instead:

create Word Application
foreach $File {
    load Word Document from $File
    foreach $Keyword {
        find $Keyword
    }
}

That way you launch only one instance of Word and load each document only once.
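The restructured pseudocode could look roughly like this in PowerShell. This is a minimal sketch, not a tested replacement for the original script: the function name `Find-KeywordsInWordDocs` and its parameters are hypothetical, while the `Find.Execute` arguments are the same values the question's script passes (no case match, whole word, no wildcards, forward, wrap = 1).

```powershell
# Hypothetical helper: counts, per keyword, how many .docx files under $Path contain it.
function Find-KeywordsInWordDocs {
    param(
        [string]$Path,
        [string[]]$Keywords
    )

    # keyword -> number of documents containing it
    $results = @{}
    foreach ($keyword in $Keywords) { $results[$keyword] = 0 }

    # Word is launched once for the whole run
    $application = New-Object -ComObject Word.Application
    $application.Visible = $false
    try {
        $docs = Get-ChildItem -Path $Path -Recurse -Include *.docx
        foreach ($doc in $docs) {
            # each document is opened once, then tested against every keyword
            $document = $application.Documents.Open($doc.FullName)
            try {
                foreach ($keyword in $Keywords) {
                    # take a fresh range per keyword, since a successful Find
                    # redefines the range it was called on
                    $range = $document.Content
                    $found = $range.Find.Execute($keyword, $false, $true,
                        $false, $false, $false, $true, 1)
                    if ($found) { $results[$keyword]++ }
                }
            }
            finally { $document.Close() }
        }
    }
    finally {
        $application.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($application)
    }
    $results
}

# Example call (paths taken from the question):
# Find-KeywordsInWordDocs -Path 'C:\willscratch\' -Keywords (Get-Content C:\scratch\CompareData.txt)
```

The `try`/`finally` blocks are there so that Word is closed even if one document fails to open, which matters on a run that may take hours.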

As noted in the comments, you may be able to speed up the whole process by using the OpenXML SDK rather than launching Word:

(assuming you've installed the OpenXML SDK in its default location)

# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'

# Grab the keywords and file names ($Path is the folder to search, as in the question's Param block)
$Keywords  = Get-Content C:\scratch\CompareData.txt
$Documents = Get-ChildItem -Path $Path -Recurse -Include *.docx

# hashtable to store results per document
$KeywordMatches = @{}

# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

foreach($Docx in $Documents)
{
    # create array to hold matched keywords
    $KeywordMatches[$Docx.FullName] = @()

    # open document, wrap content stream in streamreader 
    $Document       = $WordDoc::Open($Docx.FullName, $false)
    $DocumentStream = $Document.MainDocumentPart.GetStream()
    $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

    # read the entire main document part (raw XML, markup included)
    $DocumentContent = $DocumentReader.ReadToEnd()

    # test for each keyword
    foreach($Keyword in $Keywords)
    {
        $Pattern   = [regex]::Escape($KeyWord)
        $WordFound = $DocumentContent -match $Pattern
        if($WordFound)
        {
            $KeywordMatches[$Docx.FullName] += $Keyword
        }
    }

    $DocumentReader.Dispose()
    $Document.Dispose()
}
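With a very large keyword list, the inner loop above still scans the whole document once per keyword. One possible alternative, assuming every keyword is a single word and whole-word, case-insensitive matching is wanted (as in the question's `$matchWholeWord = $true`), is to collect the document's distinct words into a set once and then look each keyword up in it. The helper name `Get-MatchedKeywords` is hypothetical, and the sketch needs PowerShell 5 or later for the `::new` syntax.

```powershell
# Hypothetical helper: returns the keywords that occur as whole words in $Text.
# Assumes single-word keywords; multi-word phrases are not handled.
function Get-MatchedKeywords {
    param(
        [string]$Text,
        [string[]]$Keywords
    )

    # One pass over the text: build a case-insensitive set of its distinct words
    $words = [System.Collections.Generic.HashSet[string]]::new(
        [System.StringComparer]::OrdinalIgnoreCase)
    foreach ($match in [regex]::Matches($Text, '\w+')) {
        [void]$words.Add($match.Value)
    }

    # Each keyword is then a set membership test instead of a full scan
    $Keywords | Where-Object { $_ -and $words.Contains($_) }
}

# Example:
# Get-MatchedKeywords -Text 'The quick brown Fox' -Keywords 'fox','dog','QUICK'
```

If this is fed the raw XML read above, element names such as `body` count as words too, so it is probably better to pass it the document text with the markup stripped.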

Now you can show the number of matched keywords for each document:

$KeywordMatches.GetEnumerator() | Select-Object @{n="File";E={$_.Key}},@{n="Count";E={$_.Value.Count}}
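The question asked for the opposite view, namely how many files contain each keyword, whereas `$KeywordMatches` maps each file to its keywords. A minimal sketch of inverting it follows; the function name `Get-KeywordFileCounts` is hypothetical, and the output line reuses the wording of the question's script.

```powershell
# Hypothetical helper: turns a file -> keywords hashtable into keyword -> file count.
function Get-KeywordFileCounts {
    param([hashtable]$KeywordMatches)

    $counts = @{}
    foreach ($entry in $KeywordMatches.GetEnumerator()) {
        foreach ($keyword in $entry.Value) {
            # a missing key yields $null, and 1 + $null evaluates to 1
            $counts[$keyword] = 1 + $counts[$keyword]
        }
    }
    $counts
}

# Example: append one line per keyword to the question's output file
# (Get-KeywordFileCounts $KeywordMatches).GetEnumerator() |
#     ForEach-Object { "There are $($_.Value) total files with $($_.Key)" } |
#     Out-File -Append C:\scratch\output.txt
```

Keywords that matched no file do not appear in the result; if a zero line is needed for them, the counts could be pre-seeded from `$Keywords`.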
