Advanced pattern matching in PowerShell


Problem description

I hope you can help me with something. Thanks to @mklement0, I have a great script that reduces an alphabetically sorted word list to its most basic initial (prefix) patterns. What is missing is a full-text search and selection. Here is what the current script does with a small sample of words in a Words.txt file:

App
Apple
Apply
Sword
Swords
Word
Words

becomes:

App
Sword
Word

This is great, as it really narrows the list down to one basic pattern per line! However, because the script works line by line, the result still contains a pattern that can be narrowed down further, namely "Word" (capitalization is not important), so ideally the output should be:

App
Word

"Sword" is removed because it is covered by the more basic pattern "Word", just with a prefix ("S") in front of it.

Do you have any suggestions on how to achieve this? Keep in mind that this will be a dictionary list of about 250k words, so I will not know ahead of time what I am looking for.

Code (from a related post; handles prefix matching only):

$outFile = [IO.File]::CreateText('C:\Temp\Results.txt')   # Output file location
$prefix = ''                     # Initialize the prefix pattern

try {
  foreach ($line in [IO.File]::ReadLines('C:\Temp\Words.txt')) {   # Input file name
    if ($line -like $prefix) {
      continue                   # Same prefix, skip
    }

    $line                        # Visual output of new unique prefix
    $prefix = "$line*"           # Save the new prefix pattern
    $outFile.WriteLine($line)    # Write to the output file
  }
}
finally {
  $outFile.Close()               # Flush buffered output and release the file
}

Recommended answer

You can try a two-step approach:

  • Step 1: Find the list of unique prefixes in the alphabetically sorted word list. This is done by reading the lines sequentially, so only the unique prefixes as a whole need to be held in memory.

  • Step 2: Sort the resulting prefixes by length and iterate over them, checking in each iteration whether the word at hand is already represented in the result list by a substring of it.

      • The result list starts out empty; whenever the word at hand has no substring in the result list, it is appended to the list.

The result list is implemented as a regular expression with alternation (|), so that a word can be matched against all already-found unique words in a single operation.
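To illustrate how such an alternation regex behaves, here is a minimal sketch using the sample words from the question. The variable name `$regex` and the word "Pear" are only for illustration. Note that `-match` is case-insensitive by default and not anchored, so it finds the pattern anywhere in the input.

```powershell
# Regex with two alternatives: matches any string containing "App" or "Word".
$regex = 'App|Word'

# -match is case-insensitive by default and is not anchored,
# so "Sword" matches via its substring "word".
'Sword' -match $regex     # True: already represented, would be skipped
'Apple' -match $regex     # True
'Pear'  -match $regex     # False: would be added as a new alternative

# Words that may contain regex metacharacters can be escaped before
# they are appended as a further alternative.
$regex += '|' + [regex]::Escape('Pear')
$regex                    # App|Word|Pear
```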

You will have to see whether the performance is good enough; for best performance, the code uses .NET types directly as much as possible.

# Read the input file and build the list of unique prefixes, assuming
# alphabetical sorting.
$inFilePath = 'C:\Temp\Words.txt' # Be sure to use a full path.
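$prefix = ''   # Reset, in case a value lingers from an earlier run in this session.
$uniquePrefixWords =
  foreach ($word in [IO.File]::ReadLines($inFilePath)) {
    if ($word -like $prefix) { continue }   # same prefix, skip
    $word                                   # output the new unique prefix
    $prefix = "$word*"                      # remember it as a wildcard pattern
  }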

# Sort the prefixes by length in ascending order (shorter ones first).
# Note: This is a more time- and space-efficient alternative to:
#    $uniquePrefixWords = $uniquePrefixWords | Sort-Object -Property Length
[Array]::Sort($uniquePrefixWords.ForEach('Length'), $uniquePrefixWords)

# Build the result list of unique shortest words with the help of a regex.
# Skip later - and therefore longer - words if they are already represented
# in the result list by a substring. Note: -notmatch is case-insensitive.
$uniqueWords = [Collections.Generic.List[string]]::new()
$regexUniqueWords = ''
foreach ($word in $uniquePrefixWords) {
  if ($uniqueWords.Count -eq 0 -or $word -notmatch $regexUniqueWords) {
    # New unique word found: record it and add it to the regex as an
    # alternation (|), escaped so that characters such as "." match literally.
    $uniqueWords.Add($word)
    if ($regexUniqueWords) { $regexUniqueWords += '|' }
    $regexUniqueWords += [regex]::Escape($word)
  }
}

# The list now contains all unique words.
# Copy it to an array and sort the array alphabetically again...
$resultWords = $uniqueWords.ToArray()
[Array]::Sort($resultWords)

# ... and write it to the output file.
$outFilePath = 'C:\Temp\Results.txt' # Be sure to use a full path.
[IO.File]::WriteAllLines($outFilePath, $resultWords)
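To check the approach without touching any files, the two steps can be wrapped in a small function and run against the sample list from the question. This is only a sketch: the function name `Get-ShortestUniqueWord`, its `-SortedWords` parameter, and the stopwatch timing are assumptions made for illustration, and `::new()` requires PowerShell 5 or later. Unlike the file-based code, it uses `Sort-Object` for brevity.

```powershell
# Hypothetical helper: applies both steps to an in-memory, alphabetically
# sorted word list and returns the shortest unique words.
function Get-ShortestUniqueWord {
  param([string[]] $SortedWords)

  # Step 1: keep only words that do not start with the previously kept word.
  $prefix = ''
  $candidates = [Collections.Generic.List[string]]::new()
  foreach ($word in $SortedWords) {
    if ($word -like $prefix) { continue }
    $candidates.Add($word)
    $prefix = "$word*"
  }

  # Step 2: process the shortest words first; keep a word only if no
  # already-kept word occurs in it as a (case-insensitive) substring.
  $kept = [Collections.Generic.List[string]]::new()
  $regex = ''
  foreach ($word in ($candidates | Sort-Object -Property Length)) {
    if ($kept.Count -gt 0 -and $word -match $regex) { continue }
    $kept.Add($word)
    if ($regex) { $regex += '|' }
    $regex += [regex]::Escape($word)
  }

  # Return the kept words in alphabetical order.
  $kept | Sort-Object
}

$sample = 'App', 'Apple', 'Apply', 'Sword', 'Swords', 'Word', 'Words'
$timer  = [Diagnostics.Stopwatch]::StartNew()
$result = Get-ShortestUniqueWord -SortedWords $sample
$timer.Stop()

$result                                              # App, Word
"Elapsed: $($timer.Elapsed.TotalMilliseconds) ms"
```

Timing the same call against the full 250k-word list is one way to judge whether the growing regex is fast enough in practice.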
