PowerShell search script that ignores binary files

Problem Description

I am really used to doing grep -iIr on the Unix shell but I haven't been able to get a PowerShell equivalent yet.

Basically, the above command searches the target folders recursively and ignores binary files because of the "-I" option. This option is also equivalent to the --binary-files=without-match option, which says "treat binary files as not matching the search string".

So far I have been using Get-ChildItem -r | Select-String as my PowerShell grep replacement, with the occasional Where-Object added. But I haven't figured out a way to ignore all binary files like the grep -I command does.
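
For example, a pipeline of that shape might look like the following (the extension filter is purely illustrative, not part of the question):

# recursive search, excluding one binary extension by hand
Get-ChildItem -Recurse |
    Where-Object { !$_.PSIsContainer -and $_.Extension -ne ".dll" } |
    Select-String -Pattern "foo"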

How can binary files be filtered or ignored with PowerShell?

So for a given path, I only want Select-String to search text files.

A few more hours on Google produced this question: "How to identify the contents of a file is ASCII or Binary". The question says "ASCII" but I believe the writer meant "Text Encoded", like myself.

It seems that an isBinary() needs to be written to solve this issue. Probably a C# command-line utility to make it more useful.

It seems that what grep is doing is checking for an ASCII NUL byte or a UTF-8 overlong sequence. If those exist, it considers the file binary. This is a single memchr() call.
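
For illustration, here is a minimal PowerShell sketch of that NUL-byte heuristic (my own example, not grep's actual code; the function name and the 1KB read size are assumptions):

function Test-HasNulByte([System.IO.FileInfo]$item)
{
    try
    {
        $stream = $item.OpenRead()
        $buffer = new-object byte[] 1024
        $count  = $stream.Read($buffer, 0, $buffer.Count)

        # grep-style heuristic: any NUL byte in the first chunk => treat the file as binary
        for ($i = 0; $i -lt $count; ++$i)
        {
            if ($buffer[$i] -eq 0) { return $true }
        }
        return $false
    }
    finally
    {
        if ($stream) { $stream.Dispose() }
    }
}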

Recommended Answer

On Windows, file extensions are usually good enough:

# 'ss' below is shorthand for Select-String (the built-in alias is 'sls'); define it once:
Set-Alias ss Select-String

# all C# and related files (projects, source control metadata, etc)
dir -r -fil *.cs* | ss foo

# exclude the binary types most likely to pollute your development workspace
dir -r -exclude *.exe, *.dll, *.pdb | ss foo

# stick the next three lines in your $profile (refining them over time)
$bins = New-Object System.Collections.Generic.List[string]
$bins.AddRange( [string[]]@(".exe", ".dll", ".pdb", ".png", ".mdf", ".docx") )
function IsBin([System.IO.FileInfo]$item) { $bins.Contains($item.Extension.ToLower()) }
dir -r | ? { !(IsBin $_) } | ss foo

But of course, file extensions are not perfect. Nobody likes typing long lists, and plenty of files are misnamed anyway.

I don't think Unix has any special binary vs text indicators in the filesystem. (Well, VMS did, but I doubt that's the source of your grep habits.) I looked at the implementation of Grep -I, and apparently it's just a quick-n-dirty heuristic based on the first chunk of the file. Turns out that's a strategy I have a bit of experience with. So here's my advice on choosing a heuristic function that is appropriate for Windows text files:

  • Examine at least 1KB of the file. Lots of file formats begin with a header that looks like text but will bust your parser shortly afterward. The way modern hardware works, reading 50 bytes has roughly the same I/O overhead as reading 4KB.
  • If you only care about straight ASCII, exit as soon as you see something outside the character range [31-127 plus CR and LF]. You might accidentally exclude some clever ASCII art, but trying to separate those cases from binary junk is nontrivial.
  • If you want to handle Unicode text, let MS libraries handle the dirty work. It's harder than you think. From Powershell you can easily access the IMultiLang2 interface (COM) or Encoding.GetEncoding static method (.NET). Of course, they are still just guessing. Raymond's comments on the Notepad detection algorithm (and the link within to Michael Kaplan) are worth reviewing before deciding exactly how you want to mix & match the platform-provided libraries.
  • If the outcome is important -- i.e. a flaw will do something worse than just clutter up your grep console -- then don't be afraid to hard-code some file extensions for the sake of accuracy. For example, *.PDF files occasionally have several KB of text at the front despite being a binary format, leading to the notorious bugs linked above. Similarly, if you have a file extension that is likely to contain XML or XML-like data, you might try a detection scheme similar to Visual Studio's HTML editor. (SourceSafe 2005 actually borrows this algorithm for some cases.)
  • Whatever else happens, have a reasonable backup plan.

As an example, here's the quick ASCII detector:

function IsAscii([System.IO.FileInfo]$item)
{
    begin
    {
        # whitelist: LF, CR, plus the byte range described above
        $validList = New-Object System.Collections.Generic.List[byte]
        $validList.AddRange([byte[]] (10,13) )
        $validList.AddRange([byte[]] (31..127) )
    }

    process
    {
        try
        {
            # read the first 1KB of the file
            $reader = $item.OpenRead()
            $bytes = new-object byte[] 1024
            $numRead = $reader.Read($bytes, 0, $bytes.Count)

            # any byte outside the whitelist means "not plain ASCII text"
            for($i=0; $i -lt $numRead; ++$i)
            {
                if (!$validList.Contains($bytes[$i]))
                    { return $false }
            }
            $true
        }
        finally
        {
            if ($reader)
                { $reader.Dispose() }
        }
    }
}

The usage pattern I'm targeting is a where-object clause inserted in the pipeline between "dir" and "ss". There are other ways, depending on your scripting style.
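
For example, assuming the IsAscii function above has been loaded into the session (and ss is the Select-String alias defined earlier):

# keep only files whose first 1KB passes the whitelist, then search them
dir -r | ? { !$_.PSIsContainer -and (IsAscii $_) } | ss foo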

Improving the detection algorithm along one of the suggested paths is left to the reader.

edit: I started replying to your comment in a comment of my own, but it got too long...

Above, I looked at the problem from the POV of whitelisting known-good sequences. In the application I maintained, incorrectly storing a binary as text had far worse consequences than vice versa. The same is true for scenarios where you are choosing which FTP transfer mode to use, or what kind of MIME encoding to send to an email server, etc.

In other scenarios, blacklisting the obviously bogus and allowing everything else to be called text is an equally valid technique. While U+0000 is a valid code point, it's pretty much never found in real world text. Meanwhile, 0 is quite common in structured binary files (namely, whenever a fixed-byte-length field needs padding), so it makes a great simple blacklist. VSS 6.0 used this check alone and did ok.

Aside: *.zip files are a case where checking for NUL is riskier. Unlike most binaries, their structured "header" (footer?) block is at the end, not the beginning. Assuming ideal entropy compression, the chance of no NUL in the first 1KB is (1-1/256)^1024, or about 2%. Luckily, simply scanning the rest of the 4KB cluster that NTFS reads anyway will drive the risk down to 0.00001% without having to change the algorithm or write another special case.
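
That arithmetic is easy to sanity-check from a PowerShell prompt:

# probability that 1KB of ideally-random bytes contains no NUL byte at all
[math]::Pow(1 - 1/256, 1024)    # roughly 0.018, i.e. about 2%

# probability over a full 4KB NTFS cluster instead
[math]::Pow(1 - 1/256, 4096)    # roughly 1.1e-07, i.e. about 0.00001%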

To exclude invalid UTF-8, add C0-C1 and F8-FD and FE-FF (once you've seeked past the possible BOM) to the blacklist. Very incomplete since you're not actually validating the sequences, but close enough for your purposes. If you want to get any fancier than this, it's time to call one of the platform libraries like IMultiLang2::DetectInputCodepage.
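
A rough sketch of that blacklist (the function name and the 1KB read are my own choices; as noted above, it does not truly validate UTF-8 sequences):

function HasInvalidUtf8Lead([System.IO.FileInfo]$item)
{
    # bytes that never appear in well-formed UTF-8: C0-C1, F8-FD, FE-FF
    $blacklist = New-Object System.Collections.Generic.List[byte]
    $blacklist.AddRange([byte[]] (0xC0..0xC1) )
    $blacklist.AddRange([byte[]] (0xF8..0xFF) )

    try
    {
        $reader = $item.OpenRead()
        $bytes = new-object byte[] 1024
        $numRead = $reader.Read($bytes, 0, $bytes.Count)

        # skip a UTF-8 BOM (EF BB BF) if present, per the caveat above
        $start = 0
        if ($numRead -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
            { $start = 3 }

        for ($i = $start; $i -lt $numRead; ++$i)
        {
            if ($blacklist.Contains($bytes[$i])) { return $true }
        }
        $false
    }
    finally
    {
        if ($reader) { $reader.Dispose() }
    }
}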

Not sure why C8 (200 decimal) is on Grep's list. It's not an overlong encoding. For example, the sequence C8 80 represents Ȁ (U+0200). Maybe something specific to Unix.
