PowerShell search script that ignores binary files


Problem description



I am really used to doing grep -iIr on the Unix shell but I haven't been able to get a PowerShell equivalent yet.

Basically, the above command searches the target folders recursively and ignores binary files because of the "-I" option. This option is also equivalent to the --binary-files=without-match option, which says "treat binary files as not matching the search string".

So far I have been using Get-ChildItem -r | Select-String as my PowerShell grep replacement, with the occasional Where-Object added. But I haven't figured out a way to ignore all binary files the way the grep -I command does.

How can binary files be filtered out or ignored with PowerShell?

So for a given path, I only want Select-String to search text files.

EDIT: A few more hours on Google produced this question: "How to identify the contents of a file is ASCII or Binary". The question says "ASCII", but I believe the writer meant "text encoded", as I do.

EDIT: It seems that an isBinary() needs to be written to solve this issue, probably as a C# command-line utility to make it more useful.

EDIT: It seems that what grep is doing is checking for an ASCII NUL byte or a UTF-8 overlong sequence. If either is present, it considers the file binary. This is a single memchr() call.
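In PowerShell terms, the NUL-byte half of that check might look roughly like the sketch below. This is only an illustration of the heuristic described above, not grep's actual code: Test-IsBinary is an invented name, the 8KB sample size is an arbitrary choice, and the UTF-8 overlong test is omitted.

# Hedged sketch: report a file as binary if its first chunk contains a NUL byte.
function Test-IsBinary([System.IO.FileInfo]$item)
{
    $stream = $item.OpenRead()
    try
    {
        $buffer = new-object byte[] 8192      # arbitrary sample size
        $numRead = $stream.Read($buffer, 0, $buffer.Length)
        for ($i = 0; $i -lt $numRead; ++$i)
        {
            if ($buffer[$i] -eq 0) { return $true }   # NUL byte => treat as binary
        }
        return $false
    }
    finally
    {
        $stream.Dispose()
    }
}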

Solution

On Windows, file extensions are usually good enough:

# all C# and related files (projects, source control metadata, etc)
# ("ss" is used as shorthand for Select-String below; define it with set-alias or spell it out)
dir -r -filter *.cs* | ss foo

# exclude the binary types most likely to pollute your development workspace
dir -r -exclude *.exe, *.dll, *.pdb | ss foo

# stick the first three lines in your $profile (refining them over time)
$bins = new-object System.Collections.Generic.List[string]
$bins.AddRange( [string[]]@(".exe", ".dll", ".pdb", ".png", ".mdf", ".docx") )
function IsBin([System.IO.FileInfo]$item) { $bins.Contains($item.Extension.ToLower()) }
dir -r | ? { !(IsBin $_) } | ss foo

But of course, file extensions are not perfect. Nobody likes typing long lists, and plenty of files are misnamed anyway.

I don't think Unix has any special binary vs text indicators in the filesystem. (Well, VMS did, but I doubt that's the source of your grep habits.) I looked at the implementation of Grep -I, and apparently it's just a quick-n-dirty heuristic based on the first chunk of the file. Turns out that's a strategy I have a bit of experience with. So here's my advice on choosing a heuristic function that is appropriate for Windows text files:

  • Examine at least 1KB of the file. Lots of file formats begin with a header that looks like text but will bust your parser shortly afterward. The way modern hardware works, reading 50 bytes has roughly the same I/O overhead as reading 4KB.
  • If you only care about straight ASCII, exit as soon as you see something outside the character range [31-127 plus CR and LF]. You might accidentally exclude some clever ASCII art, but trying to separate those cases from binary junk is nontrivial.
  • If you want to handle Unicode text, let the MS libraries handle the dirty work. It's harder than you think. From PowerShell you can easily access the IMultiLang2 interface (COM) or the Encoding.GetEncoding static method (.NET). Of course, they are still just guessing. Raymond's comments on the Notepad detection algorithm (and the link within to Michael Kaplan) are worth reviewing before deciding exactly how you want to mix & match the platform-provided libraries. (A minimal BOM-sniffing sketch follows this list.)
  • If the outcome is important -- i.e. a flaw will do something worse than just clutter up your grep console -- then don't be afraid to hard-code some file extensions for the sake of accuracy. For example, *.PDF files occasionally have several KB of text at the front despite being a binary format, leading to the notorious bugs linked above. Similarly, if you have a file extension that is likely to contain XML or XML-like data, you might try a detection scheme similar to Visual Studio's HTML editor. (SourceSafe 2005 actually borrows this algorithm for some cases.)
  • Whatever else happens, have a reasonable backup plan.
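As a small, hedged piece of that Unicode puzzle, here is what the BOM sniffing mentioned in the list above could look like. Get-BomEncoding is an invented name, it only recognizes the common UTF-8/UTF-16/UTF-32 marks, and a file without a BOM tells you nothing.

# Minimal BOM sniffer: returns an encoding name, or $null if no BOM is present.
function Get-BomEncoding([System.IO.FileInfo]$item)
{
    $stream = $item.OpenRead()
    try
    {
        $bom = new-object byte[] 4
        $numRead = $stream.Read($bom, 0, 4)

        if ($numRead -ge 3 -and $bom[0] -eq 0xEF -and $bom[1] -eq 0xBB -and $bom[2] -eq 0xBF) { return 'utf-8' }
        if ($numRead -ge 4 -and $bom[0] -eq 0xFF -and $bom[1] -eq 0xFE -and $bom[2] -eq 0 -and $bom[3] -eq 0) { return 'utf-32LE' }
        if ($numRead -ge 2 -and $bom[0] -eq 0xFF -and $bom[1] -eq 0xFE) { return 'utf-16LE' }
        if ($numRead -ge 2 -and $bom[0] -eq 0xFE -and $bom[1] -eq 0xFF) { return 'utf-16BE' }
        if ($numRead -ge 4 -and $bom[0] -eq 0 -and $bom[1] -eq 0 -and $bom[2] -eq 0xFE -and $bom[3] -eq 0xFF) { return 'utf-32BE' }
        return $null    # no BOM: fall back to heuristics or the platform detectors
    }
    finally
    {
        $stream.Dispose()
    }
}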

As an example, here's the quick ASCII detector:

function IsAscii([System.IO.FileInfo]$item)
{
    begin
    {
        # bytes accepted as "text": LF, CR, and the 31-127 range described above
        $validList = new-object System.Collections.Generic.List[byte]
        $validList.AddRange([byte[]] (10,13) )
        $validList.AddRange([byte[]] (31..127) )
    }

    process
    {
        try
        {
            # sample the first 1KB and reject on the first byte outside the valid list
            $reader = $item.Open([System.IO.FileMode]::Open)
            $bytes = new-object byte[] 1024
            $numRead = $reader.Read($bytes, 0, $bytes.Count)

            for($i=0; $i -lt $numRead; ++$i)
            {
                if (!$validList.Contains($bytes[$i]))
                    { return $false }
            }
            $true
        }
        finally
        {
            if ($reader)
                { $reader.Dispose() }
        }
    }
}

The usage pattern I'm targeting is a where-object clause inserted in the pipeline between "dir" and "ss". There are other ways, depending on your scripting style.
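Concretely, that usage pattern might look like the line below. The search term foo is a placeholder and "ss" again stands for Select-String; skipping directories first avoids handing the detector something it cannot open.

# keep only items the quick detector considers ASCII text, then search them
dir -r | ? { !$_.PSIsContainer -and (IsAscii $_) } | ss foo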

Improving the detection algorithm along one of the suggested paths is left to the reader.

edit: I started replying to your comment in a comment of my own, but it got too long...

Above, I looked at the problem from the POV of whitelisting known-good sequences. In the application I maintained, incorrectly storing a binary as text had far worse consequences than vice versa. The same is true for scenarios where you are choosing which FTP transfer mode to use, or what kind of MIME encoding to send to an email server, etc.

In other scenarios, blacklisting the obviously bogus and allowing everything else to be called text is an equally valid technique. While U+0000 is a valid code point, it's pretty much never found in real world text. Meanwhile, \00 is quite common in structured binary files (namely, whenever a fixed-byte-length field needs padding), so it makes a great simple blacklist. VSS 6.0 used this check alone and did ok.

Aside: *.zip files are a case where checking for \0 is riskier. Unlike most binaries, their structured "header" (footer?) block is at the end, not the beginning. Assuming ideal entropy compression, the chance of no \0 in the first 1KB is (1-1/256)^1024 or about 2%. Luckily, simply scanning the rest of the 4KB cluster NTFS read will drive the risk down to 0.00001% without having to change the algorithm or write another special case.
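As a quick sanity check on those percentages (my arithmetic, not part of the original answer), the probability of seeing no NUL in N independent, uniformly random bytes is (255/256)^N:

[math]::Pow(255/256, 1024)   # ~0.018   -> about 2% for the first 1KB
[math]::Pow(255/256, 4096)   # ~1.1e-7  -> about 0.00001% for the full 4KB cluster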

To exclude invalid UTF-8, add \C0-C1 and \F8-FD and \FE-FF (once you've skipped past the possible BOM) to the blacklist. Very incomplete since you're not actually validating the sequences, but close enough for your purposes. If you want to get any fancier than this, it's time to call one of the platform libraries like IMultiLang2::DetectInputCodepage.
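The refinement is left to the reader, but for concreteness here is one hedged sketch of that blacklist variant. Test-LooksBinary is an invented name, the 4KB read size is an arbitrary choice, and BOM-less UTF-16 text will still be misclassified as binary.

# Blacklist heuristic sketch: flag NUL plus lead bytes that never occur in valid UTF-8.
function Test-LooksBinary([System.IO.FileInfo]$item)
{
    $stream = $item.OpenRead()
    try
    {
        $buffer = new-object byte[] 4096
        $numRead = $stream.Read($buffer, 0, $buffer.Length)

        # A UTF-16/32 BOM starts with blacklisted bytes, so treat a file that opens with one as text and stop early.
        if ($numRead -ge 2 -and (($buffer[0] -eq 0xFF -and $buffer[1] -eq 0xFE) -or ($buffer[0] -eq 0xFE -and $buffer[1] -eq 0xFF))) { return $false }

        for ($i = 0; $i -lt $numRead; ++$i)
        {
            $b = $buffer[$i]
            # NUL, the overlong leads C0-C1, and F8-FF cannot appear in valid UTF-8 text
            if ($b -eq 0 -or ($b -ge 0xC0 -and $b -le 0xC1) -or $b -ge 0xF8) { return $true }
        }
        return $false
    }
    finally
    {
        $stream.Dispose()
    }
}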

Not sure why \C8 (200 decimal) is on Grep's list. It's not an overlong encoding. For example, the sequence \C8 \80 represents Ȁ (U+0200). Maybe something specific to Unix.
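(For what it's worth, that decoding is easy to confirm from PowerShell; this one-liner is just a verification aid, not part of the original answer.)

[System.Text.Encoding]::UTF8.GetString([byte[]](0xC8, 0x80))   # -> Ȁ (U+0200), a valid two-byte sequence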
