PowerShell search script that ignores binary files

Question
I am really used to doing grep -iIr
on the Unix shell but I haven't been able to get a PowerShell equivalent yet.
Basically, the above command searches the target folders recursively and ignores binary files because of the "-I" option. This option is also equivalent to the --binary-files=without-match
option, which says "treat binary files as not matching the search string"
So far I have been using Get-ChildItem -r | Select-String
as my PowerShell grep replacement, with the occasional Where-Object
added. But I haven't figured out a way to ignore all binary files the way the grep -I
command does.
How can binary files be filtered or ignored with PowerShell?
So for a given path, I only want Select-String
to search text files.
EDIT: A few more hours on Google produced this question: "How to identify whether the contents of a file are ASCII or binary". The question says "ASCII", but I believe the writer meant "text-encoded", as I do.
EDIT: It seems that an isBinary()
function needs to be written to solve this issue. Probably a C# command-line utility, to make it more broadly useful.
EDIT: It seems that what grep
is doing is checking for an ASCII NUL byte or a UTF-8 overlong encoding. If either exists, it considers the file binary. This is a single memchr() call.
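That NUL-byte check is easy to approximate in PowerShell. The following is only a sketch of the idea, not grep's actual implementation; the 1 KB sample size and the ContainsNul name are my own choices:

```powershell
# Treat a file as binary if its first 1KB contains a NUL byte,
# roughly mimicking grep's memchr(buf, '\0', n) heuristic.
function ContainsNul([System.IO.FileInfo]$item)
{
    $stream = $item.OpenRead()
    try
    {
        $bytes = New-Object byte[] 1024
        $numRead = $stream.Read($bytes, 0, $bytes.Count)
        for ($i = 0; $i -lt $numRead; ++$i)
        {
            if ($bytes[$i] -eq 0) { return $true }
        }
        return $false
    }
    finally
    {
        $stream.Dispose()
    }
}

# Usage: recursive search that skips directories and NUL-containing files
dir -r | ? { !$_.PSIsContainer -and !(ContainsNul $_) } | Select-String foo
```

Like grep -I, this only samples the start of the file, so it can be fooled by binary formats that open with a text-like header.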
On Windows, file extensions are usually good enough:
# all C# and related files (projects, source control metadata, etc)
dir -r -fil *.cs* | ss foo
# exclude the binary types most likely to pollute your development workspace
dir -r -exclude *.exe, *.dll, *.pdb | ss foo
# stick the first three lines in your $profile (refining them over time)
# note: FileInfo.Extension includes the leading dot
$bins = New-Object System.Collections.Generic.List[string]
$bins.AddRange( [string[]]@(".exe", ".dll", ".pdb", ".png", ".mdf", ".docx") )
function IsBin([System.IO.FileInfo]$item) { $bins.Contains($item.Extension.ToLower()) }
dir -r | ? { !(IsBin $_) } | ss foo
But of course, file extensions are not perfect. Nobody likes typing long lists, and plenty of files are misnamed anyway.
I don't think Unix has any special binary vs text indicators in the filesystem. (Well, VMS did, but I doubt that's the source of your grep habits.) I looked at the implementation of Grep -I, and apparently it's just a quick-n-dirty heuristic based on the first chunk of the file. Turns out that's a strategy I have a bit of experience with. So here's my advice on choosing a heuristic function that is appropriate for Windows text files:
- Examine at least 1KB of the file. Lots of file formats begin with a header that looks like text but will bust your parser shortly afterward. The way modern hardware works, reading 50 bytes has roughly the same I/O overhead as reading 4KB.
- If you only care about straight ASCII, exit as soon as you see something outside the character range [31-127 plus CR and LF]. You might accidentally exclude some clever ASCII art, but trying to separate those cases from binary junk is nontrivial.
- If you want to handle Unicode text, let MS libraries handle the dirty work. It's harder than you think. From Powershell you can easily access the IMultiLang2 interface (COM) or Encoding.GetEncoding static method (.NET). Of course, they are still just guessing. Raymond's comments on the Notepad detection algorithm (and the link within to Michael Kaplan) are worth reviewing before deciding exactly how you want to mix & match the platform-provided libraries.
- If the outcome is important -- i.e., a flaw will do something worse than just clutter up your grep console -- then don't be afraid to hard-code some file extensions for the sake of accuracy. For example, *.PDF files occasionally have several KB of text at the front despite being a binary format, leading to the notorious bugs linked above. Similarly, if you have a file extension that is likely to contain XML or XML-like data, you might try a detection scheme similar to Visual Studio's HTML editor. (SourceSafe 2005 actually borrows this algorithm for some cases.)
- Whatever else happens, have a reasonable backup plan.
As an example, here's the quick ASCII detector:
function IsAscii([System.IO.FileInfo]$item)
{
    begin
    {
        # Bytes considered "text": LF, CR, and the 31-127 range from above
        $validList = New-Object System.Collections.Generic.List[byte]
        $validList.AddRange( [byte[]](10,13) )
        $validList.AddRange( [byte[]](31..127) )
    }
    process
    {
        try
        {
            $reader = $item.Open([System.IO.FileMode]::Open)
            $bytes = New-Object byte[] 1024
            $numRead = $reader.Read($bytes, 0, $bytes.Count)
            for ($i = 0; $i -lt $numRead; ++$i)
            {
                if (!$validList.Contains($bytes[$i]))
                { return $false }
            }
            $true
        }
        finally
        {
            if ($reader)
            { $reader.Dispose() }
        }
    }
}
The usage pattern I'm targeting is a Where-Object clause inserted in the pipeline between "dir" and "ss" (ss being an alias for Select-String). There are other ways, depending on your scripting style.
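Concretely, that looks like the following sketch. Select-String is spelled out here instead of an alias, and the PSIsContainer test simply skips directories, which IsAscii cannot open:

```powershell
# Search only files that pass the IsAscii check defined above
dir -r | ? { !$_.PSIsContainer -and (IsAscii $_) } | Select-String foo
```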
Improving the detection algorithm along one of the suggested paths is left to the reader.
EDIT: I started replying to your comment in a comment of my own, but it got too long...
Above, I looked at the problem from the POV of whitelisting known-good sequences. In the application I maintained, incorrectly storing a binary as text had far worse consequences than vice versa. The same is true for scenarios where you are choosing which FTP transfer mode to use, or what kind of MIME encoding to send to an email server, etc.
In other scenarios, blacklisting the obviously bogus and allowing everything else to be called text is an equally valid technique. While U+0000 is a valid code point, it's pretty much never found in real world text. Meanwhile, \00 is quite common in structured binary files (namely, whenever a fixed-byte-length field needs padding), so it makes a great simple blacklist. VSS 6.0 used this check alone and did ok.
Aside: *.zip files are a case where checking for \0 is riskier. Unlike most binaries, their structured "header" (footer?) block is at the end, not the beginning. Assuming ideal entropy compression, the chance of no \0 in the first 1KB is (1-1/256)^1024 or about 2%. Luckily, simply scanning the rest of the 4KB cluster NTFS read will drive the risk down to 0.00001% without having to change the algorithm or write another special case.
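That probability is easy to verify from the shell, if you want to check the arithmetic:

```powershell
# Probability that 1024 uniformly random bytes contain no \0
[math]::Pow(255/256, 1024)   # ≈ 0.018, i.e. about 2%
```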
To exclude invalid UTF-8, add \C0-C1 and \F8-FD and \FE-FF (once you've seeked past the possible BOM) to the blacklist. Very incomplete since you're not actually validating the sequences, but close enough for your purposes. If you want to get any fancier than this, it's time to call one of the platform libraries like IMultiLang2::DetectInputCodepage.
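A blacklist variant along those lines might look like the following sketch. The byte ranges come from the discussion above; skipping only a UTF-8 BOM and sampling only 1 KB are simplifying assumptions, and as noted it does not actually validate multi-byte sequences:

```powershell
# Blacklist heuristic: call a file binary if its first 1KB contains NUL
# or a byte that can never appear in valid UTF-8
# (\xC0-\xC1 overlong lead bytes, \xF8-\xFD, \xFE-\xFF).
function IsProbablyText([System.IO.FileInfo]$item)
{
    $stream = $item.OpenRead()
    try
    {
        $bytes = New-Object byte[] 1024
        $numRead = $stream.Read($bytes, 0, $bytes.Count)
        $start = 0
        # Skip a UTF-8 BOM (EF BB BF) if present
        if ($numRead -ge 3 -and $bytes[0] -eq 0xEF -and
            $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
        { $start = 3 }
        for ($i = $start; $i -lt $numRead; ++$i)
        {
            $b = $bytes[$i]
            if ($b -eq 0 -or ($b -ge 0xC0 -and $b -le 0xC1) -or $b -ge 0xF8)
            { return $false }
        }
        return $true
    }
    finally
    {
        $stream.Dispose()
    }
}
```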
Not sure why \C8 (200 decimal) is on Grep's list. It's not an overlong encoding. For example, the sequence \C8 \80 represents Ȁ (U+0200). Maybe something specific to Unix.