Grep - list files that start with regex binary byte sequence?

Question

I want to list files that start with a certain byte sequence. My attempts all fail in the same way:

grep -Rl $'\A\xff\xd8' .
grep -Rl \A$'\xff\xd8' .
grep -RlP "\A\xff\xd8" .

A test file starting with ff d8 is not found, while 3 other files that have the byte sequence elsewhere in the file are found. I confirmed the first few bytes of my test file with hexdump -C.

00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 01  |......JFIF......|

I found multiple "almost" answers. I've explored hexdump, but I prefer the speed of grepping directly over a lot of piping and looping through recursive filenames, with wrap-around text exceptions. A prior question from 2-1/2 years ago, "File carving with Bash can't find hex values FFD8 or FFD9 with grep", is very close, but LC_ALL=C doesn't change the behavior. Neither does playing with -a and -b.

What is the right way to do this? I'm using GNU grep 3.1.

/// Further study makes me think grep may have a problem. The session below shows that the 2-byte sequence is not found when it's not at the beginning, and that it IS found when it IS at the beginning. On a real jpg file, the match is also found when it is at the beginning. So far, so good.

dell@DELL-E6440:~$ echo $'\xffThis is a short test file I\xff\xd8 made' > junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  ff 54 68 69 73 20 69 73  20 61 20 73 68 6f 72 74  |.This is a short|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
dell@DELL-E6440:~$ echo $'\xff\xd8This is a short test file I\xff\xd8 made' > junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  ff d8 54 68 69 73 20 69  73 20 61 20 73 68 6f 72  |..This is a shor|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
junk.txt
dell@DELL-E6440:~$ hexdump -C avoid-powered.jpg | head -n1
00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 00 00 01  |......JFIF......|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" avoid-powered.jpg
avoid-powered.jpg
dell@DELL-E6440:~$ 

So why is it matched in a larger file when it's NOT at the beginning? First, I show that a file that does not start with the 2-byte sequence is matched. Then I keep only the beginning of that real file, and the 2-byte sequence is correctly not found.

dell@DELL-E6440:~$ cp 130913-SEMSA.pdf junk.txt
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  25 50 44 46 2d 31 2e 34  0a 31 20 30 20 6f 62 6a  |%PDF-1.4.1 0 obj|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
junk.txt
dell@DELL-E6440:~$ dd if=130913-SEMSA.pdf bs=10 count=1 of=junk.txt
1+0 records in
1+0 records out
10 bytes copied, 0.0062894 s, 1.6 kB/s
dell@DELL-E6440:~$ hexdump -C junk.txt | head -n1
00000000  25 50 44 46 2d 31 2e 34  0a 31                    |%PDF-1.4.1|
dell@DELL-E6440:~$ LC_ALL=C grep -lP "\A\xff\xd8" junk.txt
dell@DELL-E6440:~$

What can possibly be in the full-size file that makes a false match? With the \A anchor, grep should be looking only at the first 2 bytes of the file.

Responding to dash-o's answer...

I considered the grep v3.3 manual, https://www.gnu.org/software/grep/manual/grep.html, which says:

-P Interpret patterns as Perl-compatible regular expressions (PCREs)

and a Perl regex guide, https://www.tutorialspoint.com/perl/perl_regular_expressions.htm, which says:

\A Matches beginning of string.

Also, the \A idea works as it's supposed to for printable byte sequences, and no documentation makes an exception for certain byte values or suggests that "line oriented" should negate the idea. The file utility is pretty cool for identifying file types, but I see no easy way to recurse directories and get a path/filename printed out, one per line, if and only if the file has an arbitrary leading byte sequence. Lastly, I'm sort of a bash guy. Yes, I need to go learn more Perl and Python, but I'd sure like the universal bash/grep combo to work as documented.

Recommended answer

According to the grep manual, there is no support for '\A' anchoring, only for '^' and '$':

3.4 Anchoring
=============
The caret ‘^’ and the dollar sign ‘$’ are meta-characters that
respectively match the empty string at the beginning and end of a line.
They are termed "anchors", since they force the match to be "anchored"
to beginning or end of a line, respectively.

Also, recall that grep is a line-oriented search utility. It has a few options for handling binary files (--binary-files=binary, text, without-match), but none of them changes the nature of the search: it will still look for the regexp within lines. That would explain the false match on the full-size PDF: an ff d8 pair at the start of any line in the file, not only the first, is enough.
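The line-oriented behavior can be reproduced with a tiny file. This is a minimal sketch, not part of the original answer: the temporary file and the helper name count_line_start_matches are made up for illustration, and it assumes a grep that compares raw bytes under LC_ALL=C (as GNU grep does). The file starts with %PDF, yet the ^ anchor still matches because ff d8 opens its second line.

```shell
# Hypothetical demo file: first line is a PDF header, second line starts with ff d8.
demo=$(mktemp)
printf '%%PDF-1.4\n\377\330 embedded data\n' > "$demo"

# Count lines that begin with the bytes ff d8.
# -a makes grep treat binary data as text; LC_ALL=C makes it compare raw bytes.
count_line_start_matches() {
    pattern=$(printf '^\377\330')   # '^' followed by the two raw bytes
    LC_ALL=C grep -a -c "$pattern" "$1"
}

count_line_start_matches "$demo"
```

The count is non-zero even though the file does not begin with ff d8, which is the same kind of false positive seen with the PDF in the question.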

Two alternatives to consider:

  1. If you are searching by file type (JPEG, PDF), consider using the file utility. It uses the "magic" database to examine the file content and determine the file type. It covers JPEG, PDF and many more types.
  2. Use another utility (sed, perl) that allows more control over location (e.g., you can limit the search to the first line of the file). You will need to spend more effort setting up those filters. Personally, I would go with Perl if you take this route.
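For readers who would rather stay in the shell, the check can also be done without grep by reading only the leading bytes of each file. The following is a rough sketch under stated assumptions, not something from the answer above: the function names starts_with_bytes and list_files_with_prefix are hypothetical, the prefix is passed as lowercase hex, and filenames are assumed not to contain newlines.

```shell
# starts_with_bytes FILE HEX
# Succeeds if FILE begins with the bytes written as lowercase hex in HEX (e.g. ffd8).
starts_with_bytes() {
    file=$1
    want=$2
    nbytes=$(( ${#want} / 2 ))
    # od -An -tx1 prints the bytes as hex without offsets; tr strips the whitespace.
    have=$(head -c "$nbytes" "$file" | od -An -tx1 | tr -d ' \n')
    [ "$have" = "$want" ]
}

# list_files_with_prefix DIR HEX
# Prints, one per line, every regular file under DIR that starts with HEX.
list_files_with_prefix() {
    find "$1" -type f | while IFS= read -r f; do
        if starts_with_bytes "$f" "$2"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling list_files_with_prefix . ffd8 would print the files under the current directory whose first two bytes are ff d8. It starts a few processes per file, so it will likely be slower than a single grep run, but it inspects only the leading bytes and is not affected by line boundaries.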
