Grep thinks text file is binary, but it isn't


Problem description

I came across a .cpp file in our codebase that is seen as binary by grep. So I can't grep it like a text file, which is annoying and obviously not how things ought to be. So I want to know why grep thinks the file is binary and address the issue.

I tried to find any characters out of the ordinary using the command

grep -Pna --color -r "[\x00-\x08]|[\x10-\x19]|[\x80-\xFF]" test.cpp

but it doesn't yield any matches.

How can I figure out the cause of this problem?

I should mention that I'm using Git Bash on Windows.

Output of locale:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Solution

Since you’re using MS Windows, it’s possible that the test.cpp file is encoded using either UTF-16 (common in recent versions of Windows) or Windows-1252 (CP-1252) as its character encoding (perhaps a typographic quote in one of the comments).

When your locale is set to UTF-8 and grep detects bytes that are not valid characters in that locale, it assumes that the file is binary. A quick way around this issue is to make grep use the C locale by temporarily setting the LC_ALL environment variable for the duration of the grep command:

LC_ALL=C grep pattern test.cpp
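The same C-locale trick can also help locate the offending bytes. The command in the question probably found nothing because, in a UTF-8 locale, a range such as \x80-\xFF is matched against characters rather than raw bytes, so an invalid byte never matches. The following is a minimal sketch, not a definitive diagnosis: sample.cpp and nonascii_lines.txt are hypothetical names, the sample file stands in for test.cpp, and -P assumes a GNU grep built with PCRE support (as the question's own command does).

```shell
# Hypothetical stand-in for test.cpp: line 2 contains byte 0x92,
# which is a typographic apostrophe in Windows-1252 but invalid in UTF-8.
printf 'int main() {\n  // don\222t remove\n  return 0;\n}\n' > sample.cpp

# In the C locale, patterns are matched byte by byte, so a byte range works.
# -a forces text mode, -n prints line numbers, -P enables \xNN escapes.
LC_ALL=C grep -a -n -P '[\x80-\xFF]' sample.cpp > nonascii_lines.txt

# Print only the line numbers, so no raw non-UTF-8 bytes reach the terminal.
cut -d: -f1 nonascii_lines.txt
```

Any line number printed here points at a line containing at least one non-ASCII byte, which is a good place to start looking in an editor or hex viewer.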

A better long-term solution would be to convert the text files (using iconv or your favourite text editor) to use UTF-8 as their character encoding.
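A conversion with iconv might look like the sketch below. It assumes the source encoding is Windows-1252 (CP1252); if the file turns out to be UTF-16, the -f argument would need to be UTF-16 instead. The file names sample_cp1252.cpp and sample_utf8.cpp are hypothetical, and the sample file is generated only so that the example is self-contained.

```shell
# Hypothetical stand-in for test.cpp: contains byte 0x92,
# the Windows-1252 encoding of a typographic apostrophe.
printf 'int x; // it\222s a comment\n' > sample_cp1252.cpp

# Convert from Windows-1252 to UTF-8, writing to a new file
# rather than overwriting the original in place.
iconv -f CP1252 -t UTF-8 sample_cp1252.cpp > sample_utf8.cpp

# The converted file can now be searched without changing the locale.
grep -c 'comment' sample_utf8.cpp
```

Writing to a separate file first makes it possible to inspect the result before replacing the original, which matters if the guess about the source encoding was wrong.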
