RegEx to parse or validate Base64 data

Question

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.

I have a Base64 decoder that cannot fully rely on the input data to follow the RFC specs. So the issues I face are things like Base64 data that may not be broken up into 76-character lines (the limit set by RFC 2045), or lines that may not end in CRLF: they may end in only a CR, only an LF, or neither.

So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.

Content-Transfer-Encoding: base64

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

OK, so parsing that is no problem, and the result is exactly what we would expect. In 99% of cases, any code that at least verifies that each char in the buffer is a valid Base64 char works perfectly. But the next example throws a wrench into the mix.
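
The per-character check described here can be sketched in a few lines of Python. This is only an illustration, not the decoder from the question: the names `BASE64_CHARS` and `all_base64_chars` are made up for the example, and the alphabet is assumed to be the standard one from RFC 4648 plus the `=` padding character.

```python
import base64
import string

# Standard Base64 alphabet plus the padding character (assumed, see lead-in).
BASE64_CHARS = frozenset(string.ascii_letters + string.digits + "+/=")

def all_base64_chars(buffer: str) -> bool:
    """Return True if every character in buffer belongs to the Base64 alphabet."""
    return all(ch in BASE64_CHARS for ch in buffer)

sample = "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu"
if all_base64_chars(sample):
    decoded = base64.b64decode(sample)
    print(decoded.decode("ascii"))
```

A check like this only looks at individual characters. It says nothing about padding position or total length, so it is a first filter rather than full validation.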

Content-Transfer-Encoding: base64

http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

This is a version of Base64 encoding that I have seen in some viruses and other things that try to take advantage of some mail readers' desire to parse MIME at all costs, versus ones that go strictly by the book, or rather by the RFC.

My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!

[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
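
The start of that stream can be reproduced with a lenient decoder that silently discards every character outside the Base64 alphabet. The sketch below assumes that is roughly what the decoder in question does; the actual implementation is not shown in the post. After stripping `:` and `.`, the URL contributes 25 alphabet characters, which shifts the real payload out of its 4-character alignment.

```python
import base64
import re

raw = (
    "http://www.stackoverflow.com\n"
    "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu\n"
)

# Lenient rule (assumed): drop everything outside the Base64 alphabet.
# ':' and '.' disappear, but 'http//wwwstackoverflowcom' survives as "data".
cleaned = re.sub(r"[^A-Za-z0-9+/]", "", raw)

# 97 characters remain; cut to a multiple of 4 so the standard decoder accepts it.
usable = cleaned[: len(cleaned) - len(cleaned) % 4]
garbage = base64.b64decode(usable)
print(garbage.hex().upper())
```

The printed hex begins with `86DB69FFFC30`, the same bytes as the stream above, which suggests that the URL line is being decoded as if it were payload.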

Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
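
One very small form of the ASCII heuristic mentioned here is to score each candidate decoding by how much of it is printable text. The function name `printable_ratio` and the choice of which bytes count as printable are assumptions made for this sketch. No threshold is implied, and binary attachments would score low even when they are perfectly valid.

```python
def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII or common whitespace."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 32 <= b <= 126 or b in (9, 10, 13))
    return printable / len(data)

# Decoded text of the first example, and the first bytes of the stream
# the decoder produced for the second example.
good = b"This is simple ASCII Base64 for StackOverflow example."
bad = bytes.fromhex("86DB69FFFC30C2CB5A724A2F")

print(printable_ratio(good), printable_ratio(bad))
```

A score like this can rank two decodings of the same input, but it cannot decide on its own which one to trust.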

Due to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But for anyone using C# and looking for a very quick way to at least detect whether a string or byte[] contains valid Base64 data, I've found the following to work very well for me.

[^-A-Za-z0-9+/=]|=[^=]|={3,}$
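
Note that this expression matches what is wrong with the input, so a hit means the data is not valid. The post uses it from C#; the sketch below applies the same pattern with Python's `re` module purely for illustration, and the names `INVALID_BASE64` and `looks_like_base64` are made up for the example.

```python
import re

# The expression above, verbatim. A match means something invalid was found:
# a character outside the allowed set, '=' followed by a non-'=' character,
# or three or more '=' at the end.
INVALID_BASE64 = re.compile(r"[^-A-Za-z0-9+/=]|=[^=]|={3,}$")

def looks_like_base64(text: str) -> bool:
    """Return True when the pattern finds nothing to object to."""
    return INVALID_BASE64.search(text) is None

print(looks_like_base64("QQ=="))                          # accepted
print(looks_like_base64("http://www.stackoverflow.com"))  # rejected (':' and '.')
```

As written, the pattern also lets `-` through and does not check that the length is a multiple of 4, so a string such as `abc` passes. It is a quick detection filter, not a strict validator.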

And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC 1341 message. So if you are dealing with data of that type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML encoding, etc.), then it is highly recommended that you read RFC 4648, which Gumbo mentions in his answer, as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.

Recommended answer

From RFC 4648:

Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.

So whether the data should be considered dangerous depends on what the encoded data is used for.

But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
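
As a rough illustration, here is the expression applied with Python's `re` module. The names `BASE64_WORD` and `is_base64_word` are made up for this sketch. It uses `fullmatch` so that a trailing newline is not accepted, which `$` alone would allow in Python.

```python
import re

# The expression above, verbatim: any number of 4-character groups,
# optionally followed by one padded group ("xx==" or "xxx=").
BASE64_WORD = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
)

def is_base64_word(text: str) -> bool:
    """Return True if the whole string is one well-formed Base64 word."""
    return BASE64_WORD.fullmatch(text) is not None

for candidate in ("QUJD", "QUI=", "QQ==", "QQ=", "http://www.stackoverflow.com"):
    print(candidate, is_base64_word(candidate))
```

Applied line by line to the second example in the question, this rejects the `http://www.stackoverflow.com` line and accepts the payload line. Note that the empty string also matches, so callers that need non-empty data should check for that separately.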
