Regular expression for floating point numbers

Question

I have a task to match floating point numbers. I have written the following regular expression for it:

[-+]?[0-9]*\.?[0-9]*

But it returns an error:

Invalid escape sequence (valid ones are  \b  \t  \n  \f  \r  \"  \'  \\ )

As far as I know, we need to use an escape character for the . as well. Please correct me where I am wrong.

Recommended answer

TL;DR

Use [.] instead of \. and [0-9] instead of \d to avoid escaping issues in some languages (like Java).

Thanks to the user named nameless for originally recognizing this.

One relatively simple pattern for matching a floating point number is

[+-]?([0-9]*[.])?[0-9]+

This will match:

  • 123
  • 123.456
  • .456

See a working example
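As a rough Java sketch of the simple pattern above (the class name SimpleFloatPattern and the helper isFloat are made up for illustration, not part of the original answer), the whole-input check could look something like this:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
final class SimpleFloatPattern {
    // [.] and [0-9] need no extra escaping inside a Java string literal.
    static final Pattern FLOAT = Pattern.compile("[+-]?([0-9]*[.])?[0-9]+");

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isFloat(String input) {
        return FLOAT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"123", "123.456", ".456", "-7.5", "123.", "abc"};
        for (String s : samples) {
            System.out.println(s + " -> " + isFloat(s));
        }
    }
}
```

Note that "123." is rejected here, because this pattern requires at least one digit after the period.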

If you also want to match 123. (a period with no decimal part), then you'll need a slightly longer expression:

[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)

See pkeller's answer for a fuller explanation of this pattern
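To see how the longer expression differs, here is a small Java sketch (the class TrailingDotPattern and its method are hypothetical names chosen for this example). It accepts a trailing period but still rejects a lone period:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the longer pattern from the answer.
final class TrailingDotPattern {
    // Either digits with an optional ".digits*" tail, or "." followed by digits.
    static final Pattern FLOAT =
            Pattern.compile("[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)");

    static boolean isFloat(String input) {
        return FLOAT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"123", "123.", "123.456", ".456", "+.5", "."};
        for (String s : samples) {
            System.out.println(s + " -> " + isFloat(s));
        }
    }
}
```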

If you want to include non-decimal numbers, such as hex and octal, see my answer to How do I identify if a string is a number?.

If you want to validate that an input is a number (rather than finding a number within the input), then you should surround the pattern with ^ and $, like so:

^[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)$

Irregular Regular Expressions

"Regular expressions", as implemented in most modern languages, APIs, frameworks, libraries, etc., are based on a concept developed in formal language theory. However, software engineers have added many extensions that take these implementations far beyond the formal definition. So, while most regular expression engines resemble one another, there is actually no standard. For this reason, a lot depends on what language, API, framework or library you are using.

(Incidentally, to help reduce confusion, many have taken to using "regex" or "regexp" to describe these enhanced matching languages. See Is a Regex the Same as a Regular Expression? at RexEgg.com for more information.)

That said, most regex engines (actually, all of them, as far as I know) would accept \. as written. Most likely, there's an issue with escaping.

Some languages have built-in support for regexes, such as JavaScript. For those languages that don't, escaping can be a problem.

This is because you are basically coding in a language within a language. Java, for example, uses \ as an escape character within its strings, so if you want to place a literal backslash character within a string, you must escape it:

// creates a single character string: "\"
String x = "\\";

However, regexes also use the \ character for escaping, so if you want to match a literal \ character, you must escape it for the regex engine, and then escape it again for Java:

// Creates a two-character string: "\\"
// When used as a regex pattern, will match a single character: "\"
String regexPattern = "\\\\";
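The two layers of escaping can be checked directly. The following is only an illustrative sketch (the class BackslashEscapeDemo and its field names are invented for this example): four backslashes in Java source become a two-character string, which the regex engine then reads as one literal backslash.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing Java-level and regex-level escaping.
final class BackslashEscapeDemo {
    // Four backslashes in source: a two-character string at run time.
    static final String REGEX = "\\\\";
    // Two backslashes in source: a one-character string at run time.
    static final String SINGLE_BACKSLASH = "\\";

    // The two-character regex matches the one-character input.
    static boolean regexMatchesSingleBackslash() {
        return Pattern.matches(REGEX, SINGLE_BACKSLASH);
    }

    public static void main(String[] args) {
        System.out.println("regex length: " + REGEX.length());
        System.out.println("input length: " + SINGLE_BACKSLASH.length());
        System.out.println("matches: " + regexMatchesSingleBackslash());
    }
}
```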

In your case, you have probably not escaped the backslash character in the language you are programming in:

// will most likely result in an "Illegal escape character" error
String wrongPattern = "\.";
// will result in the string "\."
String correctPattern = "\\.";

All this escaping can get very confusing. If the language you are working with supports raw strings, then you should use those to cut down on the number of backslashes, but not all languages do (most notably: Java). Fortunately, there's an alternative that will work some of the time:

String correctPattern = "[.]";

For a regex engine, \. and [.] mean exactly the same thing. Note that this doesn't work in every case, like newline (\\n), open square bracket (\\[) and backslash (\\\\ or [\\]).
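A quick way to convince yourself that the two spellings agree is to run both against the same inputs. This is a minimal sketch with invented names (DotEscapeDemo, escapedDot, bracketedDot, bareDot); the unescaped dot is included only for contrast:

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing "\\." with "[.]" in Java.
final class DotEscapeDemo {
    // Backslash-escaped dot: needs a doubled backslash in Java source.
    static boolean escapedDot(String input) {
        return Pattern.matches("\\.", input);
    }

    // Character-class dot: no backslash needed at all.
    static boolean bracketedDot(String input) {
        return Pattern.matches("[.]", input);
    }

    // Unescaped dot: a wildcard that matches almost any single character.
    static boolean bareDot(String input) {
        return Pattern.matches(".", input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {".", "a"}) {
            System.out.println(s + " -> escaped=" + escapedDot(s)
                    + " bracketed=" + bracketedDot(s)
                    + " bare=" + bareDot(s));
        }
    }
}
```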

(Hint: It's harder than you think)

Matching a number is one of those things you'd think is quite easy with regex, but it's actually pretty tricky. Let's take a look at your approach, piece by piece:

[-+]?

Matches an optional - or +

[0-9]*

Matches 0 or more sequential digits

\.?

Matches an optional .

[0-9]*

Matches 0 or more sequential digits

First, we can clean up this expression a bit by using a character class shorthand for the digits (note that this is also susceptible to the escaping issue mentioned above):

[0-9] = \d

I'm going to use \d below, but keep in mind that it means the same thing as [0-9]. (Well, actually, in some engines \d will match digits from all scripts, so it'll match more than [0-9] will, but that's probably not significant in your case.)
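Java happens to be an engine where this difference is opt-in: by default \d is limited to ASCII digits, and the Pattern.UNICODE_CHARACTER_CLASS flag widens it. The sketch below (class and method names are made up for illustration) uses two Arabic-Indic digits as the test input:

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting ASCII-only and Unicode-aware \d in Java.
final class UnicodeDigitDemo {
    // Default behaviour in Java: \d is equivalent to [0-9].
    static final Pattern ASCII_DIGITS = Pattern.compile("\\d+");
    // With this flag, \d matches decimal digits from any script.
    static final Pattern UNICODE_DIGITS =
            Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS);

    // U+0663 and U+0664 are the Arabic-Indic digits three and four.
    static final String ARABIC_INDIC = "\u0663\u0664";

    static boolean asciiMatches(String input) {
        return ASCII_DIGITS.matcher(input).matches();
    }

    static boolean unicodeMatches(String input) {
        return UNICODE_DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("ASCII \\d:   " + asciiMatches(ARABIC_INDIC));
        System.out.println("Unicode \\d: " + unicodeMatches(ARABIC_INDIC));
    }
}
```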

Now, if you look at this carefully, you'll realize that every single part of your pattern is optional. This pattern can match a 0-length string; a string composed only of + or -; or a string composed only of a single period (.). This is probably not what you intended.
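The degenerate matches are easy to reproduce. Here is a small sketch in Java (AllOptionalDemo is a hypothetical name) that runs the question's pattern, with the backslash doubled for the Java string literal, against those inputs:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one from the question.
final class AllOptionalDemo {
    // Every element is optional, so the empty string is a valid match.
    static final Pattern ORIGINAL = Pattern.compile("[-+]?[0-9]*\\.?[0-9]*");

    static boolean matches(String input) {
        return ORIGINAL.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "+", "-", ".", "1.5"}) {
            System.out.println("\"" + s + "\" -> " + matches(s));
        }
    }
}
```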

To fix this, it's helpful to start by "anchoring" your regex with the bare-minimum required string, probably a single digit:

\d+

Now we want to add the decimal part, but it doesn't go where you think it might:

\d+\.?\d* /* This isn't quite correct. */

This will still match values like 123. (digits followed by a bare period). Worse, it's got a tinge of evil about it. The period is optional, meaning that you've got two repeated classes side-by-side (\d+ and \d*). This can actually be dangerous if used in just the wrong way, opening your system up to DoS attacks.

To fix this, rather than treating the period as optional, we need to treat it as required (to separate the repeated character classes) and instead make the entire decimal portion optional:

\d+(\.\d+)? /* Better. But... */

This is looking better now. We require a period between the first sequence of digits and the second, but there's a fatal flaw: we can't match .123 because a leading digit is now required.
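The flaw shows up immediately when the intermediate pattern is tried in Java. This is just a sketch with an invented class name (RequiredLeadingDigitDemo):

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the intermediate pattern \d+(\.\d+)?
final class RequiredLeadingDigitDemo {
    // Backslashes are doubled because this is a Java string literal.
    static final Pattern BETTER = Pattern.compile("\\d+(\\.\\d+)?");

    static boolean matches(String input) {
        return BETTER.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123", "123.456", "123.", ".123"}) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

It rejects 123. as intended, but it also rejects .123.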

This is actually pretty easy to fix. Instead of making the "decimal" portion of the number optional, we need to look at it as a sequence of characters: 1 or more digits that may be prefixed by a . that may in turn be prefixed by 0 or more digits:

(\d*\.)?\d+

Now we just add the sign:

[+-]?(\d*\.)?\d+

Of course, those backslashes are pretty annoying in Java, so we can substitute in our long-form character classes:

[+-]?([0-9]*[.])?[0-9]+

Matching versus Validating

This has come up in the comments a couple times, so I'm adding an addendum on matching versus validating.

The goal of matching is to find some content within the input (the "needle in a haystack"). The goal of validating is to ensure that the input is in an expected format.

Regexes, by their nature, only match text. Given some input, they will either find some matching text or they will not. However, by "snapping" an expression to the beginning and ending of the input with anchors (^ and $), we can ensure that no match is found unless the entire input matches the expression, effectively using regexes to validate.

The regex described above ([+-]?([0-9]*[.])?[0-9]+) will match one or more numbers within a target string. So given the input:

apple 1.34 pear 7.98 version 1.2.3.4

The regex will match 1.34, 7.98, 1.2, .3 and .4.
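In Java, this kind of searching is done with Matcher.find() in a loop. The sketch below (FindFloatsDemo and findAll are names invented for this example) collects every match from the sample input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that finds numbers inside a larger string.
final class FindFloatsDemo {
    static final Pattern FLOAT = Pattern.compile("[+-]?([0-9]*[.])?[0-9]+");

    // find() scans forward for the next match; group() returns its text.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = FLOAT.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [1.34, 7.98, 1.2, .3, .4]
        System.out.println(findAll("apple 1.34 pear 7.98 version 1.2.3.4"));
    }
}
```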

To validate that a given input is a number and nothing but a number, "snap" the expression to the start and end of the input by wrapping it in anchors:

^[+-]?([0-9]*[.])?[0-9]+$

This will only find a match if the entire input is a floating point number, and will not find a match if the input contains additional characters. So, given the input 1.2, a match will be found, but given apple 1.2 pear no matches will be found.
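The difference between the anchored and unanchored forms can be sketched in Java as follows (AnchoredValidationDemo and its method names are hypothetical). Both versions use find(), so the only thing that changes is the presence of ^ and $:

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing searching with validating.
final class AnchoredValidationDemo {
    static final Pattern UNANCHORED =
            Pattern.compile("[+-]?([0-9]*[.])?[0-9]+");
    static final Pattern ANCHORED =
            Pattern.compile("^[+-]?([0-9]*[.])?[0-9]+$");

    // True if a number appears anywhere in the input.
    static boolean containsNumber(String input) {
        return UNANCHORED.matcher(input).find();
    }

    // True only if the anchored pattern fits from start to end.
    static boolean isNumber(String input) {
        return ANCHORED.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.2", "apple 1.2 pear"}) {
            System.out.println(s + " -> contains=" + containsNumber(s)
                    + " valid=" + isNumber(s));
        }
    }
}
```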

Note that some regex engines have a validate, isMatch or similar function, which essentially does what I've described automatically, returning true if a match is found and false if no match is found. Also keep in mind that some engines allow you to set flags which change the definition of ^ and $, matching the beginning/end of a line rather than the beginning/end of the entire input. This is typically not the default, but be on the lookout for these flags.
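Java is one engine with such a flag: Pattern.MULTILINE makes ^ and $ match at line boundaries. The following sketch (MultilineFlagDemo is an invented name) shows how an anchored pattern that rejects a multi-line input starts accepting it once the flag is set:

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing how MULTILINE changes ^ and $.
final class MultilineFlagDemo {
    static final String REGEX = "^[+-]?([0-9]*[.])?[0-9]+$";

    // Default: ^ matches only at the very start of the input.
    static final Pattern WHOLE_INPUT = Pattern.compile(REGEX);
    // MULTILINE: ^ and $ also match just after and before line terminators.
    static final Pattern PER_LINE = Pattern.compile(REGEX, Pattern.MULTILINE);

    static boolean wholeInputFind(String input) {
        return WHOLE_INPUT.matcher(input).find();
    }

    static boolean perLineFind(String input) {
        return PER_LINE.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "abc\n1.2\nxyz";
        System.out.println("default:   " + wholeInputFind(input));
        System.out.println("multiline: " + perLineFind(input));
    }
}
```

With the flag set, the middle line alone satisfies the pattern, so the anchors no longer guarantee that the whole input is a number.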
