Regexes for parsing formatted numbers


Problem description

I am parsing documents that contain large quantities of formatted numbers, for example:

 Frc consts  --     1.4362                 1.4362                 5.4100
 IR Inten    --     0.0000                 0.0000                 0.0000
 Atom AN      X      Y      Z        X      Y      Z        X      Y      Z
    1   6     0.00   0.00   0.00     0.00   0.00   0.00     0.00   0.00   0.00
    2   1     0.40  -0.20   0.23    -0.30  -0.18   0.36     0.06   0.42   0.26

These are separate lines, all with significant leading whitespace, and they may or may not have significant trailing whitespace. They consist of 72, 72, 78, 78, and 78 characters. I can deduce the boundaries between fields. They can be described using Fortran format notation (nx = n spaces, an = n alphanumeric characters, in = an integer in n columns, fm.n = a float of m characters with n places after the decimal point) by:

 (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
 (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
 (1x,a4,a4,3(2x,3a7))
 (1x,2i4,3(2x,3f7.2))
 (1x,2i4,3(2x,3f7.2))

I have potentially several thousand different formats (which I can autogenerate or farm out) and am describing them with regular expressions built from their components. Thus, if regf10_4 represents a regex for any string satisfying the f10.4 constraint, I can create a regex of the form:

COMMENTS 
      (\s
      .{14}
      \s
      regf10_4,
      \s{13}
      regf10_4,
      \s{13}
      regf10_4,
)
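The composite pattern above could be assembled in Java roughly as follows. This is only a sketch under assumptions: the class name `FrcConstsLine` and the constant `REG_F10_4` are hypothetical, and `REG_F10_4` is a stand-in that simply captures ten raw characters rather than validating them as a number. It uses only constructs available in Java 1.6.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FrcConstsLine {
    // Hypothetical stand-in for regf10_4: captures the raw 10-column field
    // without checking that it holds a well-formed number.
    static final String REG_F10_4 = "(.{10})";

    // Pattern.COMMENTS ignores unescaped whitespace and '#' comments,
    // so the pattern can be laid out one field per line.
    static final Pattern LINE = Pattern.compile(
          "\\s          # 1x\n"
        + "(.{14})      # a14: label\n"
        + "\\s          # 1x\n"
        + REG_F10_4 + " # f10.4\n"
        + "\\s{13}      # 13x\n"
        + REG_F10_4 + " # f10.4\n"
        + "\\s{13}      # 13x\n"
        + REG_F10_4 + " # f10.4\n"
        + "\\s*         # tolerate trailing whitespace\n",
        Pattern.COMMENTS);

    // Returns the three values, or null if the line does not fit the layout.
    // Double.parseDouble throws NumberFormatException for blank or '*' fields.
    static double[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        double[] values = new double[3];
        for (int i = 0; i < 3; i++) {
            values[i] = Double.parseDouble(m.group(i + 2).trim());
        }
        return values;
    }
}
```

Group 1 holds the 14-character label and groups 2 to 4 hold the numeric fields. Keeping the numeric check out of the pattern is a design choice of this sketch, not something the question prescribes.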

I would like to know whether there are regexes that can be reused in this way. Computers and humans produce numbers compatible with, say, f10.4 in a wide variety of ways. I believe the following are all legal input and/or output for Fortran (I do not require suffixes of the form f or d, as in 12.4f). [The formatting in SO should be read as no leading spaces for the first, one for the second, etc.]

-1234.5678
 1234.5678
            // missing number
 12345678.
 1.
 1.0000000
    1.0000
        1.
 0.
        0.
     .1234
    -.1234
    1E2
    1.E2
    1.E02
  -1.0E-02
**********  // number over/underflow
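One way to handle the variants listed above, once a field has been cut out at its exact columns, is sketched below. The class `FortranReal` and its conventions are assumptions made for illustration: a blank field is reported as `null` and an all-asterisk field as `NaN`. The sketch relies on `Double.parseDouble`, which accepts forms such as `1.`, `.1234` and `1.E02`; it does not model Fortran's scaling rule for input that lacks a decimal point.

```java
class FortranReal {
    // Decodes one fixed-width real field that was already sliced out of a line.
    // Assumed conventions (not from the original question):
    //   blank field        -> null (missing number)
    //   field of asterisks -> Double.NaN (overflow/underflow marker)
    static Double parse(String field) {
        String trimmed = field.trim();
        if (trimmed.length() == 0) {
            return null;
        }
        if (trimmed.matches("\\*+")) {
            return Double.valueOf(Double.NaN);
        }
        // Throws NumberFormatException for anything else that is not a number.
        return Double.valueOf(Double.parseDouble(trimmed));
    }
}
```

A caller would pass in exactly the ten characters of an f10.4 field, for example `FortranReal.parse(line.substring(16, 26))`.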

They have to be robust against the content of the neighbouring fields (e.g. examine precisely 10 characters at a precise position and nothing else). Thus the following are legal for (a1,f5.2,a1):

a-1.23b   // -1.23
- 1.23.   // 1.23
3 1.23-   // 1.23

I am using Java, so I need regex constructs compatible with Java 1.6 (e.g. no Perl extensions).

Recommended answer

As I understand it, each line comprises one or more fixed-width fields, which may contain labels, spaces, or data of different kinds. If you know the widths and types of the fields, extracting their data is a simple matter of substring(), trim() and (optionally) Whatever.parseWhatever(). Regexes can't make that job any easier--in fact, all they can do is make it a lot harder.
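The substring-based approach might look like the sketch below. The class name `FixedWidthSlicer` and the idea of passing the field widths as an `int[]` are assumptions for illustration. Skipped columns (the nx descriptors) are simply treated as fields that the caller ignores.

```java
class FixedWidthSlicer {
    // Cuts a line into consecutive fields of the given widths.
    // Fields that fall past the end of a short line come back truncated or empty,
    // so lines with stripped trailing whitespace do not cause an exception.
    static String[] slice(String line, int[] widths) {
        String[] fields = new String[widths.length];
        int pos = 0;
        for (int i = 0; i < widths.length; i++) {
            int start = Math.min(pos, line.length());
            int end = Math.min(pos + widths[i], line.length());
            fields[i] = line.substring(start, end);
            pos += widths[i];
        }
        return fields;
    }
}
```

For the format (1x,2i4,3(2x,3f7.2)) the widths would be {1,4,4, 2,7,7,7, 2,7,7,7, 2,7,7,7}. Each integer field can then go through `Integer.parseInt(field.trim())` and each real field through `Double.parseDouble(field.trim())`. Because the slicing is purely positional, the (a1,f5.2,a1) examples from the question are handled without any regard to what the neighbouring characters are.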

Scanner doesn't really help, either. True, it has predefined regexes for various value types, and it does the conversions for you, but it still needs to be told which type to look for each time, and it needs the fields to be separated by a delimiter it can recognize. Fixed-width data doesn't require delimiters, by definition. You might be able to fake the delimiters by doing a lookahead for however many characters should be left in the line, but that's just another way of making the job harder than it needs to be.

It sounds like performance is going to be a major concern; even if you could make a regex solution work, it would probably be too slow. Not because regexes are inherently slow, but because of the contortions you would have to go through to make them fit the problem. I suggest you forget about regexes for this job.
