Regular Expression For Parsing Data


Problem description

    I am writing an application that reads some data from a simple text file. The data files I am interested in have lines of the following form:

    Mem(100) = 120
    Mem(200) = 231
    Mem(43) = 12
    ...
    Mem(1293) = 12.54
    

    So, as you can understand, the pattern of each line is something like

    (\s)*(\t)*Mem([0-9]*) (\s,\t)*= (\s,\t)*[0-9]*(.)*[0-9]*
    

    That is, there can be any number of whitespace characters before the character sequence "Mem", followed by a left parenthesis. Then there is a number and a right parenthesis. After that, there is any number of whitespace characters until an '=' (equals) character is encountered, and then any number of whitespace characters until a (possibly) floating point number.

    How can I express that in a C++ regex pattern? I am really new to the regular expression concept in C++ so I would need some help.

    Thank you

    Solution

    First of all, remember to #include <regex>.

    C++ std::regex_match works much like regular expression matching in other languages. By default, std::regex uses the ECMAScript grammar.

    Let's start with a simple example:

    std::string str = "Mem(100)=120";
    std::regex regex("^Mem\\([0-9]+\\)=[0-9]+$");
    std::cout << std::regex_match(str, regex) << std::endl;
    

    In this case, our regex is ^Mem\([0-9]+\)=[0-9]+$. Let's take a look at what it does:

    • The ^ at the beginning anchors the pattern to the start of the string, so AMem(1)=2 should not match.
    • The $ at the end anchors the pattern to the end of the string, so Mem(1)=2x should not match. (std::regex_match already requires the whole string to match, so both anchors are redundant here, but harmless.)
    • \\( is a literal ( character. ( has a very special meaning in regular expressions, so we escape it as \(. However, the \ character has a special meaning in C++ strings, so we use \\( to tell C++ to pass the \( to the regular expression engine.
    • [0-9] matches a digit. \\d should also work, since \d is part of the default ECMAScript grammar, but some older standard library implementations shipped incomplete regex support.
    • [0-9]+ means at least one digit. If Mem() is acceptable, then use [0-9]* instead.

    As you can see, this is just like the regular expressions you'd find in other languages (such as Java or C#).
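
To make the bullet points above concrete, here is a minimal sketch that wraps the simple pattern in a helper function. The name `matches_simple` is hypothetical and not part of the original answer; it only exists so the cases mentioned in the list can be tried out.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (illustration only): returns true when the whole
// string has the form Mem(<digits>)=<digits>, with no whitespace allowed.
bool matches_simple(const std::string& line) {
    // Same pattern as in the answer; each backslash is doubled because
    // this is an ordinary C++ string literal.
    static const std::regex pattern("^Mem\\([0-9]+\\)=[0-9]+$");
    // std::regex_match succeeds only if the entire string matches.
    return std::regex_match(line, pattern);
}
```

With this helper, `matches_simple("Mem(100)=120")` returns true, while `"AMem(1)=2"`, `"Mem(1)=2x"` and `"Mem()=2"` are all rejected. A line with spaces around the `=` is rejected too, because this first pattern does not allow whitespace yet.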

    Now, to consider whitespace, use std::regex regex("^\\s*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+\\s*$");

    Note that \s includes \t, so no need to specify both. If it didn't, you'd use (\s|\t) or [\s\t], not (\s,\t).
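
As a rough check that \s really covers both spaces and tabs, the whitespace-tolerant pattern can be wrapped in a small function. This is only a sketch; `matches_with_whitespace` is an illustrative name, not something from the original answer.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (illustration only): accepts optional whitespace
// before Mem, around '=', and at the end of the line.
bool matches_with_whitespace(const std::string& line) {
    static const std::regex pattern(
        "^\\s*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+\\s*$");
    return std::regex_match(line, pattern);
}
```

Both `"  Mem(100) = 120"` (spaces) and `"\tMem(200)\t=\t231 "` (tabs) are accepted. `"Mem (43)=12"` is rejected, since the pattern allows no whitespace between `Mem` and the opening parenthesis, and `"Mem(1293) = 12.54"` is rejected because this version only handles integers.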

    Finally, to include floating point numbers, we first need to decide whether Mem(1) = 1. (that is, a dot without a number after it) is acceptable.

    If it isn't, then the .23 in 1.23 is optional. In regexes, we use ? to indicate that.

    std::regex regex("^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$");
    

    Note that we use \. instead of a bare dot. An unescaped . has a special meaning in regular expressions (it matches any character), so we need to escape it.
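
The difference between an escaped and an unescaped dot is easy to miss, so here is a small sketch that isolates just the number part of the pattern. Both function names are hypothetical and exist only for this comparison.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the dot is escaped, so only a literal '.' may
// separate the integer part from the fractional part.
bool is_number_escaped(const std::string& text) {
    static const std::regex pattern("^[0-9]+(\\.[0-9]+)?$");
    return std::regex_match(text, pattern);
}

// Hypothetical helper: the dot is NOT escaped, so any single character
// is accepted in place of the decimal point.
bool is_number_unescaped(const std::string& text) {
    static const std::regex pattern("^[0-9]+(.[0-9]+)?$");
    return std::regex_match(text, pattern);
}
```

Both functions accept `"12"` and `"12.54"`, but only the unescaped variant also accepts `"12x54"`, which is almost certainly not intended. The escaped variant rejects `"12."`, matching the decision that a dot must be followed by at least one digit.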

    If you have a compiler that supports raw strings (e.g. Visual Studio 2013, GCC 4.5, Clang 3.0), you can simplify the regex string:

    std::regex regex(R"(^[\s]*Mem\([0-9]+\)\s*=\s*[0-9]+(\.[0-9]+)?\s*$)");
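
A raw string literal is written as R"( ... )", with the parentheses acting as delimiters that are not part of the string. The sketch below, with illustrative names that do not appear in the original answer, writes the same pattern both ways and compares them, so the equivalence can be checked rather than taken on trust.

```cpp
#include <regex>
#include <string>

// The pattern as an ordinary string literal: every backslash is doubled.
const std::string escaped_pattern =
    "^[\\s]*Mem\\([0-9]+\\)\\s*=\\s*[0-9]+(\\.[0-9]+)?\\s*$";

// The same pattern as a raw string literal: backslashes are written once.
// The outer R"( and )" are delimiters, not part of the pattern.
const std::string raw_pattern =
    R"(^[\s]*Mem\([0-9]+\)\s*=\s*[0-9]+(\.[0-9]+)?\s*$)";

// Hypothetical helper: true when both spellings yield the same string.
bool patterns_are_identical() {
    return escaped_pattern == raw_pattern;
}

// Hypothetical helper: matches a line against the raw-string pattern.
bool matches_full(const std::string& line) {
    static const std::regex pattern(raw_pattern);
    return std::regex_match(line, pattern);
}
```

Here `patterns_are_identical()` returns true, and `matches_full("Mem(1293) = 12.54")` accepts the floating point line from the question.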
    

    To extract information about the matched string, you can use std::smatch and groups.

    Let's start with a small change:

    std::string str = " Mem(100)=120";
    std::regex regex("^[\\s]*Mem\\(([0-9]+)\\)\\s*=\\s*([0-9]+(\\.[0-9]+)?)\\s*$");
    std::smatch m;
    
    std::cout << std::regex_match(str, m, regex) << std::endl;
    

    Note three things:

    1. We added smatch. This class stores extra result info about the match.
    2. We added additional parentheses around the [0-9]+ inside Mem(). This defines a group. Groups tell the regex engine to keep track of whatever is within them.
    3. Yet more parentheses around the floating point number. This defines a second group. The parentheses around the optional fractional part, (\\.[0-9]+)?, were already there and form a third group.

    Very importantly, the parentheses that define groups are NOT escaped, since we don't want them to match actual parenthesis characters. We actually want the special regex meaning.

    Now that we have the groups, we can use them:

    for (auto result : m) {
        std::cout << result << std::endl;
    }
    

    This will first print the whole string, then the number in Mem(), then the final number, and finally the optional fractional part (an empty line here, because 120 has no decimal point).
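
To see exactly what the loop iterates over, the sub-matches can be collected into a vector instead of being printed. This is only a sketch: `capture_groups` is a hypothetical helper, and it relies on the fact that the pattern contains three pairs of capturing parentheses (the third one being the optional fractional part).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every sub-match as a string, or an empty
// vector if the line does not match at all.
std::vector<std::string> capture_groups(const std::string& line) {
    static const std::regex pattern(
        "^[\\s]*Mem\\(([0-9]+)\\)\\s*=\\s*([0-9]+(\\.[0-9]+)?)\\s*$");
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_match(line, m, pattern)) {
        for (const auto& sub : m) {
            // A group that took no part in the match yields an empty string.
            groups.push_back(sub.str());
        }
    }
    return groups;
}
```

For `" Mem(100)=120"` the result has four entries: the whole string, `"100"`, `"120"`, and an empty string for the fractional group. For `"Mem(1293) = 12.54"` the last two entries are `"12.54"` and `".54"`.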

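    In other words, m[0] gives us the whole match, m[1] gives us the first group (the number in Mem()), m[2] gives us the second group (the value after the equals sign), and m[3] gives us the third group, the optional fractional part, which is empty when the value has no decimal point.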
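
Putting the groups to use, one possible way to turn a line into typed values is sketched below. The `MemEntry` struct and `parse_mem_line` function are hypothetical names chosen for illustration, and the sketch assumes a C++17 compiler for std::optional. Note that std::stoi throws std::out_of_range if the number in Mem() does not fit in an int, which is not handled here.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical record type for one parsed line (names are illustrative).
struct MemEntry {
    int address;   // the number inside Mem(...)
    double value;  // the number after '='
};

// Returns the parsed entry, or std::nullopt if the line does not match.
std::optional<MemEntry> parse_mem_line(const std::string& line) {
    static const std::regex pattern(
        R"(^\s*Mem\(([0-9]+)\)\s*=\s*([0-9]+(\.[0-9]+)?)\s*$)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return std::nullopt;
    }
    // m[1] is the address and m[2] is the complete value; m[3] (the
    // optional fraction) is already contained in m[2], so it is not needed.
    return MemEntry{std::stoi(m[1].str()), std::stod(m[2].str())};
}
```

Calling `parse_mem_line("  Mem(1293) = 12.54")` yields an entry with address 1293 and a value of about 12.54, while lines such as `"..."` or `"Mem(1) = 1."` yield std::nullopt. An application reading the data file could call this once per line obtained from std::getline and skip the lines that do not parse.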
