C Regular Expressions: Extracting the Actual Matches

Question

I am using regular expressions in C (using the "regex.h" library). After setting up the standard calls (and checks) for regcomp(...) and regexec(...), I can only manage to print the actual substrings that match my compiled regular expression. According to the manual pages, regexec stores the substring matches in a structure known as "regmatch_t". The struct only contains rm_so and rm_eo, which I understand to reference the addresses of the characters of the matched substring in memory. My question is: how can I use these two offsets and two pointers to extract the actual substring and store it in an array (ideally a 2D array of strings)?

It works when you just print to standard output, but whenever I try the same setup and store the result in a string/character array, it stores the entire string that was originally matched against the expression. Further, what is the "%.*s" inside the print statement? I imagine it is a regular expression in itself that reads the pointers to a character array correctly. I just want to store the matched substrings in a collection so I can work with them elsewhere in my software.

Background: p and p2 are both pointers set to point to the start of the string to match before entering the while loop in the code below:

int ind = 0;
while(1){
    regExErr1 = regexec(&r, p, 10, m, 0);
    //printf("Did match regular expr, value %i\n", regExErr1);
    if( regExErr1 != 0 ){ 
        fprintf(stderr, "No more matches with the inherent regular expression!\n"); 
        break; 
    }   
    printf("What was found was: ");
    int i = 0;
    while(1){
        if(m[i].rm_so == -1){
            break;
        }
        int start = m[i].rm_so + (p - p2);
        int finish = m[i].rm_eo + (p - p2);
        strcpy(matches[ind], ("%.*s\n", (finish - start), p2 + start));
        printf("Storing:  %.*s", matches[ind]);
        ind++;
        printf("%.*s\n", (finish - start), p2 + start);
        i++;
    }
    p += m[0].rm_eo; // this will move the pointer p to the end of last matched pattern and on to the start of a new one
}
printf("We have in [0]:  %s\n", temp);

Recommended answer

There are quite a lot of regular expression packages, but yours seems to match the one in POSIX: regcomp() etc.

The two structures it defines in <regex.h> are:

  • regex_t containing at least size_t re_nsub, the number of parenthesized subexpressions.

  • regmatch_t containing at least regoff_t rm_so, the byte offset from the start of the string to the start of the substring, and regoff_t rm_eo, the byte offset from the start of the string to the first character after the end of the substring.

Note that 'offsets' are not pointers but indexes into the character array.
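Because the offsets are indexes, the matched text for entry i starts at string + rm_so and is rm_eo - rm_so bytes long. The "%.*s" in the question is not a regular expression: it is an ordinary printf() conversion in which the * takes the precision (the maximum number of bytes to print) from an int argument. To store a match rather than print it, the bytes have to be copied and NUL-terminated. The following is a minimal sketch of that step; copy_match() is a hypothetical helper name, not part of <regex.h>.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper (not part of POSIX): copy the text described by one
 * regmatch_t out of `subject` into `dst`, which holds `dstsize` bytes.
 * Returns 0 on success, or -1 if the group did not take part in the match
 * or the text does not fit in `dst`. */
static int copy_match(const char *subject, const regmatch_t *m,
                      char *dst, size_t dstsize)
{
    if (m->rm_so == -1)
        return -1;                      /* unused subexpression */

    size_t len = (size_t)(m->rm_eo - m->rm_so);
    if (len >= dstsize)
        return -1;                      /* no room for text plus NUL */

    memcpy(dst, subject + m->rm_so, len);   /* offsets index into subject */
    dst[len] = '\0';
    return 0;
}
```

The subject pointer must be the same one that was passed to regexec(), since the offsets are relative to it.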

The function to execute is:

  • int regexec(const regex_t *restrict preg, const char *restrict string, size_t nmatch, regmatch_t pmatch[restrict], int eflags);

Your printing code should be:

for (int i = 0; i <= r.re_nsub; i++)
{
    int start = m[i].rm_so;
    int finish = m[i].rm_eo;
//  strcpy(matches[ind], ("%.*s\n", (finish - start), p + start));  // Based on question
    sprintf(matches[ind], "%.*s\n", (finish - start), p + start);   // More plausible code
    printf("Storing:  %.*s\n", (finish - start), matches[ind]);     // Print once
    ind++;
    printf("%.*s\n", (finish - start), p + start);                  // Why print twice?
}

Note that the code should be upgraded to ensure that the string copy (via sprintf()) does not overflow the target string, perhaps by using snprintf() instead of sprintf(). It is also a good idea to mark the start and end of a string when printing. For example:

    printf("<<%.*s>>\n", (finish - start), p + start);

This makes it a whole heap easier to see spaces etc.
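As a rough illustration of the snprintf() suggestion, the sketch below formats at most len bytes into a fixed-size buffer and reports truncation by checking the return value. store_bounded() is a hypothetical name chosen for this example.

```c
#include <stdio.h>

/* Hypothetical helper: write up to `len` bytes starting at `src` into `dst`
 * (capacity `dstsize`) using snprintf(), which never writes past the buffer.
 * Returns 0 if everything fitted, -1 on error or truncation. */
static int store_bounded(char *dst, size_t dstsize, const char *src, int len)
{
    /* snprintf() returns the length the full output would have had */
    int n = snprintf(dst, dstsize, "%.*s", len, src);
    return (n < 0 || (size_t)n >= dstsize) ? -1 : 0;
}
```

In the loop above it could be called as store_bounded(matches[ind], sizeof matches[ind], p + start, finish - start), assuming matches is declared as a true two-dimensional char array so that sizeof yields the row size.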

[In future, please attempt to provide an MCVE (Minimal, Complete, Verifiable Example) or SSCCE (Short, Self-Contained, Correct Example) so that people can help more easily.]

This is an SSCCE that I created, probably in response to another SO question in 2010. It is one of a number of programs I keep that I call 'vignettes': little programs that show the essence of some feature (such as POSIX regexes, in this case). I find them useful as memory joggers.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <regex.h>

#define tofind    "^DAEMONS=\\(([^)]*)\\)[ \t]*$"

int main(int argc, char **argv)
{
    FILE *fp;
    char line[1024];
    int retval = 0;
    regex_t re;
    regmatch_t rm[2];
    //this file has this line "DAEMONS=(sysklogd network sshd !netfs !crond)"
    const char *filename = "/etc/rc.conf";

    if (argc > 1)
        filename = argv[1];

    if (regcomp(&re, tofind, REG_EXTENDED) != 0)
    {
        fprintf(stderr, "Failed to compile regex '%s'\n", tofind);
        return EXIT_FAILURE;
    }
    printf("Regex: %s\n", tofind);
    printf("Number of captured expressions: %zu\n", re.re_nsub);

    fp = fopen(filename, "r");
    if (fp == 0)
    {
        fprintf(stderr, "Failed to open file %s (%d: %s)\n", filename, errno, strerror(errno));
        regfree(&re);
        return EXIT_FAILURE;
    }

    while ((fgets(line, 1024, fp)) != NULL)
    {
        // Remove the trailing newline, if any
        line[strcspn(line, "\n")] = '\0';
        if ((retval = regexec(&re, line, 2, rm, 0)) == 0)
        {
            printf("<<%s>>\n", line);
            // Complete match
            printf("Line: <<%.*s>>\n", (int)(rm[0].rm_eo - rm[0].rm_so), line + rm[0].rm_so);
            // Match captured in (...) - the ( and ) match literal parenthesis
            printf("Text: <<%.*s>>\n", (int)(rm[1].rm_eo - rm[1].rm_so), line + rm[1].rm_so);
            char *src = line + rm[1].rm_so;
            char *end = line + rm[1].rm_eo;
            // Split the captured text into space-separated names
            while (src < end)
            {
                size_t len = strcspn(src, " ");
                if (src + len > end)
                    len = end - src;
                printf("Name: <<%.*s>>\n", (int)len, src);
                src += len;
                src += strspn(src, " ");
            }
        }
    } 
    fclose(fp);
    regfree(&re);
    return EXIT_SUCCESS;
}

This was designed to find a particular line starting DAEMONS= in a file /etc/rc.conf (but you can specify an alternative file name on the command line). You can adapt it to your purposes easily enough.
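One possible adaptation to the goal stated in the question (collecting every match in a string into a 2D array of strings) is sketched below. It is not part of the original answer. The name collect_matches() and the limits MAX_MATCHES and MAX_LEN are assumptions made for illustration, and over-long matches are simply truncated.

```c
#include <regex.h>
#include <string.h>

#define MAX_MATCHES 16   /* assumed capacity, for illustration only */
#define MAX_LEN     64   /* assumed maximum stored length, including NUL */

/* Hypothetical helper: store each successive non-overlapping match of `re`
 * in `subject` as a NUL-terminated string in `out`; returns the count. */
static size_t collect_matches(const regex_t *re, const char *subject,
                              char out[][MAX_LEN], size_t max)
{
    size_t count = 0;
    const char *p = subject;
    regmatch_t m[1];
    int eflags = 0;

    while (count < max && regexec(re, p, 1, m, eflags) == 0)
    {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= MAX_LEN)
            len = MAX_LEN - 1;              /* truncate rather than overflow */
        memcpy(out[count], p + m[0].rm_so, len);
        out[count][len] = '\0';
        count++;

        if (m[0].rm_eo == m[0].rm_so)       /* empty match: step forward */
        {
            if (p[m[0].rm_eo] == '\0')
                break;
            p += m[0].rm_eo + 1;
        }
        else
            p += m[0].rm_eo;                /* resume after this match */
        eflags = REG_NOTBOL;                /* p is no longer a line start */
    }
    return count;
}
```

A caller would declare char matches[MAX_MATCHES][MAX_LEN], compile the pattern with regcomp(), and pass both to collect_matches(). Offsets from each regexec() call are relative to the pointer passed to that call, which is why the copy uses p rather than the original subject.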
