Parse string into array based on spaces or "double quotes strings"

Problem description

I'm trying to take a user input string and parse it into an array declared as char *entire_line[100]; where each word is put at a different index of the array, but if part of the string is enclosed in double quotes, that part should be put in a single index. So if I have

char buffer[1024]={0,};
fgets(buffer, 1024, stdin);

Example input: word filename.txt "this is a string that should take up one index in an output array"

tokenizer=strtok(buffer," ");//break up by spaces
do{
    if(strchr(tokenizer,'"')){//check if a word starts with a "
        is_string=YES;
        entire_line[i]=tokenizer;// if so, put that word into current index
        tokenizer=strtok(NULL,"\""); //should get rest of string until end "
        strcat(entire_line[i],tokenizer); //append the two together, I'll take care of the missing space once I figure out this issue
    }
    entire_line[i]=tokenizer;
    i++;
}while((tokenizer=strtok(NULL," \n"))!=NULL);

This clearly isn't working; it only gets close if the double-quoted string is at the end of the input string. But I could have input like this:

word "this is text that will be user entered" filename.txt

I've been trying to figure this out for a while and always get stuck somewhere. Thanks.

Recommended answer

The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple whitespace-separated words. (Even then it's still not great due to lack of re-entrance and recursion ability, which is why we invented strsep for BSD way back when.)
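
To make the re-entrance point concrete, here is a minimal sketch of nested tokenizing. It uses strtok_r, the POSIX variant that keeps its position in a caller-supplied pointer, rather than strsep, so it is a swapped-in illustration and not the approach named above. The helper name split_pairs and the "key=value" input format are assumptions made up for this example. With plain strtok, the inner call would overwrite the outer call's hidden state and the outer loop would stop after the first pair.

```c
#define _POSIX_C_SOURCE 200809L /* exposes strtok_r on strict compilers */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original answer): splits a line such as
 * "a=1 b=2" in place. Stores up to max key/value pairs and returns how many
 * were stored. vals[i] is NULL when a pair has no '=' part. */
int split_pairs(char *line, char *keys[], char *vals[], int max)
{
    char *outer_save = NULL; /* position of the outer (space) tokenizer */
    int n = 0;

    for (char *pair = strtok_r(line, " \n", &outer_save);
         pair != NULL && n < max;
         pair = strtok_r(NULL, " \n", &outer_save)) {
        char *inner_save = NULL; /* separate position for the inner ('=') tokenizer */
        char *key = strtok_r(pair, "=", &inner_save);

        if (key == NULL)
            continue; /* pair consisted only of '=' characters */
        keys[n] = key;
        vals[n] = strtok_r(NULL, "=", &inner_save);
        n++;
    }
    return n;
}
```

Because each loop owns its own save pointer, the two tokenizers do not interfere with each other.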

Your best bet in this case is to build your own simple state-machine:

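/* requires #include <ctype.h> for isspace() */
char *p;
char *start_of_word = NULL; /* first character of the token being scanned */
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;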

for (p = buffer; *p != '\0'; p++) {
    c = (unsigned char) *p; /* convert to unsigned char for is* functions */
    switch (state) {
    case DULL: /* not in a word, not in a double quoted string */
        if (isspace(c)) {
            /* still not in a word, so ignore this char */
            continue;
        }
        /* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
        if (c == '"') {
            state = IN_STRING;
            start_of_word = p + 1; /* word starts at *next* char, not this one */
            continue;
        }
        state = IN_WORD;
        start_of_word = p; /* word starts here */
        continue;

    case IN_STRING:
        /* we're in a double quoted string, so keep going until we hit a close " */
        if (c == '"') {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_STRING or we handled the end above */

    case IN_WORD:
        /* we're in a word, so keep going until we get to a space */
        if (isspace(c)) {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_WORD or we handled the end above */
    }
}

Note that this does not account for the possibility of a double quote inside a word, e.g.:

"some text in quotes" plus four simple words p"lus something strange"

Work through the state machine above and you will see that "some text in quotes" turns into a single token (that ignores the double quotes), but p"lus is also a single token (that includes the quote), something is a single token, and strange" is a token. Whether you want this, or how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-building tool like flex.

Also, when the for loop exits, if state is not DULL, you need to handle the final word (I left this out of the code above) and decide what to do if state is IN_STRING (meaning there was no close-double-quote).
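
One way to fill in the pieces left out above is sketched below. It is only an illustration under several assumptions: the function name split_line, the choice to terminate each token in place by overwriting its delimiter with '\0', the silent dropping of tokens beyond max, and the decision to return -1 for an unterminated quote are all made up for this example. You may well want different behavior.

```c
#include <ctype.h>
#include <stddef.h>

enum states { DULL, IN_WORD, IN_STRING };

/* Hypothetical helper: splits buf in place and stores pointers to the tokens
 * in out[]. Returns the number of tokens stored, or -1 if a double quote was
 * opened but never closed. Tokens beyond max are silently dropped. */
int split_line(char *buf, char *out[], int max)
{
    enum states state = DULL;
    char *start_of_word = NULL;
    int n = 0;

    for (char *p = buf; *p != '\0'; p++) {
        int c = (unsigned char) *p; /* safe argument for isspace() */

        switch (state) {
        case DULL:
            if (isspace(c))
                continue;
            if (c == '"') {
                state = IN_STRING;
                start_of_word = p + 1; /* skip the opening quote */
            } else {
                state = IN_WORD;
                start_of_word = p;
            }
            continue;

        case IN_STRING:
            if (c == '"') {
                *p = '\0'; /* closing quote becomes the terminator */
                if (n < max)
                    out[n++] = start_of_word;
                state = DULL;
            }
            continue;

        case IN_WORD:
            if (isspace(c)) {
                *p = '\0'; /* the space becomes the terminator */
                if (n < max)
                    out[n++] = start_of_word;
                state = DULL;
            }
            continue;
        }
    }

    if (state == IN_STRING)
        return -1; /* no closing double quote */
    if (state == IN_WORD && n < max)
        out[n++] = start_of_word; /* final word ends at the buffer's own '\0' */
    return n;
}
```

With the buffer read by fgets, a call such as split_line(buffer, entire_line, 100) would leave the tokens in entire_line. Since the pointers refer into buffer itself, they stay valid only as long as buffer is not overwritten.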
