How to tokenize (words) classifying punctuation as space


Question

Based on this question, which was closed rather quickly:
Trying to create a program to read a users input then break the array into seperate words are my pointers all valid? (http://stackoverflow.com/questions/6147866/hey-guys-im-trying-to-create-a-program-to-read-a-users-input-then-break-the-array/6148403#6148403)

Rather than closing it, I think some extra work could have gone into helping the OP clarify the question.

I want to tokenize user input and store the tokens in an array of words.
I want to use punctuation (.,-) as delimiters, so that it is removed from the token stream.

In C I would use strtok() to break an array into tokens and then manually build an array.
Like this:

Main function:

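#include <stdio.h>   //getchar(), printf()
#include <stdlib.h>  //malloc(), free()
#include <string.h>  //strtok()
#include <iostream>

using std::cin;
using std::cout;

char **findwords(char *str);

int main()
{
    int     test;
    char    words[100]; //an array of chars to hold the string given by the user
    char    **word;  //pointer to a list of words
    int     index = 0; //next free slot in words, later the index of the word we are printing
    int     c; //int rather than char so that EOF can be detected

    cout << "die monster !";
    //a loop to place the characters that the user put in into the array

    while (index < 99 && (c = getchar()) != EOF && c != '\n')
    {
        words[index] = (char)c;
        index ++; //move on to the next free slot
    }
    words[index] = '\0'; //terminate the string so strtok() can find its end

    word = findwords(words);

    index = 0; //start printing from the first word
    while (word[index] != 0) //loop through the list of words until the end of the list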
    {
        printf("%s\n", word[index]); // while the words are going through the list print them out
        index ++; //move on to the next word
    }

    //free it from the list since it was dynamically allocated
    free(word);
    cin >> test;

    return 0;
}

Tokenizer:

char **findwords(char *str)
{
    int     size = 20; //original size of the list 
    char    *newword; //pointer to the new word from strtok
    int     index = 0; //our current location in words
    char    **words = (char **)malloc(sizeof(char *) * (size +1)); //this is the actual list of words

    /* Get the initial word, and pass in the original string we want strtok() *
     *   to work on. Here, we are separating words based on spaces, commas,   *
     *   periods, and dashes. IE, if they are found, a new word is created.   */

    newword = strtok(str, " ,.-");

    while (newword != 0) //create a loop that goes through the string until it gets to the end
    {
        if (index == size)
        {
            //if the string has more words than the array can hold, increase the maximum size of the array
            size += 10;
            //resize the existing array rather than allocating a second one, so the stored words are kept
            words = (char **)realloc(words, sizeof(char *) * (size +1));
        }
        //assign words to its proper value
        words[index] = newword;
        //get the next word in the string
        newword = strtok(0, " ,.-");
        //increment the index to get to the next word
        ++index;
    }
    words[index] = 0;

    return words;
}

Any comments on the above code would be appreciated.
But, additionally, what is the best technique for achieving this goal in C++?

Recommended answer

How to tokenize a stream in C++ is already covered by a lot of questions.
Example: How to read a file and get words in C++

But what is harder to find is how to get the same functionality as strtok():

Basically, strtok() allows you to split the string on a whole set of user-defined characters, while a C++ stream only allows you to use white space as a separator. Fortunately, what counts as white space is defined by the locale, so we can modify the locale to treat other characters as space. This then allows us to tokenize the stream in a more natural fashion.
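To illustrate the limitation, here is a minimal sketch of what operator >> does with the default classification. The helper name splitOnWhitespace is an assumption made for this example and does not come from the question; the stream is imbued with the classic "C" locale so the behaviour does not depend on the global locale.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Illustrative helper (hypothetical name): split text with operator >>
// under the classic "C" locale, where only white space separates words.
std::vector<std::string> splitOnWhitespace(std::string const& text)
{
    std::istringstream in;
    in.imbue(std::locale::classic());   // imbue before the stream is read
    in.str(text);

    std::vector<std::string> words;
    for (std::string word; in >> word;)  // one white-space separated word per read
    {
        words.push_back(word);
    }
    return words;
}
```

With the input "This, stri,plop" the punctuation stays glued to the words, because ',' is not classified as space in this locale.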

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// This is my facet that will treat the ,.- as space characters and thus ignore them.
class WordSplitterFacet: public std::ctype<char>
{
    public:
        typedef std::ctype<char>    base;
        typedef base::char_type     char_type;

        WordSplitterFacet(std::locale const& l)
            : base(table)
        {
            std::ctype<char> const&  defaultCType  = std::use_facet<std::ctype<char> >(l);

            // Copy the default value from the provided locale
            static  char data[256];
            for(int loop = 0;loop < 256;++loop) { data[loop] = loop;}
            defaultCType.is(data, data+256, table);

            // Modifications to default to include extra space types.
            table[',']  |= base::space;
            table['.']  |= base::space;
            table['-']  |= base::space;
        }
    private:
        base::mask  table[256];
};

We can then use this facet in a locale like this:

    std::ctype<char>*   wordSplitter(new WordSplitterFacet(std::locale()));

    <stream>.imbue(std::locale(std::locale(), wordSplitter));

The next part of your question is how to store these words in an array. Well, in C++ you would not. You would delegate this functionality to std::vector and std::string. Reading your code, you will see that it does two major things in the same part of the code:


  • It is doing memory management.

  • It is tokenizing the data.

There is a basic principle, Separation of Concerns, under which a piece of code should try to do only one of two things. It should either do resource management (memory management in this case) or do business logic (tokenization of the data). Separating these into different parts of the code makes the code easier to use and easier to write. Fortunately, in this example all the resource management is already done by std::vector and std::string, which lets us concentrate on the business logic.

As has been shown many times, the easy way to tokenize a stream is to use operator >> with a std::string. This breaks the stream into words. You can then use stream iterators to loop across the stream automatically, tokenizing it as you go.

std::vector<std::string>  data;
for(std::istream_iterator<std::string> loop(<stream>); loop != std::istream_iterator<std::string>(); ++loop)
{
    // In here loop is an iterator that has tokenized the stream using the
    // operator >> (which for std::string reads one space separated word).

    data.push_back(*loop);
}

If we combine this with a standard algorithm, the code simplifies to:

std::copy(std::istream_iterator<std::string>(<stream>), std::istream_iterator<std::string>(), std::back_inserter(data));

Now combining all the above into a single application:

int main()
{
    // Create the facet.
    std::ctype<char>*   wordSplitter(new WordSplitterFacet(std::locale()));

    // Here I am using a string stream.
    // But any stream can be used. Note you must imbue a stream before it is used.
    // Otherwise the imbue() will silently fail.
    std::stringstream   teststr;
    teststr.imbue(std::locale(std::locale(), wordSplitter));

    // Now that it is imbued we can use it.
    // If this was a file stream then you could open it here.
    teststr << "This, stri,plop";

    std::cout << "die monster !\n";
    std::vector<std::string>    data;
    std::copy(std::istream_iterator<std::string>(teststr), std::istream_iterator<std::string>(), std::back_inserter(data));

    // Copy the array to cout one word per line
    std::copy(data.begin(), data.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}
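To get closer to strtok(), where the delimiters are an argument rather than hard-coded, one possible variation is sketched below. The names DelimiterFacet and splitWords are hypothetical and not part of the answer above. The sketch assumes the classic "C" table is a suitable starting point, and it relies on the std::ctype<char> constructor's del argument so that the facet deletes its own table.

```cpp
#include <algorithm>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: the classic "C" classification table with every
// character in delims additionally marked as space.
class DelimiterFacet : public std::ctype<char>
{
public:
    explicit DelimiterFacet(std::string const& delims)
        : std::ctype<char>(makeTable(delims), true)  // true: facet delete[]s the table
    {}

private:
    static mask* makeTable(std::string const& delims)
    {
        mask* table = new mask[table_size];
        std::copy(classic_table(), classic_table() + table_size, table);
        for (char ch : delims)
        {
            table[static_cast<unsigned char>(ch)] |= space;
        }
        return table;
    }
};

// Hypothetical helper: split text on white space plus the given delimiters.
std::vector<std::string> splitWords(std::string const& text, std::string const& delims)
{
    std::istringstream in;
    // The locale takes ownership of the facet (default reference count of 0).
    in.imbue(std::locale(in.getloc(), new DelimiterFacet(delims)));
    in.str(text);

    std::vector<std::string> words;
    for (std::string word; in >> word;)
    {
        words.push_back(word);
    }
    return words;
}
```

Calling splitWords("This, stri,plop", ",.-") should yield the three words This, stri and plop, matching the output of the application above.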
