Splits of (entire) words in C++


Problem description


Let's suppose that we have a text which has to be processed by a set of workers (mappers in a map/reduce project): each portion of the text (split) has to be consistent (no word can be split between two workers) and the sizes of the splits should be as equal as possible (load balancing).

Here's my algorithm:

  1. Divide the text into equal splits, each described by an off_t begin and an off_t end (the starting and ending byte offsets inside the text).
  2. Make each split consistent: if the split ends in the middle of a word, jump to the next white space (and do the same with the begin). Without this phase the result may be wrong: think of the famous "word count" example in map/reduce.
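
Step 1 could be sketched as follows. This is only an illustration under assumptions: `Split` and `even_splits` are hypothetical names, and `std::size_t` stands in for the `off_t` offsets mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open byte range [begin, end) inside the text (hypothetical type).
struct Split {
    std::size_t begin;
    std::size_t end;
};

// Cut `size` bytes into `workers` consecutive ranges whose lengths
// differ by at most one byte.
std::vector<Split> even_splits(std::size_t size, std::size_t workers) {
    assert(workers > 0);
    std::vector<Split> splits;
    splits.reserve(workers);
    const std::size_t base = size / workers;
    const std::size_t extra = size % workers;  // the first `extra` splits get one more byte
    std::size_t pos = 0;
    for (std::size_t i = 0; i < workers; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        splits.push_back({pos, pos + len});
        pos += len;
    }
    return splits;
}
```

For a 28-byte text and 4 workers this gives the ranges [0,7), [7,14), [14,21) and [21,28); none of them is aligned to word boundaries yet, which is what step 2 is for.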

Input (text length = 29 chars):

Hello Darkness my old friend

Phase 1 with 4 Workers (~= 7 chars per worker):

|Hello D|arkness| my old| friend|

Phase 2:

|Hello Darkness| |my old |friend|

This is the code of Phase 2 for each split:

        ifstream ifs(file, ifstream::in);
        char c;
        string s;
        if (begin > 0) {  // begin == 0 is the first split: it cannot break a word
            // If the char before the first one is neither ' ' nor '\n',
            // the split begins in the middle of a word (bad).
            ifs.seekg(begin - 1, ios::beg);
            ifs.get(c);
            if (c != ' ' && c != '\n') {
                getline(ifs, s, ' ');  // jump to the next white space
                begin += s.length();
            }
            if (begin > end)
                end = begin;
        }
        ifs.seekg(end, ios::beg);
        ifs.get(c);
        if (c != ' ' && c != '\n' && end != size) {
            getline(ifs, s, ' ');
            end += s.length();
        }

As you can see, this solution has two problems:

  1. Split sizes may be unbalanced (look at the first and second splits!)
  2. The code is a little bit tricky
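
On the second point, one way the boundary logic might be simplified is to apply a single rule to both ends of a split, so that adjacent splits always agree on their shared boundary. The following is a minimal sketch, assuming the text is already in memory as a `std::string` rather than read through an `ifstream`; `snap_forward` and `is_word_break` are hypothetical names, not part of the code above.

```cpp
#include <cstddef>
#include <string>

// Same delimiters as the snippet above: space and newline.
inline bool is_word_break(char c) { return c == ' ' || c == '\n'; }

// If `pos` falls inside a word, move it forward to the next delimiter
// (or to the end of the text); otherwise return it unchanged.
std::size_t snap_forward(const std::string& text, std::size_t pos) {
    if (pos == 0) return 0;                        // the start of the text never cuts a word
    if (pos >= text.size()) return text.size();    // clamp to the end of the text
    if (is_word_break(text[pos - 1])) return pos;  // previous byte is a delimiter: nothing is cut
    while (pos < text.size() && !is_word_break(text[pos])) ++pos;
    return pos;
}
```

Taking the sample sentence as 28 bytes (no trailing newline) with raw boundaries at 7, 14 and 21, the rule returns 14, 14 and 21, so the splits become "Hello Darkness", an empty split, " my old" and " friend". The imbalance from the first point is still there; this only addresses the trickiness.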

Do you have any suggestions to improve the solution (in terms of elegance and/or performance)?

Solution

I think that algorithmically, you can't do much better (and if you are doing Map|Reduce you presumably have lots of data so the difference will be small). The work of arranging the split as Hello| Darkness| my old| friend is going to be excessive.

A slight improvement might be:

  • Estimate where the first split should be (character 7 in this case)
  • Read to it.
  • Look forward and back for the closest space.
  • Split on that.
  • If you have searched back, then you already have the start of the next chunk in your hand.
  • Update your estimate of how long the next chunk should be.
  • Loop
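
The steps above might look roughly like this. It is a sketch under assumptions, not a definitive implementation: the text is held in memory as a `std::string` instead of being read from a file, only ' ' and '\n' count as white space, and `balanced_cuts` and `is_ws` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

inline bool is_ws(char c) { return c == ' ' || c == '\n'; }

// Returns workers - 1 cut positions; chunk i is [cut[i-1], cut[i]),
// with 0 and text.size() as the implicit outer boundaries.
std::vector<std::size_t> balanced_cuts(const std::string& text, std::size_t workers) {
    assert(workers > 0);
    std::vector<std::size_t> cuts;
    std::size_t start = 0;  // start of the chunk currently being cut
    for (std::size_t left = workers; left > 1; --left) {
        // Re-estimate the chunk length from what is still unassigned.
        const std::size_t target = start + (text.size() - start) / left;

        // Closest white space at or after the estimate (or end of text).
        std::size_t fwd = target;
        while (fwd < text.size() && !is_ws(text[fwd])) ++fwd;

        // Closest white space at or before the estimate, strictly after `start`.
        std::size_t back = target;
        while (back > start && !is_ws(text[back])) --back;
        const bool has_back = back > start;

        // Split on whichever candidate is nearer to the estimate.
        const std::size_t cut =
            (has_back && target - back <= fwd - target) ? back : fwd;
        cuts.push_back(cut);
        start = cut;  // the next chunk starts where this one ended
    }
    return cuts;
}
```

For the sample sentence (taken as 28 bytes) and 4 workers, this yields cuts at 5, 14 and 21, that is Hello| Darkness| my old| friend. Each white-space byte stays at the front of the chunk that follows it, so no word is shared between two workers.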

The only trouble is that this is going to make the code even more complex.
