从字符串中随机选择特定子序列 [英] Randomly selecting specific subsequence from string

查看:159
本文介绍了从字符串中随机选择特定子序列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定一个字符串,其中包含穿插有破折号的多个字符,例如 string s =A - TG-DF ---- GR - ; 我想随机选择一个破折号(可以是1,2,...,字符串中连续破折号的最大数量),然后将它们复制到随机选择的字符串的另一部分。



例如, A - TG-DF(---) - GR - 被移动到 A- -T(---)G-DF-GR- while
另一个迭代可以给出 A - TG-DF ---- GR )被移动到 A - TG - ( - )DF ---- GR



我通过 int i = rand()%(int)s.length(); 生成字符串的随机索引。插入通过 s.insert(rand()%(int)s.length(),substr); ,其中 substr 我的主要问题在于随机发现一连串的破折号。我想到使用 s.find( - ),但是只返回单个短划线的第一个实例,而不是一个短划线集合的随机位置。

解决方案

我知道这个问题很可能在 XY问题,但是我发现它是一个很好的挑战,所以我考虑用Boost Interval Container库实现它。



图书馆的优点是,你可以忘记很多细节,而不会牺牲很多性能。



我已经自由地推广它,以便它能够同时移动多个破折号(统一随机选择)。



Live On Coliru 并生成1,000,000个给定样本的随机转置在大约2673ms(在我的机器上为1156ms)中随机地移动块的数量(1..3):

  gen(test_case); 

std :: string result;
std :: map< std :: string,size_t> histo;

for(int i = 0; i <1000000; ++ i){
auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes

result.clear();
gen.apply_relocations(mobility,std :: back_inserter(result));

histo [result] ++;
}




基准时间包括构建生成的唯一结果的直方图所花费的时间


让我们在这里做一个代码演讲来解释一下:


  1. 我试图使用可读类型:

      namespace icl = boost :: icl; 

    using Position = size_t;
    using Map = icl :: interval_set< Position> ;;
    using Region = Map :: value_type;

    构建破折号 Map 的函数简单是:

      template< ; typename It> Map region_map(It f,It l){
    Map dashes;

    for(Position pos = 0; f!= l; ++ f,++ pos)
    if(' - '== * f)
    dashes + = pos ;

    返回破折号;
    }




    我没有特别优化这个。我让interval_set组合相邻的破折号。我们可以使用 hinted插入,或者一个解析器,将连续的短划线添加为块。我在这里选择了KISS。



  2. 稍后,我们生成重定位,映射 / code>到其余文本中的不移动位置

      using Relocs = boost :: container :: flat_multimap< Position,Region> ;; 

    通过使用平面多重映射,调用者的条目已按升序插入点排序。因为我们使用 reserve() -ed前期平面多重映射,我们避免了基于节点的映射实现的开销。


  3. 我们首先选择要移动的破折号块:

      ){
    地图地图;
    if(!_dashes.empty())
    for(int i = 0; i map + = * std :: next(_dashes.begin ),_select(_rng));
    return map;
    }

    随机分布已在建设时确定尺寸,例如:

      _dashes(region_map(_input.begin(),_input.end())),
    _rng(std :: random_device {} ),
    _select(0,_dashes.iterative_size() - 1),
    _randpos(0,_input.size() - 1),
    pre>

  4. 接下来,我们为每个分配插入位置。诀窍是在源的非移动(惰性)区域内分配位置。




    • 这包括在其位置保留的其他区块。

    • 我们在构造函数中检测到这种情况:

        _is_degenerate(cardinality(_dashes)== _input .size())




    如下:

     重定位gen_relocations(int n = 1){
    Map const moving = pick_dashes(n);

    重定位relocs;
    relocs.reserve(moving.iterative_size());

    if(_is_degenerate)
    {
    //退化情况(一切都是破折号);没有不移动的位置
    //存在,只需选择0
    (auto& region:moving)
    relocs.emplace(0,region);
    } else {
    auto inertia = [&] {
    Position inert_point;
    while(contains(moving,inert_point = _randpos(_rng)))
    ; //丢弃正在移动的插入点
    return inert_point;
    };

    for(auto& region:moving)
    relocs.emplace(inertia(),region);
    }

    return relocs;
    }



    现在我们需要做的就是应用重定位。

    / li>
  5. 这种方法的通用实现非常简单。同样,为了保持简单(KISS),特别优化:

      template<类型名F> 
    void do_apply_relocations(Relocs const& mobility,F const& apply)const {
    icl :: split_interval_set< Position& steps {{0,_input.size()}};

    for(auto& m:mobility){
    steps + = m.first; // force a split on insertion point
    steps - = m.second; // remove the source of the move
    // std :: cout<< m.second < 移动到#< m.first<< :<<步骤< \\\
    ;
    }

    auto next_mover = mobility.begin();

    for(auto& step:steps){
    while(next_mover!= mobility.end()&& apply((next_mover ++) - > second,true);

    apply(step,false);
    }
    }




    这里的诀窍是,我们滥用 split_interval_set 组合策略将处理分解为在随机生成的插入点停止的子范围:这些人工区域是我们生成循环中的步骤。



  6. c> apply 函数有我们实现得到我们想要的。在我们的例子中,我们需要一个字符串 A - TG-DFGR(----) - ,所以我们写一个重载到一个容器(例如 std :: string )使用任何输出迭代器:

      template< typename Out> 
    out apply_relocations(Relocs const& mobility,Out out)const {
    if(_is_degenerate)
    return std :: copy(_input.begin(),_input.end(),out);

    auto to_iter = [this](Position pos){return _input.begin()+ pos; };

    do_apply_relocations(mobile,[&](Region const& r,bool relocated){
    if(relocated)* out ++ ='(';
    out = std ::复制(
    to_iter(first(r)),
    to_iter(last(r)+1),
    out
    );
    if(relocated)* out ++ = ')';
    });

    return out;
    }




    复杂部分在这里将 Position 映射到输入迭代器( to_iter ),并且代码可选地添加()







完整列表



  #include< boost / container / flat_map.hpp> 
#include< boost / icl / interval_set.hpp>
#include< boost / icl / split_interval_set.hpp>
#include< boost / icl / separate_interval_set.hpp>
#include< boost / lexical_cast.hpp>
#include< boost / range / algorithm.hpp>
#include< iomanip>
#include< iostream>
#include< random>
#include< map>
#include< chrono>

命名空间icl = boost :: icl;

using Position = size_t;
using Map = icl :: interval_set< Position> ;;
using Region = Map :: value_type;
using Relocs = boost :: container :: flat_multimap< Position,Region> ;;

struct Generator {
Generator(std :: string const& input)
:_input(input),
_dashes(region_map(_input.begin(),_input .end())),
_rng(std :: random_device {}()),
_select(0,_dashes.iterative_size() - 1),
_randpos(0,_input。 size() - 1),
_is_degenerate(cardinality(_dashes)== _input.size())
{
}

无符号随机
return _rng()%below; // q& d,只在这里为固定的种子确定性测试
}

映射full()const {
return Map {{0,_input.size )}};
}

重定位gen_relocations(int n = 1){
Map const moving = pick_dashes(n);

重定位relocs;
relocs.reserve(moving.iterative_size());

if(_is_degenerate)
{
//退化情况(一切都是破折号);没有不移动的位置
//存在,只需选择0
(auto& region:moving)
relocs.emplace(0,region);
} else {
auto inertia = [&] {
Position inert_point;
while(contains(moves,inert_point = _randpos(_rng)))
; //丢弃正在移动的插入点
return inert_point;
};

for(auto& region:moving)
relocs.emplace(inertia(),region);
}

return relocs;
}

template< typename Out>
out apply_relocations(Relocs const& mobility,Out out)const {
if(_is_degenerate)
return std :: copy(_input.begin(),_input.end(),out);

auto to_iter = [this](Position pos){return _input.begin()+ pos; };

do_apply_relocations(mobile,[&](Region const& r,bool relocated){
if(relocated)* out ++ ='(';
out = std ::复制(
to_iter(first(r)),
to_iter(last(r)+1),
out
);
if(relocated)* out ++ = ')';
});

return out;
}

template< typename F>
void do_apply_relocations(Relocs const& mobility,F const& apply)const {
icl :: split_interval_set< Position> steps {{0,_input.size()}};

for(auto& m:mobility){
steps + = m.first; // force a split on insertion point
steps - = m.second; // remove the source of the move
// std :: cout<< m.second < 移动到#< m.first<< :<<步骤< \\\
;
}

auto next_mover = mobility.begin();

for(auto& step:steps){
while(next_mover!= mobility.end()&& apply((next_mover ++) - > second,true);

apply(step,false);
}
}

private:
std :: string _input;
Map _dash;
std :: mt19937 _rng;
std :: uniform_int_distribution< Position> _选择;
std :: uniform_int_distribution< Position> _randpos;
bool _is_degenerate;

Map pick_dashes(int n){
Map map;
if(!_dashes.empty())
for(int i = 0; i map + = * std :: next(_dashes.begin ),_select(_rng));
return map;
}

template< typename it> map region_map(It f,It l){
Map dashes;

for(Position pos = 0; f!= l; ++ f,++ pos)
if(' - '== * f)
dashes + = pos ;

返回破折号;
}
};

int main(){

for(std :: string test_case:{
----,
A - TG -DF ---- GR--,

ATGDFGR,
})
{
auto start = std :: chrono :: high_resolution_clock :: now();
Generator gen(test_case);

std :: string result;
std :: map< std :: string,size_t> histo;

for(int i = 0; i <1000000; ++ i){
auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes

result.clear();
gen.apply_relocations(mobility,std :: back_inserter(result));

histo [result] ++;
}
std :: cout<< histo.size()< unique results for'< test_case<< '
<< in< std :: chrono :: duration_cast< std :: chrono :: milliseconds>(std :: chrono :: high_resolution_clock :: now() - start).count()< ms\\\
;

std :: multimap< size_t,std :: string,std :: greater< size_t> >排名
for(auto& entry:histo)
ranked.emplace(entry.second,entry.first);

int topN = 10;
for(auto& rank:ranked)
{
std :: cout< std :: setw(8)<< std :: right<< rank.first<< :<< rank.second<< \\\
;
if(0 == --topN)
break;
}
}
}

打印

 对于'----'在186ms中有1个唯一结果
1000000:----
3430个唯一结果'A - TG-DF ---- GR--'in 1156ms
9251:A(----) - TG-DFGR--
9226: --TG-DFGR--
9191:A - T(----)G-DFGR--
9150:A - TG-DFGR - (----) -
9132:A - (----)TG-DFGR--
9128:A - TG(----) - DFGR--
9109:A - TG-D (----)FGR--
9098:A - TG-DFG(----)R--
9079:A - TG-DFGR
9047:A - (----) - TG-DFGR--
在25ms中为''的1个唯一结果
1000000:
'ATGDFGR'的唯一结果77ms
1000000:ATGDFGR


Given a string containing a number of characters interspersed with dashes, for example string s = "A--TG-DF----GR--";, I wish to randomly select a block of dashes (could be of size 1, 2, …, max number of consecutive dashes in string), and copy them over to another part of the string chosen at random.

For example, A--TG-DF(---)-GR-- gets moved to A--T(---)G-DF-GR-- while another iteration may give A--TG-DF----GR(--) gets moved to A--TG-(--)DF----GR.

I'm generating random indices of the string through int i = rand() % (int) s.length();. Insertion happens through s.insert(rand() % (int) s.length(), substr);, where substr is the block of dashes.

My main problem lies in finding randomly a continuous group of dashes. I thought of using s.find("-"), but that'd only return the first instance of a single dash, and not a random position of a collection of dashes.

解决方案

I know this problem is likely steeped in XY problems, but I found it a nice challenge none-the-less, so I thought about implementing it with the Boost Interval Container library.

The beauty of the library is that you can forget about a lot of details, while you don't sacrifice a lot of performance.

I've taken the liberty to generalize it, so that it is capable of moving multiple blocks of dashes (uniform randomly selected) simultaneously.

The solution runs Live On Coliru and generates 1,000,000 random transpositions of the given sample with randomly varied numbers of moved blocks (1..3) in about 2673 ms (1156 ms on my machine):

Generator gen(test_case);

std::string result;
std::map<std::string, size_t> histo;

for(int i = 0; i < 1000000; ++i) {
    auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes

    result.clear();
    gen.apply_relocations(mobility, std::back_inserter(result));

    histo[result]++;
}

Note: the benchmark times include the time taken to build the histogram of unique results generated

Let's do a code walk-through here to explain things:

  1. I tried to use "readable" types:

    namespace icl = boost::icl;
    
    using Position = size_t;
    using Map      = icl::interval_set<Position>;
    using Region   = Map::value_type;
    

    E.g. the function that builds the Map of dashes is simply:

    template <typename It> Map region_map(It f, It l) {
        Map dashes;
    
        for (Position pos = 0; f!=l; ++f, ++pos)
            if ('-' == *f)
                dashes += pos;
    
        return dashes;
    }
    

    Note how I didn't particularly optimize this. I let the interval_set combine adjacent dashes. We might use hinted insertion, or a parser that add consecutive dashes as a block. I opted for KISS here.

  2. Later on, we generate relocations, which map a Region onto a non-moving Position in the rest of the text.

    using Relocs   = boost::container::flat_multimap<Position, Region>;
    

    By using the flat multimap, the caller has the entries already sorted by ascending insertion point. Because we use a reserve()-ed up-front flat multimap, we avoid the overhead of a node based implementation of map here.

  3. We start by picking the dash-blocks to be moved:

    Map pick_dashes(int n) {
        Map map;
        if (!_dashes.empty())
            for (int i = 0; i < n; ++i)
                map += *std::next(_dashes.begin(), _select(_rng));
        return map;
    }
    

    The random distribution have been dimensioned at construction, e.g.:

      _dashes(region_map(_input.begin(), _input.end())),
      _rng(std::random_device {}()),
      _select (0, _dashes.iterative_size() - 1),
      _randpos(0, _input.size() - 1),
    

  4. Next, we assign insertion-positions to each. The trick is to assign positions inside non-mobile (inert) regions of the source.

    • this includes other blocks of dashes that stay in their place
    • there is the degenerate case where everything is a block of dashes, we detected this in the constructor:

        _is_degenerate(cardinality(_dashes) == _input.size())
      

    So the code reads as follows:

    Relocs gen_relocations(int n=1) {
        Map const moving = pick_dashes(n);
    
        Relocs relocs;
        relocs.reserve(moving.iterative_size());
    
        if (_is_degenerate)
        {
            // degenerate case (everything is a dash); no non-moving positions
            // exist, just pick 0
            for(auto& region : moving)
                relocs.emplace(0, region);
        } else {
            auto inertia = [&] {
                Position inert_point;
                while (contains(moving, inert_point = _randpos(_rng)))
                    ; // discard insertion points that are moving
                return inert_point;
            };
    
            for(auto& region : moving)
                relocs.emplace(inertia(), region);
        }
    
        return relocs;
    }
    

    Now all we need to do is apply the relocations.

  5. The generic implementation of this is pretty straightforward. Again, it's not particularly optimized in order to keep it simple (KISS):

    template <typename F>
        void do_apply_relocations(Relocs const& mobility, F const& apply) const {
            icl::split_interval_set<Position> steps {{0, _input.size()}};
    
            for (auto& m : mobility) {
                steps += m.first; // force a split on insertion point
                steps -= m.second; // remove the source of the move
                //std::cout << m.second << " moving to #" << m.first << ": " << steps << "\n";
            }
    
            auto next_mover = mobility.begin();
    
            for(auto& step : steps) {
                while (next_mover!=mobility.end() && contains(step, next_mover->first))
                    apply((next_mover++)->second, true);
    
                apply(step, false);
            }
        }
    

    Note The trick here is that we "abuse" the split_interval_set combining strategy to break the processing into sub-ranges that "stop" at the randomly generated insertion points: these artificial regions are the "steps" in our generation loop.

  6. The apply function there is what we implement to get what we want. In our case we want a string like A--TG-DFGR(----)-- so we write an overload that appends to a container (e.g. std::string) using any output iterator:

    template <typename Out>
        Out apply_relocations(Relocs const& mobility, Out out) const {
            if (_is_degenerate)
                return std::copy(_input.begin(), _input.end(), out);
    
            auto to_iter = [this](Position pos) { return _input.begin() + pos; };
    
            do_apply_relocations(mobility, [&](Region const& r, bool relocated) {
                if (relocated) *out++ = '(';
                out = std::copy(
                    to_iter(first(r)),
                    to_iter(last(r)+1),
                    out
                );
                if (relocated) *out++ = ')';
            });
    
            return out;
        }
    

    Note The "complicated" part here are mapping the Position to input iterators (to_iter) and the code to optionally add () around moved blocks.

And with that, we have seen all the code.

Full Listing

#include <boost/container/flat_map.hpp>
#include <boost/icl/interval_set.hpp>
#include <boost/icl/split_interval_set.hpp>
#include <boost/icl/separate_interval_set.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/range/algorithm.hpp>
#include <iomanip>
#include <iostream>
#include <random>
#include <map>
#include <chrono>

namespace icl = boost::icl;

using Position = size_t;
using Map      = icl::interval_set<Position>;
using Region   = Map::value_type;
using Relocs   = boost::container::flat_multimap<Position, Region>;

struct Generator {
    Generator(std::string const& input) 
        : _input(input),
          _dashes(region_map(_input.begin(), _input.end())),
          _rng(std::random_device {}()),
          _select (0, _dashes.iterative_size() - 1),
          _randpos(0, _input.size() - 1),
          _is_degenerate(cardinality(_dashes) == _input.size())
    {
    }

    unsigned random(unsigned below) {
        return _rng() % below; // q&d, only here to make the tests deterministic for a fixed seed
    }

    Map full() const { 
        return Map { { 0, _input.size() } };
    }

    Relocs gen_relocations(int n=1) {
        Map const moving = pick_dashes(n);

        Relocs relocs;
        relocs.reserve(moving.iterative_size());

        if (_is_degenerate)
        {
            // degenerate case (everything is a dash); no non-moving positions
            // exist, just pick 0
            for(auto& region : moving)
                relocs.emplace(0, region);
        } else {
            auto inertia = [&] {
                Position inert_point;
                while (contains(moving, inert_point = _randpos(_rng)))
                    ; // discard insertion points that are moving
                return inert_point;
            };

            for(auto& region : moving)
                relocs.emplace(inertia(), region);
        }

        return relocs;
    }

    template <typename Out>
        Out apply_relocations(Relocs const& mobility, Out out) const {
            if (_is_degenerate)
                return std::copy(_input.begin(), _input.end(), out);

            auto to_iter = [this](Position pos) { return _input.begin() + pos; };

            do_apply_relocations(mobility, [&](Region const& r, bool relocated) {
                if (relocated) *out++ = '(';
                out = std::copy(
                    to_iter(first(r)),
                    to_iter(last(r)+1),
                    out
                );
                if (relocated) *out++ = ')';
            });

            return out;
        }

    template <typename F>
        void do_apply_relocations(Relocs const& mobility, F const& apply) const {
            icl::split_interval_set<Position> steps {{0, _input.size()}};

            for (auto& m : mobility) {
                steps += m.first; // force a split on insertion point
                steps -= m.second; // remove the source of the move
                //std::cout << m.second << " moving to #" << m.first << ": " << steps << "\n";
            }

            auto next_mover = mobility.begin();

            for(auto& step : steps) {
                while (next_mover!=mobility.end() && contains(step, next_mover->first))
                    apply((next_mover++)->second, true);

                apply(step, false);
            }
        }

  private:
    std::string                             _input;
    Map                                     _dashes;
    std::mt19937                            _rng;
    std::uniform_int_distribution<Position> _select;
    std::uniform_int_distribution<Position> _randpos;
    bool                                    _is_degenerate;

    Map pick_dashes(int n) {
        Map map;
        if (!_dashes.empty())
            for (int i = 0; i < n; ++i)
                map += *std::next(_dashes.begin(), _select(_rng));
        return map;
    }

    template <typename It> Map region_map(It f, It l) {
        Map dashes;

        for (Position pos = 0; f!=l; ++f, ++pos)
            if ('-' == *f)
                dashes += pos;

        return dashes;
    }
};

int main() {

    for (std::string test_case : {
            "----",
            "A--TG-DF----GR--",
            "",
            "ATGDFGR",
        })
    {
        auto start = std::chrono::high_resolution_clock::now();
        Generator gen(test_case);

        std::string result;
        std::map<std::string, size_t> histo;

        for(int i = 0; i < 1000000; ++i) {
            auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes

            result.clear();
            gen.apply_relocations(mobility, std::back_inserter(result));

            histo[result]++;
        }
        std::cout << histo.size() << " unique results for '" << test_case << "'"
                  << " in " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()-start).count() << "ms\n";

        std::multimap<size_t, std::string, std::greater<size_t> > ranked;
        for (auto& entry : histo)
            ranked.emplace(entry.second, entry.first);

        int topN = 10;
        for (auto& rank : ranked)
        {
            std::cout << std::setw(8) << std::right << rank.first << ": " << rank.second << "\n";
            if (0 == --topN)
                break;
        }
    }
}

Prints e.g.

1 unique results for '----' in 186ms
 1000000: ----
3430 unique results for 'A--TG-DF----GR--' in 1156ms
    9251: A(----)--TG-DFGR--
    9226: (----)A--TG-DFGR--
    9191: A--T(----)G-DFGR--
    9150: A--TG-DFGR-(----)-
    9132: A--(----)TG-DFGR--
    9128: A--TG(----)-DFGR--
    9109: A--TG-D(----)FGR--
    9098: A--TG-DFG(----)R--
    9079: A--TG-DFGR(----)--
    9047: A-(----)-TG-DFGR--
1 unique results for '' in 25ms
 1000000: 
1 unique results for 'ATGDFGR' in 77ms
 1000000: ATGDFGR

这篇关于从字符串中随机选择特定子序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆