C++: Join two pipe-delimited files on key fields


Problem description

I am currently trying to create a C++ function that joins two pipe-delimited files, each with over 10,000,000 records, on one or two key fields.



    P2347|John Doe|C1234
    P7634|Peter Parker|D2344
    P522|Toni Stark|T288



    P2347|Bruce Wayne|C1234
    P1111|Captain America|D534
    P522|Terminator|T288

To join on fields 1 and 3, the expected output should be:



    P2347|C1234|John Doe|Bruce Wayne
    P522|T288|Toni Stark|Terminator

What I am currently thinking about is using a set/array/vector to read in the files and create something like:



    P2347|C1234>>John Doe
    P522|T288>>Toni Stark



    P2347|C1234>>Bruce Wayne
    P522|T288>>Terminator

Then split off the first part, use it as the key, and match it against the second set/vector/array.
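The key/value layout described above could be produced with a small helper along these lines. This is only a sketch: the name `makeKeyValue` is hypothetical, and it assumes every record has exactly three fields, with the key in fields 1 and 3 and the value in field 2.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: turns "P2347|John Doe|C1234" into
// {"P2347|C1234", "John Doe"}. Assumes exactly three fields, with the
// key in fields 1 and 3 and the value in field 2.
// Returns false if the line does not contain two '|' separators.
bool makeKeyValue(const std::string& line,
                  std::pair<std::string, std::string>& out) {
    const std::size_t first = line.find('|');
    if (first == std::string::npos) return false;
    const std::size_t second = line.find('|', first + 1);
    if (second == std::string::npos) return false;

    out.first = line.substr(0, first);                        // field 1
    out.first += '|';
    out.first += line.substr(second + 1);                     // field 3
    out.second = line.substr(first + 1, second - first - 1);  // field 2
    return true;
}
```

Using `std::string::find` directly avoids constructing a stream for every line, which may matter at this record count.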

What I currently have: read the first file into a set, then match the second file against the set line by line. It takes the whole line and matches it:



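    #include <iostream>
    #include <fstream>
    #include <string>
    #include <set>
    #include <ctime>
    using namespace std;

    int main()
    {
        clock_t startTime = clock();

        // Load every line of the first file into an ordered set.
        ifstream inf("test.txt");
        set<string> lines;
        string line;
        while (getline(inf, line))
            lines.insert(line);

        ifstream inf2("test2.txt");

        clock_t midTime = clock();

        // Write each line of the second file that also occurs,
        // as a whole line, in the first file.
        ofstream outputFile("output.txt");
        while (getline(inf2, line))
        {
            if (lines.find(line) != lines.end())
                outputFile << line << "\n";
        }

        // Report CPU time for the load phase and the match phase.
        clock_t endTime = clock();
        cout << "load: " << double(midTime - startTime) / CLOCKS_PER_SEC << " s, "
             << "match: " << double(endTime - midTime) / CLOCKS_PER_SEC << " s" << endl;
        return 0;
    }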



I would be grateful for any suggestions. I am also happy to change the whole concept if there is a better (faster) way. Speed is critical, as there might be even more than 10 million records.

Edit: Another idea would be to use a map, with the key fields as the map key, but this might be a little slower. Any suggestions?

Any help is much appreciated.

Recommended answer

I tried multiple ways to get this task done; none of them has been efficient so far:

Read everything into a set and parse the key fields into the format keys >> values, so that the set simulates an associative array. Parsing took a long time, but memory usage stays relatively low. Not fully developed code:



    #include <ctime>
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // Split s on delim and append the fields to elems.
    std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
        std::stringstream ss(s);
        std::string item;
        while (std::getline(ss, item, delim)) {
            elems.push_back(item);
        }
        return elems;
    }

    std::vector<std::string> split(const std::string &s, char delim) {
        std::vector<std::string> elems;
        split(s, delim, elems);
        return elems;
    }

    // Return the field at the given position (assumed to be 1-based),
    // or an empty string if the record has no such field.
    std::string getSelectedRecords(const std::string &record, int position) {
        std::vector<std::string> tokens = split(record, ' ');
        if (position < 1 || static_cast<std::size_t>(position) > tokens.size())
            return std::string();
        return tokens[position - 1];
    }

    int main()
    {
        std::clock_t startTime = std::clock();

        std::ifstream secondaryFile("C:/Users/Batman/Desktop/test/secondary.txt");
        std::set<std::string> secondarySet;
        std::string record;

        // Store each secondary record as "keys>>>values".
        while (std::getline(secondaryFile, record)) {
            std::string keys = getSelectedRecords(record, 2);
            std::string values = getSelectedRecords(record, 1);
            secondarySet.insert(keys + ">>>" + values);
        }

        std::clock_t midTime = std::clock();

        std::ifstream primaryFile("C:/Users/Batman/Desktop/test/primary.txt");
        std::ofstream outputFile("C:/Users/Batman/Desktop/test/output.txt");

        while (std::getline(primaryFile, record))
        {
            // TODO: rewrite find() to go through the set, match on the key part
            // (everything before ">>>"), and output the values. As written, the
            // bare key is compared with the stored "keys>>>values" strings.
            std::string keys = getSelectedRecords(record, 2);

            if (secondarySet.find(keys) != secondarySet.end())
                outputFile << record << "\n";
        }

        // Report CPU time for the load phase and the match phase.
        std::clock_t endTime = std::clock();
        std::cout << "load: " << double(midTime - startTime) / CLOCKS_PER_SEC << " s, "
                  << "match: " << double(endTime - midTime) / CLOCKS_PER_SEC << " s" << std::endl;
        return 0;
    }

It currently splits on spaces instead of pipes, but that should not be a problem. Reading the data is very quick, but parsing it takes an awful lot of time.
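One possible reason for the slow parsing is that a `std::stringstream` and a fresh vector are constructed for every record. A lighter-weight split, sketched below with the hypothetical name `splitFields`, scans the line with `std::string::find` and reuses a caller-supplied vector. Whether this removes the bottleneck would have to be confirmed by profiling.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split `line` on `delim` into `fields`, reusing the vector's capacity.
// Empty fields are kept, so "a||b" yields {"a", "", "b"}.
void splitFields(const std::string& line, char delim,
                 std::vector<std::string>& fields) {
    fields.clear();
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string::npos) {
            // Last field: everything from `start` to the end of the line.
            fields.emplace_back(line, start, std::string::npos);
            return;
        }
        fields.emplace_back(line, start, pos - start);
        start = pos + 1;
    }
}
```

Declaring the vector once outside the read loop and passing it in for every record avoids reallocating it per line.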

The other option was to use a multimap. It is a similar concept, with the key fields pointing to values, but this one is very slow and memory-intensive.
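If duplicate keys have to be supported, a hashed variant of the same idea is `std::unordered_multimap`, which offers average constant-time lookup instead of the logarithmic lookup of `std::multimap`. The sketch below shows how all values for one key can be collected with `equal_range`. The alias `Index` and the function `valuesFor` are hypothetical names, and memory usage is not addressed here.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using Index = std::unordered_multimap<std::string, std::string>;

// Collect every value stored under `key`.
// The order of the returned values is unspecified.
std::vector<std::string> valuesFor(const Index& index, const std::string& key) {
    std::vector<std::string> result;
    const auto range = index.equal_range(key);
    for (auto it = range.first; it != range.second; ++it)
        result.push_back(it->second);
    return result;
}
```

Calling `index.reserve(n)` before loading, with `n` set to the expected number of records, may avoid repeated rehashing while the index is built.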



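    #include <ctime>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main()
    {
        std::clock_t startTime = std::clock();

        std::ifstream inf("C:/Users/Batman/Desktop/test/test.txt");
        typedef std::multimap<std::string, std::string> Map;
        Map map;

        std::string line;

        while (std::getline(inf, line)) {
            // load whitespace-separated tokens into a vector
            std::istringstream buffer(line);
            std::istream_iterator<std::string> beg(buffer), end;
            std::vector<std::string> tokens(beg, end);
            // Assumption: the first token is the key and the last token is the
            // value; the original statements at this point did not survive.
            if (tokens.size() >= 2)
                map.insert(Map::value_type(tokens.front(), tokens.back()));
        }

        std::clock_t midTime = std::clock();

        // Assumption: both file names below are placeholders; the original
        // lines that opened these files did not survive.
        std::ifstream inf2("C:/Users/Batman/Desktop/test/test2.txt");
        std::ofstream outputFile("C:/Users/Batman/Desktop/test/output.txt");

        while (std::getline(inf2, line)) {
            std::istringstream buffer(line);
            std::string key;
            if (!(buffer >> key))
                continue;
            // write every value stored under this key
            std::pair<Map::iterator, Map::iterator> range = map.equal_range(key);
            for (Map::iterator it = range.first; it != range.second; ++it)
                outputFile << it->first << ">>" << it->second << "\n";
        }

        // Report CPU time for the load phase and the match phase.
        std::clock_t endTime = std::clock();
        std::cout << "load: " << double(midTime - startTime) / CLOCKS_PER_SEC << " s, "
                  << "match: " << double(endTime - midTime) / CLOCKS_PER_SEC << " s" << std::endl;
        return 0;
    }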

Further thoughts: split the pipe-delimited files into separate files with one column each, right when importing the data. That way I would not have to parse anything and could read in each column individually.

Edit: I optimized the first example with a recursive split function. It still takes more than 30 seconds for 100,000 records. I would like it to be faster, and the actual find() function is still missing.

Any thoughts? Thanks!
