The speed of mongoimport while using -jsonArray is very slow


Question

I have a 15 GB file with more than 25 million rows in the following JSON format (which mongodb accepts for importing):

[
    {"_id": 1, "value": "\u041c\..."}
    {"_id": 2, "value": "\u041d\..."}
    ...
]

When I try to import it into mongodb with the following command, I get a speed of only 50 rows per second, which is really slow for me.

mongoimport --db wordbase --collection sentences --type json --file C:\Users\Aleksandar\PycharmProjects\NLPSeminarska\my_file.json -jsonArray

When I tried to insert the data into the collection using Python with pymongo, the speed was even worse. I also tried raising the priority of the process, but it didn't make any difference.

The next thing I tried was the same import but without -jsonArray, and although I got a big speed increase (~4000/sec), it said that the BSON representation of the supplied JSON is too large.

I also tried splitting the file into 5 separate files and importing them from separate consoles into the same collection, but then the speed of all of them dropped to about 20 documents/sec.

While searching all over the web I saw that people reach speeds of over 8K documents/sec, and I can't see what I'm doing wrong.

Is there a way to speed this up, or should I convert the whole JSON file to BSON and import it that way? If so, what is the correct way to do both the conversion and the import?

Thanks a lot.

Answer

I had the exact same problem with a 160 GB dump file. It took me two days to load 3% of the original file with -jsonArray, and 15 minutes with these changes.

First, remove the initial [ and trailing ] characters:

sed 's/^\[//; s/\]$//' -i filename.json

Then import without the -jsonArray option:

mongoimport --db "dbname" --collection "collectionname" --file filename.json
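
This works because, without the -jsonArray option, mongoimport reads newline-delimited JSON, one document per line. Assuming each document already sits on its own line (as in the snippet in the question), the file is already in that shape once the brackets are stripped:

{"_id": 1, "value": "\u041c\..."}
{"_id": 2, "value": "\u041d\..."}
...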

If the file is huge, sed will take a really long time and you might run into storage problems. You can use this C program instead (not written by me, all glory to @guillermobox):

/* Strip the leading '[' and trailing ']' of a JSON array file, in place. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    FILE * f;
    const size_t buffersize = 2048;
    size_t length, filesize, position;
    char buffer[buffersize + 1];

    if (argc < 2) {
        fprintf(stderr, "Please provide file to mongofix!\n");
        exit(EXIT_FAILURE);
    }

    f = fopen(argv[1], "r+");
    if (f == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    /* get the full filesize */
    fseek(f, 0, SEEK_END);
    filesize = ftell(f);

    /* Ignore the first character */
    fseek(f, 1, SEEK_SET);

    while (1) {
        /* read chunks of buffersize size */
        length = fread(buffer, 1, buffersize, f);
        position = ftell(f);

        /* write the same chunk, one character before */
        fseek(f, position - length - 1, SEEK_SET);
        fwrite(buffer, 1, length, f);

        /* return to the reading position */
        fseek(f, position, SEEK_SET);

        /* we have finished when not all the buffer is read */
        if (length != buffersize)
            break;
    }

    /* truncate the file, with two less characters */
    ftruncate(fileno(f), filesize - 2);

    fclose(f);

    return 0;
}
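
Assuming the program is saved as mongofix.c (the file name is only for illustration), it can be compiled and run on the dump before importing:

gcc -o mongofix mongofix.c
./mongofix filename.json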

P.S.: I don't have the power to suggest a migration of this question but I think this could be helpful.
