PostgreSQL/jOOQ bulk insertion performance issues when loading from CSV; how do I improve the process?

Question

For this project, I intend to make a web version and am right now working on making a PostgreSQL (9.x) backend from which the webapp will query.

Right now, what happens is that the tracer generates a zip file with two CSVs in it, which is loaded at runtime into an H2 database whose schema is this (and yes, I'm aware that the SQL could be written a little better):

create table matchers (
    id integer not null,
    class_name varchar(255) not null,
    matcher_type varchar(30) not null,
    name varchar(1024) not null
);

alter table matchers add primary key(id);

create table nodes (
    id integer not null,
    parent_id integer not null,
    level integer not null,
    success integer not null,
    matcher_id integer not null,
    start_index integer not null,
    end_index integer not null,
    time bigint not null
);

alter table nodes add primary key(id);
alter table nodes add foreign key (matcher_id) references matchers(id);
create index nodes_parent_id on nodes(parent_id);
create index nodes_indices on nodes(start_index, end_index);

Now, since the PostgreSQL database will be able to handle more than one trace, I had to add a further table; the schema on the PostgreSQL backend looks like this (again, the SQL could be written a little better; also, in the parse_info table, the content column contains the full text of the parsed file, which is stored separately in the zip file):

create table parse_info (
    id uuid primary key,
    date timestamp not null,
    content text not null
);

create table matchers (
    parse_info_id uuid references parse_info(id),
    id integer not null,
    class_name varchar(255) not null,
    matcher_type varchar(30) not null,
    name varchar(1024) not null,
    unique (parse_info_id, id)
);

create table nodes (
    parse_info_id uuid references parse_info(id),
    id integer not null,
    parent_id integer not null,
    level integer not null,
    success integer not null,
    matcher_id integer not null,
    start_index integer not null,
    end_index integer not null,
    time bigint not null,
    unique (parse_info_id, id)
);

alter table nodes add foreign key (parse_info_id, matcher_id)
    references matchers(parse_info_id, id);
create index nodes_parent_id on nodes(parent_id);
create index nodes_indices on nodes(start_index, end_index);

Now, what I am currently doing is taking existing zip files and inserting them into a PostgreSQL database; I'm using jOOQ and its CSV loading API.

The process is a little complicated... Here are the current steps:


  • a UUID is generated;
  • I read the necessary info from the zip (parse date, input text) and write the record into the parse_info table;
  • I create temporary copies of the CSVs so that the jOOQ loading API is able to use them (see after the code extract as to why);
  • I insert all matchers, then all nodes.

Here is the code:

public final class Zip2Db2
{
    private static final Pattern SEMICOLON = Pattern.compile(";");
    private static final Function<String, String> CSV_ESCAPE
        = TraceCsvEscaper.ESCAPER::apply;

    // Paths in the zip to the different components
    private static final String INFO_PATH = "/info.csv";
    private static final String INPUT_PATH = "/input.txt";
    private static final String MATCHERS_PATH = "/matchers.csv";
    private static final String NODES_PATH = "/nodes.csv";

    // Fields to use for matchers zip insertion
    private static final List<Field<?>> MATCHERS_FIELDS = Arrays.asList(
        MATCHERS.PARSE_INFO_ID, MATCHERS.ID, MATCHERS.CLASS_NAME,
        MATCHERS.MATCHER_TYPE, MATCHERS.NAME
    );

    // Fields to use for nodes zip insertion
    private static final List<Field<?>> NODES_FIELDS = Arrays.asList(
        NODES.PARSE_INFO_ID, NODES.PARENT_ID, NODES.ID, NODES.LEVEL,
        NODES.SUCCESS, NODES.MATCHER_ID, NODES.START_INDEX, NODES.END_INDEX,
        NODES.TIME
    );

    private final FileSystem fs;
    private final DSLContext jooq;
    private final UUID uuid;

    private final Path tmpdir;

    public Zip2Db2(final FileSystem fs, final DSLContext jooq, final UUID uuid)
        throws IOException
    {
        this.fs = fs;
        this.jooq = jooq;
        this.uuid = uuid;

        tmpdir = Files.createTempDirectory("zip2db");
    }

    public void removeTmpdir()
        throws IOException
    {
        // From java7-fs-more (https://github.com/fge/java7-fs-more)
        MoreFiles.deleteRecursive(tmpdir, RecursionMode.KEEP_GOING);
    }

    public void run()
    {
        time(this::generateMatchersCsv, "Generate matchers CSV");
        time(this::generateNodesCsv, "Generate nodes CSV");
        time(this::writeInfo, "Write info record");
        time(this::writeMatchers, "Write matchers");
        time(this::writeNodes, "Write nodes");
    }

    private void generateMatchersCsv()
        throws IOException
    {
        final Path src = fs.getPath(MATCHERS_PATH);
        final Path dst = tmpdir.resolve("matchers.csv");

        try (
            final Stream<String> lines = Files.lines(src);
            final BufferedWriter writer = Files.newBufferedWriter(dst,
                StandardOpenOption.CREATE_NEW);
        ) {
            // Throwing below is from throwing-lambdas
            // (https://github.com/fge/throwing-lambdas)
            lines.map(this::toMatchersLine)
                .forEach(Throwing.consumer(writer::write));
        }
    }

    private String toMatchersLine(final String input)
    {
        final List<String> parts = new ArrayList<>();
        parts.add('"' + uuid.toString() + '"');
        Arrays.stream(SEMICOLON.split(input, 4))
            .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
            .forEach(parts::add);
        return String.join(";", parts) + '\n';
    }

    private void generateNodesCsv()
        throws IOException
    {
        final Path src = fs.getPath(NODES_PATH);
        final Path dst = tmpdir.resolve("nodes.csv");

        try (
            final Stream<String> lines = Files.lines(src);
            final BufferedWriter writer = Files.newBufferedWriter(dst,
                StandardOpenOption.CREATE_NEW);
        ) {
            lines.map(this::toNodesLine)
                .forEach(Throwing.consumer(writer::write));
        }
    }

    private String toNodesLine(final String input)
    {
        final List<String> parts = new ArrayList<>();
        parts.add('"' + uuid.toString() + '"');
        SEMICOLON.splitAsStream(input)
            .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
            .forEach(parts::add);
        return String.join(";", parts) + '\n';
    }

    private void writeInfo()
        throws IOException
    {
        final Path path = fs.getPath(INFO_PATH);

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            final String[] elements = SEMICOLON.split(reader.readLine());

            final long epoch = Long.parseLong(elements[0]);
            final Instant instant = Instant.ofEpochMilli(epoch);
            final ZoneId zone = ZoneId.systemDefault();
            final LocalDateTime time = LocalDateTime.ofInstant(instant, zone);

            final ParseInfoRecord record = jooq.newRecord(PARSE_INFO);

            record.setId(uuid);
            record.setContent(loadText());
            record.setDate(Timestamp.valueOf(time));

            record.insert();
        }
    }

    private String loadText()
        throws IOException
    {
        final Path path = fs.getPath(INPUT_PATH);

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            return CharStreams.toString(reader);
        }
    }

    private void writeMatchers()
        throws IOException
    {
        final Path path = tmpdir.resolve("matchers.csv");

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            jooq.loadInto(MATCHERS)
                .onErrorAbort()
                .loadCSV(reader)
                .fields(MATCHERS_FIELDS)
                .separator(';')
                .execute();
        }
    }

    private void writeNodes()
        throws IOException
    {
        final Path path = tmpdir.resolve("nodes.csv");

        try (
            final BufferedReader reader = Files.newBufferedReader(path);
        ) {
            jooq.loadInto(NODES)
                .onErrorAbort()
                .loadCSV(reader)
                .fields(NODES_FIELDS)
                .separator(';')
                .execute();
        }
    }

    private void time(final ThrowingRunnable runnable, final String description)
    {
        System.out.println(description + ": start");
        final Stopwatch stopwatch = Stopwatch.createStarted();
        runnable.run();
        System.out.println(description + ": done (" + stopwatch.stop() + ')');
    }

    public static void main(final String... args)
        throws IOException
    {
        if (args.length != 1) {
            System.err.println("missing zip argument");
            System.exit(2);
        }

        final Path zip = Paths.get(args[0]).toRealPath();

        final UUID uuid = UUID.randomUUID();
        final DSLContext jooq = PostgresqlTraceDbFactory.defaultFactory()
            .getJooq();

        try (
            final FileSystem fs = MoreFileSystems.openZip(zip, true);
        ) {
            final Zip2Db2 zip2Db = new Zip2Db2(fs, jooq, uuid);
            try {
                zip2Db.run();
            } finally {
                zip2Db.removeTmpdir();
            }
        }
    }
}

Now, here is my first problem... It is much slower than loading into H2. Here is a timing for a CSV containing 620 matchers and 45746 nodes:

Generate matchers CSV: start
Generate matchers CSV: done (45.26 ms)
Generate nodes CSV: start
Generate nodes CSV: done (573.2 ms)
Write info record: start
Write info record: done (311.1 ms)
Write matchers: start
Write matchers: done (4.192 s)
Write nodes: start
Write nodes: done (22.64 s)

Give or take, and forgetting the part about writing the dedicated CSVs (see below), that is 25 seconds. Loading this into an on-the-fly, disk-based H2 database takes less than 5 seconds!

The other problem I have is that I have to write dedicated CSVs; it appears that the CSV loading API is not really flexible in what it accepts, and I have to, for instance, turn this line:

328;SequenceMatcher;COMPOSITE;token

into this:

"some-randome-uuid-here";"328";"SequenceMatcher";"COMPOSITE";"token"

But my biggest problem is in fact that this zip is pretty small. For instance, I have a zip with not 620 but 1532 matchers, and not 45746 but more than 34 million nodes; even if we dismiss the CSV generation time (the original nodes CSV is 1.2 GiB), since H2 injection takes 20 minutes, multiplying this by 5 gives a time somewhere north of 1h30m, which is a lot!

All in all, the process is quite inefficient at the moment...

Now, in the defence of PostgreSQL:


  • constraints on the PostgreSQL instance are much higher than those on the H2 instance: I don't need a UUID in generated zip files;
  • H2 is tuned "insecurely" for writes: jdbc:h2:/path/to/db;LOG=0;LOCK_MODE=0;UNDO_LOG=0;CACHE_SIZE=131072.

Still, this difference in insertion times seems a little excessive, and I am quite sure that it can be better. But I don't know where to start.

Also, I am aware that PostgreSQL has a dedicated mechanism to load from CSVs, but here the CSVs are in a zip file to start with, and I'd really like to avoid having to create a dedicated CSV as I am currently doing... Ideally I'd like to read line by line from the zip directly (which is what I do for H2 injection), transform the line and write into the PostgreSQL schema.

Finally, I am also aware that I currently do not disable constraints on the PostgreSQL schema before insertion; I have yet to try this (will it make a difference?).

So, what do you suggest I do to improve the performance?

Answer

Here are a couple of measures you can take:

In jOOQ 3.6, there are two new modes in the Loader API:

  • bulk loads (https://github.com/jOOQ/jOOQ/pull/3975)
  • batch loads (https://github.com/jOOQ/jOOQ/issues/2664)

These techniques have been observed to speed up loading significantly, by orders of magnitude. See also this article about JDBC batch loading performance.
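
For instance, assuming jOOQ 3.6 or later, the writeNodes() call from the question could be extended roughly as follows; the bulk and batch sizes are illustrative assumptions to be tuned, not recommended values:

jooq.loadInto(NODES)
    .onErrorAbort()
    .bulkAfter(500)    // group ~500 rows into a single multi-row INSERT
    .batchAfter(20)    // send ~20 such statements per JDBC batch
    .loadCSV(reader)
    .fields(NODES_FIELDS)
    .separator(';')
    .execute();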

You currently load everything in one huge transaction (or you use auto-commit, but that's not good, either). This is bad for large loads, because the database needs to keep track of all the insertions in your insert session to be able to roll them back if needed.

This gets even worse when you're doing that on a live system, where such large loads generate lots of contention.

jOOQ's Loader API allows you to specify the "commit" size via LoaderOptionsStep.commitAfter(int).
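
Combined with the bulk and batch options above, this could look roughly like the following in the question's code (50 000 is an arbitrary chunk size, an assumption to tune for your data set):

jooq.loadInto(NODES)
    .onErrorAbort()
    .bulkAfter(500)
    .batchAfter(20)
    .commitAfter(50_000)   // commit every ~50 000 rows instead of one huge transaction
    .loadCSV(reader)
    .fields(NODES_FIELDS)
    .separator(';')
    .execute();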

This is only possible if you're loading stuff offline, but it can drastically speed up loading if you turn off logging entirely in your database (for that table), and if you turn off constraints while loading, turning them on again after the load.
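
For the logging part, here is a minimal sketch, assuming PostgreSQL 9.5 or later (which supports ALTER TABLE ... SET UNLOGGED / SET LOGGED; on older 9.x versions the tables would have to be created UNLOGGED up front) and assuming nothing else needs these tables to be crash-safe during the load:

// switch off WAL logging for the target tables (not crash-safe while unlogged!)
jooq.execute("alter table matchers set unlogged");
jooq.execute("alter table nodes set unlogged");

// ... run the bulk load here ...

// restore normal logging afterwards (this rewrites the tables)
jooq.execute("alter table nodes set logged");
jooq.execute("alter table matchers set logged");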


Finally, I am also aware that I currently do not disable constraints on the PostgreSQL schema before insertion; I have yet to try this (will it make a difference?).

Oh yes, it will. Specifically, the unique constraint costs a lot on each single insertion, as it has to be maintained all the time.
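
As a sketch, the unique constraint on nodes could be dropped before the load and re-created once afterwards (the same idea applies to the foreign keys); the constraint name below is a guess at PostgreSQL's auto-generated name and must be checked against the actual schema, e.g. with \d nodes in psql:

// drop the expensive unique constraint before the bulk load
// (the constraint name is an assumption; look up the real one first)
jooq.execute("alter table nodes drop constraint nodes_parse_info_id_id_key");

// ... bulk-load the nodes here ...

// re-create it once, after the load
jooq.execute("alter table nodes add unique (parse_info_id, id)");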

This code here:

final List<String> parts = new ArrayList<>();
parts.add('"' + uuid.toString() + '"');
Arrays.stream(SEMICOLON.split(input, 4))
      .map(s -> '"' + CSV_ESCAPE.apply(s) + '"')
      .forEach(parts::add);
return String.join(";", parts) + '\n';

Generates a lot of pressure on your garbage collector, as you're implicitly creating, and throwing away, a lot of StringBuilder objects (some background on this can be found in this blog post). Normally, that's fine and shouldn't be optimised prematurely, but in a large batch process you can certainly gain a couple of percent in speed if you transform the above into something more low-level:

StringBuilder result = new StringBuilder();
result.append('"').append(uuid.toString()).append('"');

// keep the ';' field separator that String.join() provided in the original
for (String s : SEMICOLON.split(input, 4))
    result.append(';').append('"').append(CSV_ESCAPE.apply(s)).append('"');

...

Of course, you can still write the same thing in a functional style, but I've found it way easier to optimise these low-level String operations using classic pre-Java 8 idioms.
