Best way to prevent duplicate data on copy csv postgresql

Question

This is more of a conceptual question because I'm planning how best to achieve our goals here.

I have a postgresql/postgis table with 5 columns. I'll be inserting/appending data into the database from a csv file every 10 minutes or so via the copy command. There will likely be some duplicate rows of data, so I'd like to copy the data from the csv file to the postgresql table but prevent any duplicate entries from getting into the table. There are three columns ("latitude", "longitude" and "time") where, if they are all equal, the entry is a duplicate. Should I make a composite key from all three columns? If I do, will it just throw an error upon trying to copy the csv file into the database? I'm going to be copying the csv file automatically, so I'd want it to go ahead and copy the rows of the file that aren't duplicates and skip the duplicates. Is there a way to do this?

Also, I of course want it to look for duplicates in the most efficient way. I don't need to look through the whole table (which will be quite large) for duplicates...just the past 20 minutes or so via the timestamp on the row. I've already indexed the table on the time column.

Thanks for any help!

Answer

I think I would take the following approach.

First, create an index on the three columns that you care about:

-- enforce the three-column deduplication key
create unique index idx_bigtable_col1_col2_col3 on bigtable(col1, col2, col3);

Then, load the data into a staging table using copy. Finally, you can do:

insert into bigtable(col1, . . . )
    select col1, . . .
    from stagingtable st
    -- keep only rows whose (col1, col2, col3) key isn't already in bigtable
    where (col1, col2, col3) not in (select col1, col2, col3 from bigtable);
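
For completeness, the staging step itself might look like the sketch below. The column layout and file path are illustrative assumptions; match them to your five actual columns:

-- transient staging table mirroring the csv layout (column names are assumptions)
create temporary table stagingtable (
    col1 double precision,  -- e.g. latitude
    col2 double precision,  -- e.g. longitude
    col3 timestamp,         -- e.g. time
    col4 text,
    col5 text
);

-- server-side bulk load; the file path must be readable by the postgres server
copy stagingtable from '/path/to/data.csv' with (format csv, header true);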

Assuming no other data modifications are going on, this should accomplish what you want. Checking for duplicates using the index should be ok from a performance perspective.
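
One caveat: if any of the three key columns can be null, not in can silently skip rows, because a null in the comparison makes the whole test unknown. An equivalent not exists form (a sketch, using the same assumed names) avoids that pitfall and can still use the unique index:

insert into bigtable(col1, . . . )
    select col1, . . .
    from stagingtable st
    where not exists (
        select 1
        from bigtable b
        where b.col1 = st.col1
          and b.col2 = st.col2
          and b.col3 = st.col3
    );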

An alternative method is to emulate MySQL's "on duplicate key update" behavior to ignore such records. Bill Karwin suggests implementing a rule for this in an answer to this question; the documentation for rules is here. Something similar could also be done with triggers.
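
On PostgreSQL 9.5 and later there is also a built-in way to do this: insert ... on conflict do nothing skips any row that would violate the unique index, including duplicates within the same batch. A sketch, again under the same assumed names:

insert into bigtable(col1, . . . )
    select col1, . . .
    from stagingtable st
    on conflict (col1, col2, col3) do nothing;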
