How to find the bad data causing this insert to fail

Question

I have a Postgres 9.3.5 database with 80 million records. The insert query below fails with:

ERROR:  invalid input syntax for integer: ""

INSERT INTO DISCOGS.TRACK_DURATION
     SELECT
        track_id,
        duration,
        hours_as_seconds + minutes_as_seconds + seconds as total_seconds
    FROM (
            select
            track_id,
            duration,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 1))::bigint * 60 * 60
                ELSE 0
            END  as hours_as_seconds,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 2))::bigint * 60
                WHEN duration like '%:%'  THEN  (split_part(duration, ':', 1))::bigint * 60
                ELSE 0
            END as minutes_as_seconds,
            CASE
                WHEN duration like '%:%:%' THEN (split_part(duration, ':', 3))::bigint
                WHEN duration like '%:%'   THEN (split_part(duration, ':', 2))::bigint
                ELSE 0
            END as seconds
            from discogs.track t1
            where release_id < 10000000
            and t1.duration!='' and t1.duration is not null
            and t1.position!=''
    ) as s1

I can use the release_id condition in the where clause to limit the number of records checked, and with lower values the insert succeeds, so the cause must be bad data. But with so many records, how do I find the problem rows? Note that I am already filtering out rows where duration is an empty string, and I have found and fixed a few records with bad data (such as %%%%), but the insert still fails.

Recommended answer

I would search for malformed durations using a regular expression, as in:

create table duration (
  d varchar(20)
);

insert into duration (d) values ('12:34:56');
insert into duration (d) values ('34:56');
insert into duration (d) values ('15::'); -- bad one
insert into duration (d) values (':34:56'); -- bad one
insert into duration (d) values (':34:'); -- bad one
insert into duration (d) values ('12:34:'); -- bad one
insert into duration (d) values ('34:'); -- bad one
insert into duration (d) values (':56'); -- bad one

select *
  from duration 
  where d not similar to '([0-9]+:)?[0-9]+:[0-9]+';

Result:

d                     
------
15::                  
:34:56                
:34:                  
12:34:                
34:                   
:56 
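
For comparison, the same check can be written with the POSIX regular expression operator. This is only a sketch against the example duration table above: SIMILAR TO always matches the whole string, whereas !~ matches anywhere in it, so the pattern needs explicit ^ and $ anchors to behave the same way.

```sql
-- Same filter as the SIMILAR TO query, using the POSIX "does not match" operator.
-- The ^ and $ anchors are required here; without them '12:34:56:' would
-- count as a match because '12:34:56' appears inside it.
select *
  from duration
  where d !~ '^([0-9]+:)?[0-9]+:[0-9]+$';
```

Both forms skip rows where d is NULL, because the comparison yields NULL rather than true.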

In your case the query should look like:

select track_id, duration 
  from discogs.track
  where duration not similar to '([0-9]+:)?[0-9]+:[0-9]+';
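
The specific message in the question (invalid input syntax for integer: "") suggests that one of the split_part results being cast is an empty string, which happens for values such as '34:' or ':34:56' even though duration itself is not empty. As a narrower sketch, assuming the same discogs.track table and the same casts as the failing insert, the query below lists only the rows where a part that would be cast is empty. It does not catch non-numeric parts such as 'ab:cd'; the pattern query above covers those.

```sql
-- split_part returns '' when the requested field is empty, and ''::bigint
-- raises the error from the question. For example:
--   select split_part('34:', ':', 2);   -- returns an empty string

-- Rows where one of the parts cast by the insert is empty.
select track_id, duration
  from discogs.track
  where duration like '%:%'
    and (   split_part(duration, ':', 1) = ''
         or split_part(duration, ':', 2) = ''
         or (duration like '%:%:%' and split_part(duration, ':', 3) = ''));
```

Adding the original release_id and position conditions to the where clause restricts the check to the same rows the insert reads.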
