Get distinct information across many fields, some of which are NULL

Question

I have a table with just over 65 million rows and 140 columns. The data comes from several sources and is submitted at least every month.

I am looking for a quick way to grab specific fields from this data, keeping only the combinations that are unique. I want to process all the information to link which invoice was sent with which identifying numbers, and who submitted it. The issue is that I don't want to iterate over 65 million records. If I can get the distinct values, I will only have to process, say, 5 million records as opposed to 65 million. See below for a description of the data, and the SQL Fiddle for a sample.

If, say, a client submits an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1 every month, I only want one row where this combination appears, i.e. the 4 fields together have to be unique.

If they submit the above for 30 months and then, in the 31st month, send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to pick this row as well: the national_identity value is new, so the row as a whole is unique.

  • By "linked to" I mean they appear on the same row.
  • Any of the fields may be NULL at some point.
  • The 'pivot/composite' columns are invoice_number and submitted_by. If either of them is missing, drop that row.
  • I also need to include the database_id with the above data, i.e. the primary id that is auto-generated by the PostgreSQL database.
  • The only fields that don't need to be returned are other_column and yet_another_column. Remember that the table has 140 columns, so I don't need them.
  • With the results, create a new table that will hold these unique records.

See this SQL fiddle for an attempt to recreate the scenario.

From that fiddle, I'd expect a result like:

  • Rows 1, 2 and 11: Only one of them should be kept, as they are exactly the same. Preferably the row with the smallest id.
  • Rows 4 and 9: One of them would be dropped, as they are exactly the same.
  • Rows 5, 7 and 8: Would be dropped, since they are missing either the invoice_number or submitted_by.
  • The result would then have rows (1, 2 or 11), 3, (4 or 9), 6 and 10.

Recommended answer

To get one representative row (with additional fields) from each group of rows that share the same values in the four fields:

SELECT 
distinct on (
  invoice_number
  , passport_number
  , national_id_number
  , driving_license_number
)
  * -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
;

Note that it is unpredictable which row of each group is returned unless you specify an ordering (see the documentation on DISTINCT ON).
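Since the question prefers the row with the smallest id in each group, one way to make the choice deterministic is to add an ORDER BY that starts with the DISTINCT ON expressions and ends with id. The following is only a sketch: the miniature my_table and its sample rows are made up for illustration (they are not the fiddle data), and the column names are taken from the query above. PostgreSQL treats NULLs as equal when comparing rows for DISTINCT, so rows that differ only by having NULL in the same fields fall into one group.

```sql
-- Hypothetical miniature of the real table; column names follow the query above.
CREATE TEMP TABLE my_table (
    id                     serial PRIMARY KEY,
    invoice_number         text,
    submitted_by           text,
    passport_number        text,
    national_id_number     text,
    driving_license_number text
);

-- Made-up sample rows, for illustration only.
INSERT INTO my_table
    (invoice_number, submitted_by, passport_number, national_id_number, driving_license_number)
VALUES
    ('inv_1', 'client_a', 'pp_1', 'nid_1', 'dl_1'),
    ('inv_1', 'client_a', 'pp_1', 'nid_1', 'dl_1'),  -- duplicate of id 1
    ('inv_1', 'client_a', 'pp_1', 'nid_2', 'dl_1'),  -- new national id, so kept
    ('inv_2', 'client_b', NULL,   'nid_3', NULL),
    ('inv_2', 'client_b', NULL,   'nid_3', NULL),    -- duplicate of id 4; NULLs compare equal
    (NULL,    'client_b', 'pp_2', 'nid_4', 'dl_2');  -- no invoice_number, so filtered out

SELECT DISTINCT ON (
    invoice_number, passport_number, national_id_number, driving_license_number
)
    id, invoice_number, submitted_by,
    passport_number, national_id_number, driving_license_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
ORDER BY
    invoice_number, passport_number, national_id_number, driving_license_number,
    id;  -- tie-breaker: the smallest id within each group comes first and is kept
```

With these sample rows the query should keep ids 1, 3 and 4. The output is sorted by the four grouping columns first, not by id.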

Ordering this result by id by simply adding order by id to the end doesn't work, because the DISTINCT ON expressions must match the leftmost ORDER BY expressions. It can be done either by using a CTE:

with distinct_rows as (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
)
select *
from distinct_rows
order by id;

or by making the original query a subquery:

select *
from (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
) t
order by id;
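The question also asks for the unique rows to be stored in a new table. A minimal sketch of that step, assuming my_table exists with the column names used above: unique_invoice_links is a hypothetical table name, and the explicit column list stands in for "specify the columns you want here" so that the other 130-odd columns are left out.

```sql
-- unique_invoice_links is a hypothetical name for the new table.
CREATE TABLE unique_invoice_links AS
SELECT *
FROM (
    SELECT DISTINCT ON (
        invoice_number, passport_number, national_id_number, driving_license_number
    )
        id, invoice_number, submitted_by,
        passport_number, national_id_number, driving_license_number
    FROM my_table
    WHERE invoice_number IS NOT NULL
      AND submitted_by IS NOT NULL
    ORDER BY
        invoice_number, passport_number, national_id_number, driving_license_number,
        id  -- keep the smallest id of each group
) t
ORDER BY id;
```

The outer ORDER BY only affects the order in which rows are inserted. A table has no inherent row order, so later queries against the new table still need their own ORDER BY.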
