Get distinct information across many fields, some of which are NULL
Question
I have a table with just over 65 million rows and 140 columns. The data comes from several sources and is submitted at least every month.
I'm looking for a quick way to grab specific fields from this data only where they are unique. I want to process all the information to link which invoice was sent with which identifying numbers and who sent it. The issue is that I don't want to iterate over 65 million records. If I can get distinct values, I will only have to process, say, 5 million records as opposed to 65 million. See below for a description of the data and the SQL Fiddle for a sample.
If, say, a client submits an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1 every month, I only want one row where this appears, i.e. the 4 fields have got to be unique.
If they submit the above for 30 months, then on the 31st month they send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to pick this row also, since the national_identity field is new and hence the whole row is unique.
- By "linked to" I mean they appear on the same row.
- For all fields it is possible to have NULL occurring at some point.
- The 'pivot/composite' columns are the invoice_number and submitted_by. If either of those is missing, drop that row.
- I also need to include the database_id with the above data, i.e. the primary_id which is auto generated by the postgresql database.
- The only fields that don't need to be returned are other_column and yet_another_column. Remember the table has 140 columns, so I don't need them.
- With the results, create a new table that will hold these unique records.
See this SQL fiddle for an attempt to recreate the scenario.
From that fiddle, I'd expect a result like:
- Row 1, 2 & Row 11: Only one of them shall be kept as they are exactly the same. Preferably the row with the smallest id.
- Row 4 and Row 9: One of them would be dropped as they are exactly the same.
- Row 5, 7 & 8: Would be dropped since they are missing either the invoice_number or submitted_by.
- The result would then have Row (1, 2 or 11), 3, (4 or 9), 6 and 10.
Recommended answer
To get one representative row (with additional fields) from a group with the four distinct fields:
SELECT
distinct on (
invoice_number
, passport_number
, national_id_number
, driving_license_number
)
* -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
;
Note that it is unpredictable exactly which row is returned unless you specify an ordering (see the documentation on DISTINCT).
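A minimal sketch of specifying that ordering so that the row with the smallest id is kept in each group. It assumes the primary key column is named id (as in the question's expected result) and uses an explicit, illustrative column list in place of *. In PostgreSQL the DISTINCT ON expressions must match the leftmost ORDER BY expressions, so id goes last as the tie-breaker:

```sql
-- Assumption: the auto-generated primary key column is called id.
SELECT
    DISTINCT ON (
          invoice_number
        , passport_number
        , national_id_number
        , driving_license_number
    )
      id
    , invoice_number
    , submitted_by
    , passport_number
    , national_id_number
    , driving_license_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
-- The first row of each group is returned, so sorting by id last
-- keeps the smallest id among rows with identical key values.
ORDER BY
      invoice_number
    , passport_number
    , national_id_number
    , driving_license_number
    , id
;
```

NULL values are treated as equal when DISTINCT ON compares rows, so rows whose passport, national id or driving license fields are all NULL in the same positions fall into one group.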
To sort the final result by id, simply adding order by id to the end doesn't work, because the DISTINCT ON expressions must match the leftmost ORDER BY expressions. It can be done either by using a CTE
with distinct_rows as (
SELECT
distinct on (
invoice_number
, passport_number
, national_id_number
, driving_license_number
-- ...
)
* -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
)
select *
from distinct_rows
order by id;
or making the original query a subquery
select *
from (
SELECT
distinct on (
invoice_number
, passport_number
, national_id_number
, driving_license_number
-- ...
)
* -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
) t
order by id;
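The question also asks for the result to be stored in a new table. One way to sketch that, assuming PostgreSQL's CREATE TABLE ... AS and a hypothetical table name unique_invoice_links, is to wrap the subquery form above (the column list and the id column name are likewise illustrative):

```sql
-- unique_invoice_links is a hypothetical name; adjust the column list as needed.
CREATE TABLE unique_invoice_links AS
SELECT *
FROM (
    SELECT
        DISTINCT ON (
              invoice_number
            , passport_number
            , national_id_number
            , driving_license_number
        )
          id
        , invoice_number
        , submitted_by
        , passport_number
        , national_id_number
        , driving_license_number
    FROM my_table
    WHERE invoice_number IS NOT NULL
      AND submitted_by IS NOT NULL
    -- keep the smallest id within each group of identical values
    ORDER BY
          invoice_number
        , passport_number
        , national_id_number
        , driving_license_number
        , id
) t
ORDER BY id;
```

The new table then holds one row per distinct combination, and later processing can read it instead of scanning the full 65 million rows.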