具有大量或未定义类别的交叉表 [英] Crosstab with a large or undefined number of categories

查看:59
本文介绍了具有大量或未定义类别的交叉表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的真正问题与记录非常大量的防病毒产品中的哪些同意给定样本是给定防病毒家族的成员有关.该数据库有数百万个样本,每个样本都有数十种反病毒产品投票.我想问一个查询,例如对于包含名称"XYZ"的恶意软件,哪个样本的投票最多,哪些供应商投票了?并得到如下结果:

My real problem has to do with recording which of a very large number of anti-virus products agree that a given sample is a member of a given anti-virus family. The database has millions of samples, with tens of anti-virus products voting on each sample. I want to ask a query like "For the malware containing the name 'XYZ' which sample had the most votes, and which vendors voted for it?" and get results like:

"BadBadVirus"  
                     V1  V2  V3  V4  V5  V6  V7  
Sample 1 - 4 votes    1   0   1   0   0   1   1      
Sample 2 - 5 votes    1   0   1   0   1   1   1   
Sample 3 - 5 votes    1   0   1   0   1   1   1  

 total     14         3       3       2   3   3  

可能用来告诉我供应商2和供应商4都不知道如何 来检测这种恶意软件,或者他们使用不同的名称.

Which might be used to tell me that Vendor 2 and Vendor 4 either don't know how to detect this malware, or that they name it something different.

我将尝试稍微概括一下我的问题,同时希望不会破坏您帮助我的能力.假设我有五名选民(亚历克斯,鲍勃,卡罗尔,戴夫,埃德)被要求看五张照片(P1,P2,P3,P4,P5)并确定照片的主要主题"是什么.对于我们的示例,我们仅假设它们仅限于猫",狗"或马".并非每个投票者都对每一件事进行投票.

I'm going to try to generalize my question slightly while hopefully not breaking your ability to help me. Assume that I have five voters (Alex, Bob, Carol, Dave, Ed) who have been asked to look at five photographs (P1, P2, P3, P4, P5) and decide what the "main subject" of the photograph is. For our example, we'll just assume they were limited to "Cat", "Dog", or "Horse". Not every voter votes on every thing.

数据以这种形式存在于数据库中:

The data is in the database in this form:

Photo, Voter, Decision
(1, 'Alex', 'Cat')
(1, 'Bob', 'Dog')
(1, 'Carol', 'Cat')
(1, 'Dave', 'Cat')
(1, 'Ed', 'Cat')
(2, 'Alex', 'Cat')
(2, 'Bob', 'Dog')
(2, 'Carol', 'Cat')
(2, 'Dave', 'Cat')
(2, 'Ed', 'Dog')
(3, 'Alex', 'Horse')
(3, 'Bob', 'Horse')
(3, 'Carol', 'Dog')
(3, 'Dave', 'Horse')
(3, 'Ed', 'Horse')
(4, 'Alex', 'Horse')
(4, 'Bob', 'Horse')
(4, 'Carol', 'Cat')
(4, 'Dave', 'Horse')
(4, 'Ed', 'Horse')
(5, 'Alex', 'Dog')
(5, 'Bob', 'Cat')
(5, 'Carol', 'Cat')
(5, 'Dave', 'Cat')
(5, 'Ed', 'Cat')

目标是给定我们正在寻找的照片主题,我们想知道有多少选民认为这是照片的重点,而且还列出了投票者的想法.

The objective is that given a photo topic we are looking for, we'd like to know how many voters thought that WAS the main point of that photo, but also list WHICH VOTERS thought that.

Query for: "Cat"
      Total  Alex  Bob Carol Dave Ed
1 -     4      1    0    1     1   1
2 -     3      1    0    1     1   0 
3 -     0      0    0    0     0   0 
4 -     1      0    0    1     0   0 
5 -     4      0    1    1     1   1
------------------------------------
total  12      2    1    4     3   2 

Query for: "Dog"
      Total  Alex  Bob Carol Dave Ed
1 -     1     0      1   0    0    0
2 -     2     0      1   0    0    1
3 -     1     0      0   1    0    0 
4 -     0     0      0   0    0    0 
5 -     1     1      0   0    0    0 
------------------------------------
total   5     1      2   1    0    1 

我可以用存储格式来处理数据吗?

Is that something I can do with the data in the format that I have it stored?

我很难获得执行此操作的查询-尽管它很简单,可以转储数据,然后编写一个程序来执行此操作,但我真的很希望能够在数据库中执行此操作

I'm having difficulty getting a query that does that - although it's simple enough to dump the data out and then write a program to do that, I'd really like to be able to do it IN THE DATABASE if I can.

谢谢您的建议.

推荐答案

create table vote (Photo integer, Voter text, Decision text);
insert into vote values
(1, 'Alex', 'Cat'),
(1, 'Bob', 'Dog'),
(1, 'Carol', 'Cat'),
(1, 'Dave', 'Cat'),
(1, 'Ed', 'Cat'),
(2, 'Alex', 'Cat'),
(2, 'Bob', 'Dog'),
(2, 'Carol', 'Cat'),
(2, 'Dave', 'Cat'),
(2, 'Ed', 'Dog'),
(3, 'Alex', 'Horse'),
(3, 'Bob', 'Horse'),
(3, 'Carol', 'Dog'),
(3, 'Dave', 'Horse'),
(3, 'Ed', 'Horse'),
(4, 'Alex', 'Horse'),
(4, 'Bob', 'Horse'),
(4, 'Carol', 'Cat'),
(4, 'Dave', 'Horse'),
(4, 'Ed', 'Horse'),
(5, 'Alex', 'Dog'),
(5, 'Bob', 'Cat'),
(5, 'Carol', 'Cat'),
(5, 'Dave', 'Cat'),
(5, 'Ed', 'Cat')
;

对猫的查询:

select photo,
    alex + bob + carol + dave + ed as Total,
    alex, bob, carol, dave, ed
from crosstab($$
    select
        photo, voter,
        case decision when 'Cat' then 1 else 0 end
    from vote
    order by photo
    $$,'
    select distinct voter
    from vote
    order by voter
    '
) as (
    photo integer,
    Alex integer,
    Bob integer,
    Carol integer,
    Dave integer,
    Ed integer
);
 photo | total | alex | bob | carol | dave | ed 
-------+-------+------+-----+-------+------+----
     1 |     4 |    1 |   0 |     1 |    1 |  1
     2 |     3 |    1 |   0 |     1 |    1 |  0
     3 |     0 |    0 |   0 |     0 |    0 |  0
     4 |     1 |    0 |   0 |     1 |    0 |  0
     5 |     4 |    0 |   1 |     1 |    1 |  1

如果选民人数众多或未知,则可以动态进行:

If the number of voters is big or not known then it can be done dynamically:

do $do$
declare
voter_list text;
r record;
begin

drop table if exists pivot;

voter_list := (
    select string_agg(distinct voter, ' ' order by voter) from vote
    );

execute(format('
    create table pivot (
        decision text,
        photo integer,
        Total integer,
        %1$s
    )', (replace(voter_list, ' ', ' integer, ') || ' integer')
));

for r in
select distinct decision from vote
loop
    execute (format($f$
        insert into pivot
        select
            %3$L as decision,
            photo,
            %1$s as Total,
            %2$s
        from crosstab($ct$
            select
                photo, voter,
                case decision when %3$L then 1 else 0 end
            from vote
            order by photo
            $ct$,$ct$
            select distinct voter
            from vote
            order by voter
            $ct$
        ) as (
            photo integer,
            %4$s
        );$f$,
        replace(voter_list, ' ', ' + '),
        replace(voter_list, ' ', ', '),
        r.decision,
        replace(voter_list, ' ', ' integer, ') || ' integer'
    ));
end loop;
end; $do$;

上面的代码创建了包含所有决定的表数据透视表:

The above code created the table pivot with all the decisions:

select * from pivot where decision = 'Cat';

这篇关于具有大量或未定义类别的交叉表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆