Advanced MySQL: Find correlations between poll responses
Problem description
I've got four MySQL tables:
users (id, name)
polls (id, text)
options (id, poll_id, text)
responses (id, poll_id, option_id, user_id)
Given a particular poll and a particular option, I'd like to generate a table that shows which options from other polls are most strongly correlated.
Suppose this is our data set:
TABLE users:
+------+-------+
| id | name |
+------+-------+
| 1 | Abe |
| 2 | Bob |
| 3 | Che |
| 4 | Den |
+------+-------+
TABLE polls:
+------+-----------------------+
| id | text |
+------+-----------------------+
| 1 | Do you like apples? |
| 2 | What is your gender? |
| 3 | What is your height? |
| 4 | Do you like polls? |
+------+-----------------------+
TABLE options:
+------+----------+---------+
| id | poll_id | text |
+------+----------+---------+
| 1 | 1 | Yes |
| 2 | 1 | No |
| 3 | 2 | Male |
| 4 | 2 | Female |
| 5 | 3 | Short |
| 6 | 3 | Tall |
| 7 | 4 | Yes |
| 8 | 4 | No |
+------+----------+---------+
TABLE responses:
+------+----------+------------+----------+
| id | poll_id | option_id | user_id |
+------+----------+------------+----------+
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 3 |
| 4 | 1 | 2 | 4 |
| 5 | 2 | 3 | 1 |
| 6 | 2 | 3 | 2 |
| 7 | 2 | 3 | 3 |
| 8 | 2 | 4 | 4 |
| 9 | 3 | 5 | 1 |
| 10 | 3 | 6 | 2 |
| 11 | 3 | 5 | 3 |
| 12 | 3 | 6 | 4 |
| 13 | 4 | 7 | 1 |
| 14 | 4 | 7 | 2 |
| 15 | 4 | 7 | 3 |
| 16 | 4 | 7 | 4 |
+------+----------+------------+----------+
Given the poll ID 1 and the option ID 2, the generated table should be something like this:
+----------+------------+-----------------------+
| poll_id | option_id | percent_correlated |
+----------+------------+-----------------------+
| 4 | 7 | 100 |
| 2 | 3 | 66.66 |
| 3 | 6 | 66.66 |
| 2 | 4 | 33.33 |
| 3 | 5 | 33.33 |
| 4 | 8 | 0 |
+----------+------------+-----------------------+
So basically, we identify all of the users who responded to poll ID 1 with option ID 2, then look through all the other polls to see what percentage of those users also selected each of the other options.
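To make the arithmetic concrete, here is a minimal Python sketch that computes these percentages directly from the sample data. The function name `correlated_options` and the in-memory tuples are illustrative assumptions, not part of the schema. It also assumes the denominator is the number of selected users who answered the poll in question (in the sample data every user answered every poll, so this is simply 3).

```python
from collections import Counter

# Sample data from the tables above: (poll_id, option_id, user_id).
RESPONSES = [
    (1, 1, 1), (1, 2, 2), (1, 2, 3), (1, 2, 4),
    (2, 3, 1), (2, 3, 2), (2, 3, 3), (2, 4, 4),
    (3, 5, 1), (3, 6, 2), (3, 5, 3), (3, 6, 4),
    (4, 7, 1), (4, 7, 2), (4, 7, 3), (4, 7, 4),
]
# (option_id, poll_id) pairs from the options table.
OPTIONS = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 3), (7, 4), (8, 4)]


def correlated_options(poll_id, option_id):
    """Return (poll_id, option_id, percent) rows, strongest correlation first."""
    # Users who picked the given option in the given poll.
    selected = {u for p, o, u in RESPONSES if p == poll_id and o == option_id}
    # How many of those users answered each poll, and how many chose each option.
    per_poll = Counter(p for p, o, u in RESPONSES if u in selected)
    per_option = Counter(o for p, o, u in RESPONSES if u in selected)
    rows = [
        (p, o, 100 * per_option[o] / per_poll[p])
        for o, p in OPTIONS
        if p != poll_id and per_poll[p] > 0
    ]
    # sorted() is stable, so ties keep the order of the options table.
    return sorted(rows, key=lambda row: -row[2])


for row in correlated_options(1, 2):
    print(row)
```

For poll 1 and option 2 the selected users are 2, 3 and 4; two of those three chose option 3, which is where the 66.66 in the table above comes from.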
Recommended answer
This seems to give the right results for me:
select poll_stats.poll_id,
       option_stats.option_id,
       (100 * option_responses / poll_responses) as percent_correlated
-- poll_stats: per poll, the number of responses from users who chose option 2 in poll 1
from (select responses.poll_id,
             count(*) as poll_responses
      from responses selecting_response
      join responses on responses.user_id = selecting_response.user_id
      where selecting_response.poll_id = 1 and selecting_response.option_id = 2
      group by responses.poll_id) poll_stats
-- option_stats: per option, how many of those same users chose it (0 if none did)
join (select options.poll_id,
             options.id as option_id,
             count(responses.id) as option_responses
      from options
      left join responses on responses.poll_id = options.poll_id
                         and responses.option_id = options.id
                         and exists (
                             select 1 from responses selecting_response
                             where selecting_response.user_id = responses.user_id
                               and selecting_response.poll_id = 1
                               and selecting_response.option_id = 2)
      group by options.poll_id, options.id
     ) as option_stats
  on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc
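One way to sanity-check a query of this shape against the sample data is to load the tables into an in-memory database. The sketch below uses Python's `sqlite3` module as a stand-in for MySQL, so a few details are assumptions rather than part of the answer: the short aliases (`sel`, `r`, `o`), the named parameters `:poll` and `:option`, the `100.0` factor (SQLite would otherwise do integer division), and the extra tie-break columns in `order by` that make the row order deterministic.

```python
import sqlite3

# Same structure as the query above, adapted for SQLite (see assumptions in the text).
QUERY = """
select poll_stats.poll_id,
       option_stats.option_id,
       100.0 * option_responses / poll_responses as percent_correlated
from (select r.poll_id, count(*) as poll_responses
      from responses sel
      join responses r on r.user_id = sel.user_id
      where sel.poll_id = :poll and sel.option_id = :option
      group by r.poll_id) poll_stats
join (select o.poll_id, o.id as option_id, count(r.id) as option_responses
      from options o
      left join responses r
             on r.poll_id = o.poll_id
            and r.option_id = o.id
            and exists (select 1 from responses sel
                        where sel.user_id = r.user_id
                          and sel.poll_id = :poll
                          and sel.option_id = :option)
      group by o.poll_id, o.id) option_stats
  on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> :poll
order by 3 desc, option_stats.poll_id, option_stats.option_id
"""


def build_db():
    """Create an in-memory database holding the sample options and responses."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table options (id integer primary key, poll_id integer, text text);
        create table responses (id integer primary key, poll_id integer,
                                option_id integer, user_id integer);
    """)
    db.executemany("insert into options values (?, ?, ?)", [
        (1, 1, "Yes"), (2, 1, "No"), (3, 2, "Male"), (4, 2, "Female"),
        (5, 3, "Short"), (6, 3, "Tall"), (7, 4, "Yes"), (8, 4, "No"),
    ])
    # Response ids are auto-assigned; rows are (poll_id, option_id, user_id).
    db.executemany(
        "insert into responses (poll_id, option_id, user_id) values (?, ?, ?)", [
            (1, 1, 1), (1, 2, 2), (1, 2, 3), (1, 2, 4),
            (2, 3, 1), (2, 3, 2), (2, 3, 3), (2, 4, 4),
            (3, 5, 1), (3, 6, 2), (3, 5, 3), (3, 6, 4),
            (4, 7, 1), (4, 7, 2), (4, 7, 3), (4, 7, 4),
        ])
    return db


def correlations(db, poll, option):
    """Run the correlation query for one poll/option pair."""
    return db.execute(QUERY, {"poll": poll, "option": option}).fetchall()


if __name__ == "__main__":
    for poll_id, option_id, percent in correlations(build_db(), 1, 2):
        print(poll_id, option_id, round(percent, 2))
```

With poll 1 and option 2 this should list option 7 first at 100 percent and option 8 last at 0, in line with the table in the question. MySQL itself returns a decimal for `/`, so the `100.0` adjustment is only needed for the SQLite stand-in.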