Strange duplicate behavior from GROUP_CONCAT of two LEFT JOINs of GROUP_BYs
Problem description
Here are the structures of all my tables and the query (please focus on the last query, appended below). As you can see in the fiddle, here is the current output:
+---------+-----------+-------+------------+--------------+
| user_id | user_name | score | reputation | top_two_tags |
+---------+-----------+-------+------------+--------------+
| 1       | Jack      | 0     | 18         | css,mysql    |
| 4       | James     | 1     | 5          | html         |
| 2       | Peter     | 0     | 0          | null         |
| 3       | Ali       | 0     | 0          | null         |
+---------+-----------+-------+------------+--------------+
It's correct and all is fine.
Now I have one more entity named "category". Each post can have only one category. I also want to get the top two categories for each user, and here is my new query. As you can see in the result, some duplicates appear:
+---------+-----------+-------+------------+--------------+------------------------+
| user_id | user_name | score | reputation | top_two_tags | top_two_categories     |
+---------+-----------+-------+------------+--------------+------------------------+
| 1       | Jack      | 0     | 18         | css,css      | technology,technology  |
| 4       | James     | 1     | 5          | html         | political              |
| 2       | Peter     | 0     | 0          | null         | null                   |
| 3       | Ali       | 0     | 0          | null         | null                   |
+---------+-----------+-------+------------+--------------+------------------------+
See? css,css and technology,technology. Why are these duplicated? I've just added one more LEFT JOIN for categories, exactly like the one for tags. But it doesn't work as expected, and it even affects the tags.
Anyway, this is the expected result:
+---------+-----------+-------+------------+--------------+------------------------+
| user_id | user_name | score | reputation | top_two_tags | category               |
+---------+-----------+-------+------------+--------------+------------------------+
| 1       | Jack      | 0     | 18         | css,mysql    | technology,social      |
| 4       | James     | 1     | 5          | html         | political              |
| 2       | Peter     | 0     | 0          | null         | null                   |
| 3       | Ali       | 0     | 0          | null         | null                   |
+---------+-----------+-------+------------+--------------+------------------------+
Does anybody know how I can achieve that?
CREATE TABLE users(id integer PRIMARY KEY, user_name varchar(5));
CREATE TABLE tags(id integer NOT NULL PRIMARY KEY, tag varchar(5));
CREATE TABLE reputations(
id integer PRIMARY KEY,
post_id integer /* REFERENCES posts(id) */,
user_id integer REFERENCES users(id),
score integer,
reputation integer,
date_time integer);
CREATE TABLE post_tag(
post_id integer /* REFERENCES posts(id) */,
tag_id integer REFERENCES tags(id),
PRIMARY KEY (post_id, tag_id));
CREATE TABLE categories(id INTEGER NOT NULL PRIMARY KEY, category varchar(10) NOT NULL);
CREATE TABLE post_category(
post_id INTEGER NOT NULL /* REFERENCES posts(id) */,
category_id INTEGER NOT NULL REFERENCES categories(id),
PRIMARY KEY(post_id, category_id)) ;
SELECT
q1.user_id, q1.user_name, q1.score, q1.reputation,
substring_index(group_concat(q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags,
substring_index(group_concat(q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
FROM
(SELECT
u.id AS user_Id,
u.user_name,
coalesce(sum(r.score), 0) as score,
coalesce(sum(r.reputation), 0) as reputation
FROM
users u
LEFT JOIN reputations r
ON r.user_id = u.id
AND r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
GROUP BY
u.id, u.user_name
) AS q1
LEFT JOIN
(
SELECT
r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
FROM
reputations r
JOIN post_tag pt ON pt.post_id = r.post_id
JOIN tags t ON t.id = pt.tag_id
WHERE
r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
GROUP BY
user_id, t.tag
) AS q2
ON q2.user_id = q1.user_id
LEFT JOIN
(
SELECT
r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
FROM
reputations r
JOIN post_category ct ON ct.post_id = r.post_id
JOIN categories c ON c.id = ct.category_id
WHERE
r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
GROUP BY
user_id, c.category
) AS q3
ON q3.user_id = q1.user_id
GROUP BY
q1.user_id, q1.user_name, q1.score, q1.reputation
ORDER BY
q1.reputation DESC, q1.score DESC ;
Your second query is of the form:
q1 -- PK user_id
LEFT JOIN (... GROUP BY
user_id, t.tag
) AS q2
ON q2.user_id = q1.user_id
LEFT JOIN (... GROUP BY
user_id, c.category
) AS q3
ON q3.user_id = q1.user_id
The GROUP BY columns make (user_id, t.tag) a key (UNIQUE) of q2 and (user_id, c.category) a key (UNIQUE) of q3.
A correct symmetrical INNER JOIN approach: LEFT JOIN q1 & q2 (1:many), then GROUP BY & GROUP_CONCAT (which is what your first query did); then separately and similarly LEFT JOIN q1 & q3 (1:many), then GROUP BY & GROUP_CONCAT; then INNER JOIN the two results ON user_id (1:1).
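A minimal sketch of that symmetrical INNER JOIN approach, untested against your fiddle. The aliases ut and uc are made up for this illustration, and the category side joins users directly instead of repeating q1, on the assumption that only user_id is needed from the user side there.

```sql
-- symmetrical INNER JOIN approach (sketch; ut and uc are illustrative aliases)
SELECT
    ut.user_id, ut.user_name, ut.score, ut.reputation,
    ut.top_two_tags,
    uc.category
FROM
    -- your 1st query (less ORDER BY): one row per user, with the tag list
    (SELECT
        q1.user_id, q1.user_name, q1.score, q1.reputation,
        substring_index(group_concat(q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags
    FROM
        (SELECT
            u.id AS user_id,
            u.user_name,
            coalesce(sum(r.score), 0) AS score,
            coalesce(sum(r.reputation), 0) AS reputation
        FROM
            users u
            LEFT JOIN reputations r
                ON r.user_id = u.id
                AND r.date_time > 1500584821
        GROUP BY
            u.id, u.user_name
        ) AS q1
        LEFT JOIN
        (SELECT
            r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
        FROM
            reputations r
            JOIN post_tag pt ON pt.post_id = r.post_id
            JOIN tags t ON t.id = pt.tag_id
        WHERE
            r.date_time > 1500584821
        GROUP BY
            r.user_id, t.tag
        ) AS q2
        ON q2.user_id = q1.user_id
    GROUP BY
        q1.user_id, q1.user_name, q1.score, q1.reputation
    ) AS ut
    INNER JOIN
    -- same shape for categories: one row per user, with the category list
    -- (assumption: users stands in for q1 because only user_id is needed here)
    (SELECT
        u.id AS user_id,
        substring_index(group_concat(q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
    FROM
        users u
        LEFT JOIN
        (SELECT
            r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
        FROM
            reputations r
            JOIN post_category ct ON ct.post_id = r.post_id
            JOIN categories c ON c.id = ct.category_id
        WHERE
            r.date_time > 1500584821
        GROUP BY
            r.user_id, c.category
        ) AS q3
        ON q3.user_id = u.id
    GROUP BY
        u.id
    ) AS uc
    ON uc.user_id = ut.user_id   -- 1:1, so no rows are multiplied
ORDER BY
    ut.reputation DESC, ut.score DESC;
```

Each side is already reduced to one row per user before the final join, so neither list can pick up duplicates from the other.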
A correct symmetrical scalar subquery approach: SELECT the GROUP_CONCATs from q1 as scalar subqueries, each with a GROUP BY.
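A minimal sketch of the scalar subquery approach, untested against your fiddle. Each list is computed by its own subquery correlated on user_id; the WHERE and GROUP BY placement is my reading of "each with a GROUP BY", not something taken from your queries.

```sql
-- symmetrical scalar subquery approach (sketch)
SELECT
    q1.user_id, q1.user_name, q1.score, q1.reputation,
    (SELECT
        substring_index(group_concat(q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2)
     FROM
        (SELECT
            r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
        FROM
            reputations r
            JOIN post_tag pt ON pt.post_id = r.post_id
            JOIN tags t ON t.id = pt.tag_id
        WHERE
            r.date_time > 1500584821
        GROUP BY
            r.user_id, t.tag
        ) AS q2
     WHERE
        q2.user_id = q1.user_id      -- correlate with the outer user row
     GROUP BY
        q2.user_id                   -- no matching rows: the subquery yields NULL
    ) AS top_two_tags,
    (SELECT
        substring_index(group_concat(q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2)
     FROM
        (SELECT
            r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
        FROM
            reputations r
            JOIN post_category ct ON ct.post_id = r.post_id
            JOIN categories c ON c.id = ct.category_id
        WHERE
            r.date_time > 1500584821
        GROUP BY
            r.user_id, c.category
        ) AS q3
     WHERE
        q3.user_id = q1.user_id
     GROUP BY
        q3.user_id
    ) AS category
FROM
    (SELECT
        u.id AS user_id,
        u.user_name,
        coalesce(sum(r.score), 0) AS score,
        coalesce(sum(r.reputation), 0) AS reputation
    FROM
        users u
        LEFT JOIN reputations r
            ON r.user_id = u.id
            AND r.date_time > 1500584821
    GROUP BY
        u.id, u.user_name
    ) AS q1
ORDER BY
    q1.reputation DESC, q1.score DESC;
```

The outer query has no join and no GROUP BY of its own, so q1 keeps exactly one row per user and the two lists are computed independently.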
A correct cumulative LEFT JOIN approach: LEFT JOIN q1 & q2 (1:many), then GROUP BY & GROUP_CONCAT; then LEFT JOIN that & q3 (1:many), then GROUP BY & GROUP_CONCAT.
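A correct approach like your 2nd query: You first LEFT JOIN q1 & q2 (1:many). Then you LEFT JOIN that & q3, but this is a chasm-trap-like non-FK (foreign key) join: it gives a row for every possible combination of a t.tag & a c.category that appears with a user_id. Then you GROUP BY & GROUP_CONCAT over duplicate (user_id, t.tag) pairs and duplicate (user_id, c.category) pairs. That is why you have duplicate list elements. For example, a user with two tags and two categories yields four joined rows, so each tag and each category is concatenated twice; with the list ordered by reputation, the copies of the top tag come first, and keeping the first two elements gives css,css. But if you GROUP_CONCAT DISTINCT, this also works. (Per wchiquito's comment.)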
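A minimal sketch of the DISTINCT variant: it is your 2nd query with DISTINCT added to both GROUP_CONCAT calls and nothing else changed. I have not run it against your fiddle, and combining DISTINCT with ORDER BY on a different column is worth checking on your server version.

```sql
-- your 2nd query with GROUP_CONCAT(DISTINCT ...) (sketch)
SELECT
    q1.user_id, q1.user_name, q1.score, q1.reputation,
    -- DISTINCT collapses the copies produced by the chasm join
    substring_index(group_concat(DISTINCT q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags,
    substring_index(group_concat(DISTINCT q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
FROM
    (SELECT
        u.id AS user_id,
        u.user_name,
        coalesce(sum(r.score), 0) AS score,
        coalesce(sum(r.reputation), 0) AS reputation
    FROM
        users u
        LEFT JOIN reputations r
            ON r.user_id = u.id
            AND r.date_time > 1500584821
    GROUP BY
        u.id, u.user_name
    ) AS q1
    LEFT JOIN
    (SELECT
        r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
    FROM
        reputations r
        JOIN post_tag pt ON pt.post_id = r.post_id
        JOIN tags t ON t.id = pt.tag_id
    WHERE
        r.date_time > 1500584821
    GROUP BY
        r.user_id, t.tag
    ) AS q2
    ON q2.user_id = q1.user_id
    LEFT JOIN
    (SELECT
        r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
    FROM
        reputations r
        JOIN post_category ct ON ct.post_id = r.post_id
        JOIN categories c ON c.id = ct.category_id
    WHERE
        r.date_time > 1500584821
    GROUP BY
        r.user_id, c.category
    ) AS q3
    ON q3.user_id = q1.user_id
GROUP BY
    q1.user_id, q1.user_name, q1.score, q1.reputation
ORDER BY
    q1.reputation DESC, q1.score DESC;
```

The ordering stays well defined here because (user_id, tag) is a key of q2 and (user_id, category) is a key of q3, so every copy of a tag or category within a user's group carries the same reputation value.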
Which you prefer is, as usual, an engineering tradeoff to be informed by query plans & timings of the actual queries on actual data, usage & statistics (including the expected amount of duplication). One issue is whether the extra rows of the chasm JOIN approach offset its saving of a GROUP BY.
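One rough way to gauge those extra rows is to count, per user, how many rows the chasm join feeds into the final GROUP BY. This is only a sketch: the column names joined_rows, tag_count and category_count are made up for the illustration, and users stands in for q1 because only user_id matters for the count. Prefixing any of the candidate queries with EXPLAIN shows the plan MySQL picks for it.

```sql
-- rows per user that the chasm join hands to the final GROUP BY (sketch)
SELECT
    u.id AS user_id,
    count(*) AS joined_rows,                       -- rows after both LEFT JOINs
    count(DISTINCT q2.tag) AS tag_count,           -- rows q2 contributes on its own
    count(DISTINCT q3.category) AS category_count  -- rows q3 contributes on its own
FROM
    users u
    LEFT JOIN
    (SELECT
        r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
    FROM
        reputations r
        JOIN post_tag pt ON pt.post_id = r.post_id
        JOIN tags t ON t.id = pt.tag_id
    WHERE
        r.date_time > 1500584821
    GROUP BY
        r.user_id, t.tag
    ) AS q2
    ON q2.user_id = u.id
    LEFT JOIN
    (SELECT
        r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
    FROM
        reputations r
        JOIN post_category ct ON ct.post_id = r.post_id
        JOIN categories c ON c.id = ct.category_id
    WHERE
        r.date_time > 1500584821
    GROUP BY
        r.user_id, c.category
    ) AS q3
    ON q3.user_id = u.id
GROUP BY
    u.id
ORDER BY
    joined_rows DESC;
```

For a user with at least one tag and one category (and no NULL tag names), joined_rows should equal tag_count * category_count, which is the multiplication the other approaches avoid.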
-- cumulative LEFT JOIN approach
SELECT
q1.user_id, q1.user_name, q1.score, q1.reputation,
top_two_tags,
substring_index(group_concat(q3.category ORDER BY q3.category_reputation DESC SEPARATOR ','), ',', 2) AS category
FROM
-- your 1st query (less ORDER BY) AS q1
(SELECT
q1.user_id, q1.user_name, q1.score, q1.reputation,
substring_index(group_concat(q2.tag ORDER BY q2.tag_reputation DESC SEPARATOR ','), ',', 2) AS top_two_tags
FROM
(SELECT
u.id AS user_Id,
u.user_name,
coalesce(sum(r.score), 0) as score,
coalesce(sum(r.reputation), 0) as reputation
FROM
users u
LEFT JOIN reputations r
ON r.user_id = u.id
AND r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
GROUP BY
u.id, u.user_name
) AS q1
LEFT JOIN
(
SELECT
r.user_id AS user_id, t.tag, sum(r.reputation) AS tag_reputation
FROM
reputations r
JOIN post_tag pt ON pt.post_id = r.post_id
JOIN tags t ON t.id = pt.tag_id
WHERE
r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
GROUP BY
user_id, t.tag
) AS q2
ON q2.user_id = q1.user_id
GROUP BY
q1.user_id, q1.user_name, q1.score, q1.reputation
) AS q1
-- finish like your 2nd query
LEFT JOIN
(
SELECT
r.user_id AS user_id, c.category, sum(r.reputation) AS category_reputation
FROM
reputations r
JOIN post_category ct ON ct.post_id = r.post_id
JOIN categories c ON c.id = ct.category_id
WHERE
r.date_time > 1500584821 /* unix_timestamp(DATE_SUB(now(), INTERVAL 1 WEEK)) */
GROUP BY
user_id, c.category
) AS q3
ON q3.user_id = q1.user_id
GROUP BY
q1.user_id, q1.user_name, q1.score, q1.reputation
ORDER BY
q1.reputation DESC, q1.score DESC ;