MySQL tagging question: how to select an item that has been tagged as X, Y, and Z?


Question

I'm dealing with a database where items are "tagged" a certain number of times.

item (100k rows)

  • id
  • name
  • other stuff

tag (10k rows)

  • id
  • name

item2tag (1,000,000 rows)

  • item_id
  • tag_id
  • count

I'm looking for the fastest solution to:

Select the items that have been tagged as X, Y, and Z (where X, Y, and Z are (possibly) tag names).

Here's what I have so far... I'd just like to make sure I'm doing it in the best way possible:

First get the tag_ids from the names:

SELECT tag.id FROM tag WHERE name IN ('X','Y','Z');

Then I group by item_id for those tag_ids and use HAVING to filter the result:

SELECT item_id, COUNT(tag_id)
FROM item2tag
WHERE tag_id IN (1, 2, 3)
GROUP BY item_id
HAVING COUNT(tag_id) = 3;

Then I can just select from item with those ids.

SELECT * FROM item WHERE id IN ([results from prior query])

I have millions of rows in item2tag, with an index on (item_id, tag_id). Is this going to be the fastest solution?

Solution

The method you have suggested is probably the most common way to perform the query but might not be the fastest. Using joins can be faster:

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3
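
As a quick sanity check that the two formulations select the same items, here is a minimal sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in for MySQL, and the sample rows and the helper names build_demo_db and matching_items are made up for illustration. It says nothing about speed, only that the results agree.

```python
import sqlite3

# GROUP BY / HAVING formulation from the question.
GROUP_BY_SQL = """
SELECT item_id
FROM item2tag
WHERE tag_id IN (1, 2, 3)
GROUP BY item_id
HAVING COUNT(tag_id) = 3
ORDER BY item_id
"""

# Self-join formulation from the answer.
JOIN_SQL = """
SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3
ORDER BY T1.item_id
"""


def build_demo_db():
    """Create an in-memory item2tag table with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE item2tag (
               item_id INTEGER NOT NULL,
               tag_id  INTEGER NOT NULL,
               PRIMARY KEY (item_id, tag_id)
           )"""
    )
    rows = [
        (1, 1), (1, 2), (1, 3),          # item 1: tags 1, 2, 3
        (2, 1), (2, 2),                  # item 2: tag 3 missing
        (3, 3),                          # item 3: only tag 3
        (4, 1), (4, 2), (4, 3), (4, 7),  # item 4: tags 1, 2, 3 plus an extra
    ]
    conn.executemany("INSERT INTO item2tag VALUES (?, ?)", rows)
    return conn


def matching_items(conn, sql):
    """Run one of the queries and return the item ids as a list."""
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(matching_items(conn, GROUP_BY_SQL))
    print(matching_items(conn, JOIN_SQL))
```

Both queries should list the same item ids for any data in which each (item_id, tag_id) pair occurs at most once.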

You should ensure that you have the following indexes:

  • Primary key on (item_id, tag_id)
  • Index on (tag_id).
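
Apart from speed, the primary key on (item_id, tag_id) also guarantees that an item cannot carry the same tag twice, which the HAVING COUNT(tag_id) = 3 test silently relies on. The sketch below illustrates this under the assumption that SQLite (via Python's sqlite3) behaves like MySQL in this respect; the rows and function names are hypothetical.

```python
import sqlite3

HAVING_COUNT = """
SELECT item_id FROM item2tag
WHERE tag_id IN (1, 2, 3)
GROUP BY item_id
HAVING COUNT(tag_id) = 3
ORDER BY item_id
"""

HAVING_COUNT_DISTINCT = """
SELECT item_id FROM item2tag
WHERE tag_id IN (1, 2, 3)
GROUP BY item_id
HAVING COUNT(DISTINCT tag_id) = 3
ORDER BY item_id
"""


def results_without_primary_key(rows):
    """Load rows into a table with NO uniqueness constraint, run both variants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item2tag (item_id INTEGER NOT NULL, tag_id INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO item2tag VALUES (?, ?)", rows)
    plain = [r[0] for r in conn.execute(HAVING_COUNT)]
    distinct = [r[0] for r in conn.execute(HAVING_COUNT_DISTINCT)]
    return plain, distinct


def primary_key_rejects_duplicate():
    """Return True if a composite primary key refuses a repeated (item, tag) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE item2tag (
               item_id INTEGER NOT NULL,
               tag_id  INTEGER NOT NULL,
               PRIMARY KEY (item_id, tag_id)
           )"""
    )
    conn.execute("INSERT INTO item2tag VALUES (2, 1)")
    try:
        conn.execute("INSERT INTO item2tag VALUES (2, 1)")
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    # Item 2 has tag 1 twice and tag 2 once, but never tag 3.
    rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 1), (2, 2)]
    print(results_without_primary_key(rows))
    print(primary_key_rejects_duplicate())
```

Without the constraint, the plain COUNT variant wrongly reports item 2, while COUNT(DISTINCT tag_id) does not; with the primary key in place the duplicate row cannot be inserted to begin with.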

I performance-tested this query against the original in a few different scenarios.

  • When nearly every item in the table is tagged with at least one of the tags being searched for, the original query takes about 5 seconds and the JOIN version takes about 10 seconds, so the JOIN is roughly twice as slow.
  • When two of the tags occur very frequently and one occurs only very rarely, the original query takes about 0.9 seconds, whereas the JOIN query takes just 0.003 seconds, a considerable performance improvement.

The SQL I used for the performance tests is pasted below. You can run it yourself, or modify it slightly to test other queries or different scenarios.

Warning: Don't run this script on your production database as it modifies the contents of the item2tag table. Running the script can take a few minutes as it creates a lot of data.

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

CALL prc_filler(1000000);

CREATE TABLE item2tag (
    item_id INT NOT NULL,
    tag_id INT NOT NULL,
    count INT NOT NULL
);

INSERT INTO item2tag (item_id, tag_id, count)
SELECT  id % 150001, id % 10, 1
FROM    filler;
ALTER TABLE item2tag ADD PRIMARY KEY (item_id, tag_id);
ALTER TABLE item2tag ADD KEY (tag_id);

-- Make tag 3 occur rarely.    
UPDATE item2tag SET tag_id = 10 WHERE tag_id = 3 AND item_id > 0;

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3;

SELECT item_id
FROM item2tag
WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;
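
If you do not have a spare MySQL instance at hand, the shape of the test can be tried out with the scaled-down sketch below. It is an assumption-laden port, not the original benchmark: it uses SQLite through Python's sqlite3 module, 10,000 rows instead of 1,000,000, a modulus of 1501 instead of 150001, and a made-up index name. Timings from it do not transfer to MySQL; it only reproduces the "tag 3 is rare" data layout and shows both queries agreeing.

```python
import sqlite3
import time

JOIN_SQL = """
SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3
"""

GROUP_BY_SQL = """
SELECT item_id
FROM item2tag
WHERE tag_id = 1 OR tag_id = 2 OR tag_id = 3
GROUP BY item_id
HAVING COUNT(tag_id) = 3
"""


def build_test_db(n_rows=10_000, n_items=1501):
    """Fill item2tag the way the script does, at a smaller (assumed) scale.

    n_items must share no factor with 10 and n_rows must stay below
    10 * n_items, otherwise (item_id, tag_id) pairs would repeat.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE item2tag (
               item_id INTEGER NOT NULL,
               tag_id  INTEGER NOT NULL,
               "count" INTEGER NOT NULL,
               PRIMARY KEY (item_id, tag_id)
           )"""
    )
    rows = ((i % n_items, i % 10, 1) for i in range(1, n_rows + 1))
    conn.executemany("INSERT INTO item2tag VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_item2tag_tag ON item2tag (tag_id)")
    # Make tag 3 occur rarely: afterwards only item 0 still carries it.
    conn.execute("UPDATE item2tag SET tag_id = 10 WHERE tag_id = 3 AND item_id > 0")
    conn.commit()
    return conn


def timed(conn, sql):
    """Return (sorted item ids, elapsed seconds) for one query."""
    start = time.perf_counter()
    items = sorted(row[0] for row in conn.execute(sql))
    return items, time.perf_counter() - start


if __name__ == "__main__":
    conn = build_test_db()
    for label, sql in (("JOIN", JOIN_SQL), ("GROUP BY", GROUP_BY_SQL)):
        items, seconds = timed(conn, sql)
        print(f"{label}: {len(items)} item(s) in {seconds:.6f} s")
```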
