使用Count获取在HTML列中找到的最常见的前10个项目 [英] Get most frequent top 10 items found in a HTML column using Count
问题描述
我的查询有点混乱,试图找出答案.
I have a bit of a messy query to try figure out.
我有一列称为"meta_value",并且其中有一些HTML数据,例如:
I have a column called "meta_value" and in that I have some HTML data such as:
<tr class="child-row123"><td class="monsters">Monster</td><td class="monsters"><a data-name="Zure, Knight of Dark World">Zure, Knight of Dark World</a> x1</br><a data-name="Dark Necrofear">Dark Necrofear</a> x1</br><a data-name="Grapha, Dragon Lord of Dark World">Grapha, Dragon Lord of Dark World</a> x3</br><a data-name="Reign-Beaux, Overlord of Dark World">Reign-Beaux, Overlord of Dark World</a> x1</br><a data-name="Goldd, Wu-Lord of Dark World">Goldd, Wu-Lord of Dark World</a> x1</br><a data-name="Sillva, Warlord of Dark World">Sillva, Warlord of Dark World</a> x3</br><a data-name="Beiige, Vanguard of Dark World">Beiige, Vanguard of Dark World</a> x2</br><a data-name="Brron, Mad King of Dark World">Brron, Mad King of Dark World</a> x3</br><a data-name="Trance Archfiend">Trance Archfiend</a> x3</br><a data-name="Kahkki, Guerilla of Dark World">Kahkki, Guerilla of Dark World</a> x3</br><a data-name="Scarr, Scout of Dark World">Scarr, Scout of Dark World</a> x2</br></td></tr><tr class="child-row123"><td class="spells">Spells</td><td class="spells"><a data-name="Dark Core">Dark Core</a> x1</br><a data-name="Dark World Dealings">Dark World Dealings</a> x2</br><a data-name="Dark World Lightning">Dark World Lightning</a> x3</br><a data-name="The Gates of Dark World">The Gates of Dark World</a> x3</br><a data-name="Gateway to Dark World">Gateway to Dark World</a> x2</br></td></tr><tr class="child-row123"><td class="traps">Traps</td><td class="traps"><a data-name="Dark Scheme">Dark Scheme</a> x1</br><a data-name="The Forces of Darkness">The Forces of Darkness</a> x1</br><a data-name="Dark Smog">Dark Smog</a> x1</br><a data-name="Dark World Brainwashing">Dark World Brainwashing</a> x3</br><a data-name="Depth Amulet">Depth Amulet</a> x1</br></td></tr><tr class="child-row123"><td class="extra">Extra</td><td class="extra"><a data-name="Number 23: Lancelot, Dark Knight of the Underworld">Number 23: Lancelot, Dark Knight of the Underworld</a> x1</br></td></tr><tr class="child-row123"><td class="id">Id</td><td class="id"></td></tr>
我有6000多个这样的行,我试图在不杀死服务器的情况下找到前10个最常出现的值.
I have over 6000 of these rows and I'm trying to find the top 10 most frequent values found in them without killing the server.
例如,通过执行以下操作,我可以发现一张名为"LANphorhynchus"的卡出现了多少次:
As an example, I can find how many times a card called "LANphorhynchus" appears by doing the following:
SELECT COUNT(DISTINCT(meta_value))
FROM wph3_postmeta
WHERE meta_value LIKE "%>LANphorhynchus</a>%"
AND meta_key = "deck_list"
哪个很棒!但是我想尝试找出前10名,但我不知道哪些名字在前10名中.
Which is great! But I want to try figure it out for a top 10 where I don't specifically know which names are in the top 10.
这甚至应该在MySQL中执行吗?
Is this something that should even be performed within MySQL?
说明:
如上所述,meta_value列包含完整的HTML标记.每个卡名的插入位置为:
As stated, the meta_value column contains complete HTML markup. Each card name is inserted here as:
<a data-name="LANphorhynchus">LANphorhynchus</a>
,并且每一行在此meta_value列中都有多个卡名(请参阅顶部,以获取连续数据的完整示例).因此,如果要特定的卡(我需要在标签之间进行搜索),我将使用"%>LANphorhynchus</a>%"
在MySQL中搜索它们.
and each row has multiple card names within this meta_value column (see top for full example of data in a row). So I am searching for them in MySQL with "%>LANphorhynchus</a>%"
if I want a specific card (I need to search in between tags).
但是.我想尝试从我不知道卡名称的所有行中获取在此列中找到的前10张卡. 因此,我可能认为可能需要一些 regex 来在每行的标签之间进行搜索,并找出最常出现在哪张卡片上?
However. I want to try attempt to get the Top 10 cards found in this column from all rows where I don't know the card name. So I'm possibly thinking some regex might be needed to search in between the tags for each row and find which card appears most often?
答案: 多亏了Midwinter86,我得以从一个新的角度看待这个问题.
ANSWER: Thanks to Midwinter86, I was able to look at this from a new perspective.
我要做的是以下几点.
What I have done is the following.
对数据库中访问量最高的100张卡片进行查询,然后将名称插入数组.
Run a query on top 100 most viewed cards in my database and insert the names into an array.
在一个循环中,专门针对meta_values
表检查每个名称,以查看它们出现的频率(使用上面的我的LANphorhynchus示例).
Within a loop, check each of those names specifically against the meta_values
table to see how often they appear (using my LANphorhynchus example above).
将所有数据提取到新表top_10_cards中.
Extract all data to a new table top_10_cards.
将数据缓存1周,然后重新运行.
Cache data for 1 week and then re-run.
输出:
+-----------------------------+---------+
| name | count |
+-----------------------------+---------+
| Monster Reborn | 2842 |
| Ash Blossom & Joyous Spring | 2587 |
| Knightmare Phoenix | 2231 |
| Borreload Dragon | 1975 |
+-----------------------------+---------+
推荐答案
以下是纯MySQL解决方案;您可以在非高峰时段每天运行一次此查询(或两次),以更新缓存/摘要表中的计数.此外,行数大约为6000(仅)左右,因此(取决于您的服务器配置),它应该不会引起性能问题.
Following is purely MySQL-only solution; you can run this query once (or twice) a day in off-peak hours, to update the count in a cache/summary table. Moreover, number of rows are roughly around 6000 (only), so (depending on your server configuration), it should not cause performance issues.
现在,由于特定行中的纸牌数量是可变的(范围为40-60),因此我们可以使用序列表.您可以在数据库中定义一个永久表,该表存储从1到100的整数(您也许也会发现此表在许多其他情况下也很有用):
Now, since the number of cards in a particular row is variable (can range from 40-60), we can use a Sequence table. You can define a permanent table in your database storing integers ranging from 1 to 100 (you may find this table helpful in many other cases as well):
CREATE TABLE seq (n tinyint(3) UNSIGNED NOT NULL, PRIMARY KEY(n));
INSERT INTO seq (n) VALUES (1), (2), ...... , (99), (100);
现在,我们将基于特定meta_value
中子字符串'data-name=""'
的出现次数,在wph3_postmeta
和seq
表之间执行JOIN
. 我们可以使用以下方法获取子字符串的出现次数(这也意味着特定行中的卡片数量):
Now, we will do a JOIN
between wph3_postmeta
and seq
table, based on the count of occurrence of substring 'data-name=""'
inside the specific meta_value
. We can get the count of occurrence of the substring (which also means, count of cards in a particular row) using:
(
CHAR_LENGTH(wp.meta_value)
- CHAR_LENGTH(REPLACE(wp.meta_value, 'data-name=""', ''))
) / CHAR_LENGTH('data-name=""')
现在,我们可以使用
Now, we can use the Substring_Index()
function to extract the card values out. Using the different numbers in different row, we can basically extract out the first card, second card, and so on...
一旦我们将所有单词提取出来,放在单独的行中;然后,我们可以将完整的结果集用作派生表,然后执行汇总查询以获取所需结果:
Once we have extracted all the words out, in separate rows; we can then use the complete result-set as a Derived Table, and perform the aggregation queries to get the required results:
查询(在数据库小提琴上查看)
Query (View on DB Fiddle)
SELECT dt.name,
Count(DISTINCT dt.meta_id) AS unique_metaid_count
FROM (SELECT wp.meta_id,
Substring_index(Substring_index(wp.meta_value, 'data-name=""',
-seq.n),
'"">', 1
) AS name
FROM wph3_postmeta AS wp
JOIN seq
ON ( Char_length(wp.meta_value) - Char_length(
REPLACE(wp.meta_value,
'data-name=""'
,
'')) ) /
Char_length('data-name=""') >= n
WHERE wp.meta_key = 'deck_list') AS dt
GROUP BY dt.name
ORDER BY unique_metaid_count DESC
/* To get top 10 counts only, add LIMIT 10 */
结果
| name | unique_metaid_count |
| --------------------------------------------- | ------------------- |
| Call of the Haunted | 2 |
| Inferno Reckless Summon | 2 |
| Mystic Box | 2 |
| Mystical Space Typhoon | 2 |
| Number 39: Utopia | 2 |
| #created by ygopro2 | 1 |
| 98095162 | 1 |
| Abyss Dweller | 1 |
| Advanced Ritual Art | 1 |
| Armed Dragon LV3 | 1 |
| Armed Dragon LV5 | 1 |
| Axe of Despair | 1 |
| B.E.S. Covered Core | 1 |
.....
| The Dragon Dwelling in the Cave | 1 |
| The Flute of Summoning Dragon | 1 |
| The Forces of Darkness | 1 |
| Threatening Roar | 1 |
| Time Machine | 1 |
| Torike | 1 |
| Tornado Dragon | 1 |
| Torrential Tribute | 1 |
| Tragoedia | 1 |
| Trap Hole | 1 |
| Treeborn Frog | 1 |
| Trishula, Dragon of the Ice Barrier | 1 |
| Twin Twisters | 1 |
| Vanity's Ruler | 1 |
| Wind-Up Snail | 1 |
| Wind-Up Soldier | 1 |
| Wulf, Lightsworn Beast | 1 |
| Zure, Knight of Dark World | 1 |
注意::如果您只想按数量排名前10名,只需在查询末尾添加LIMIT 10
.
Note: If you want Top 10 only (by count), you can simply add LIMIT 10
at the end of the query.
这篇关于使用Count获取在HTML列中找到的最常见的前10个项目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!