Getting most used words from a column of strings in SQL
So we have this database filled with a bunch of strings, in this case post titles.
What I want to do is:
- Split the strings up into words
- Count how many times each word appears across the strings
- Give me the top 50 words
- Not have this time out when run as a data.se query
I tried using the info from this SO question adapted to data.se as follows:
select word, count(*) from (
select (case when instr(substr(p.Title, nums.n+1), ' ') then substr(p.Title, nums.n+1)
else substr(p.Title, nums.n+1, instr(substr(p.Title, nums.n+1), ' ') - 1)
end) as word
from (select ' '||Title as string
from Posts p
)Posts cross join
(select 1 as n union all select 2 union all select 10
) nums
where substr(p.Title, nums.n, 1) = ' ' and substr(p.Title, nums.n, 1) <> ' '
) w
group by word
order by count(*) desc
Unfortunately, this gives me a slew of errors:
'substr' is not a recognized built-in function name.
Incorrect syntax near '|'.
Incorrect syntax near 'nums'.
So given a column of strings in SQL with a variable amount of text in each string, how can I get a list of the most frequently used X words?
As Blogbeard said, the query you provided does not work with SQL Server. Here is one way to count the most used words. It is based on a function, DelimitedSplitN4K, written by Jeff Moden and improved by members of the SQL Server Central community.
WITH E1(N) AS (
SELECT 1 FROM (VALUES
(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT TOP 50
x.Item,
COUNT(*)
FROM Posts p
CROSS APPLY (
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = LTRIM(RTRIM(SUBSTRING(p.Title, l.N1, l.L1)))
FROM (
SELECT s.N1,
L1 = ISNULL(NULLIF(CHARINDEX(' ',p.Title,s.N1),0)-s.N1,4000)
FROM(
SELECT 1 UNION ALL
SELECT t.N+1
FROM(
SELECT TOP (ISNULL(DATALENGTH(p.Title)/2,0))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4
) t(N)
WHERE SUBSTRING(p.Title ,t.N,1) = ' '
) s(N1)
) l(N1, L1)
) x
WHERE x.Item <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC
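To see what the splitting part does on its own, here is a minimal sketch that applies the same tally-table technique to a single string. The variable @Title, its sample value, and the CTE names Tally and Starts are made up for illustration and are not part of the query above.

```sql
-- Hypothetical sample title; not taken from the Posts table.
DECLARE @Title NVARCHAR(4000) = N'How to split a string';

WITH E1(N) AS (
    SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),   -- 10 x 10 = 100 rows
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),   -- 100 x 100 = 10,000 rows
Tally(N) AS (
    -- One row per character position in @Title.
    -- DATALENGTH returns bytes; NVARCHAR uses 2 bytes per character here.
    SELECT TOP (ISNULL(DATALENGTH(@Title) / 2, 0))
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM E4
),
Starts(N1) AS (
    -- A word starts at position 1 and right after every space.
    SELECT 1 UNION ALL
    SELECT t.N + 1 FROM Tally t WHERE SUBSTRING(@Title, t.N, 1) = N' '
)
SELECT
    s.N1 AS StartPos,
    -- Length runs up to the next space; if there is none, take the rest (4000 as the cap).
    SUBSTRING(@Title, s.N1,
        ISNULL(NULLIF(CHARINDEX(N' ', @Title, s.N1), 0) - s.N1, 4000)) AS Item
FROM Starts s
ORDER BY s.N1;
```

For the sample value this returns one row per word (How, to, split, a, string), starting at positions 1, 5, 8, 14 and 16. The query against Posts does the same thing for every title through CROSS APPLY and then groups the items.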
Since creating functions is not allowed on data.se, the query against Posts above has the splitter logic written inline. Here is the function definition if you're interested:
CREATE FUNCTION [dbo].[DelimitedSplitN4K](
@pString NVARCHAR(4000),
@pDelimiter NCHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
E4(N) AS (SELECT 1 FROM E2 a, E2 b),
cteTally(N) AS(
SELECT TOP (ISNULL(DATALENGTH(@pString)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,4000)
FROM cteStart s
)
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
And here is how you would use it:
SELECT TOP 50
x.Item,
COUNT(*)
FROM Posts p
CROSS APPLY dbo.DelimitedSplitN4K(p.Title, ' ') x
WHERE LTRIM(RTRIM(x.Item)) <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC
The result:
Item
-------- -------
to 3812411
in 3331522
a 2543636
How 1770915
the 1534298
with 1341632
of 1297468
and 1166664
on 970554
from 964449
for 886007
not 835979
is 704724
using 703007
I 633838
- 632441
an 548450
when 449169
file 409717
how 358745
data 335271
do 323854
can 310298
get 305922
or 266317
error 263563
use 258408
value 254392
it 251254
my 238902
function 235832
by 231025
Android 228308
as 216654
array 209157
working 207445
does 207274
Is 205613
multiple 203336
that 197826
Why 196979
into 196591
after 192056
string 189053
PHP 187018
one 182360
class 179965
if 179590
text 174878
table 169393
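In the result above, How and how (and Is and is) are counted as separate items, which suggests the comparison is case-sensitive here. If that is not wanted, one possible variation is to fold each word to lower case before grouping. This is only a sketch: it assumes the dbo.DelimitedSplitN4K function above exists, and the column aliases Word and Occurrences are illustrative names that do not appear in the original answer.

```sql
-- Sketch: same query as before, but grouping on the lower-cased word
-- so that 'How' and 'how' fall into one group.
SELECT TOP 50
    LOWER(x.Item) AS Word,
    COUNT(*)      AS Occurrences
FROM Posts p
CROSS APPLY dbo.DelimitedSplitN4K(p.Title, ' ') x
WHERE LTRIM(RTRIM(x.Item)) <> ''
GROUP BY LOWER(x.Item)
ORDER BY COUNT(*) DESC;
```

Where functions cannot be created, the same LOWER() wrapping could be applied to Item in the inline version of the query instead.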