How to get the top ten most frequently occurring words from SQL table columns


Problem description


I need to extract the 10 most frequently repeated words from SQL Server table columns.

Col1_Row1 = first column and first row

E.g.
Col1_Row1 "This is good bike.",
Col1_Row2 "Well, I like bike rides.",
Col2_Row1 "Only Harley can makes real good bike."


Here "bike" is repeated 3 times, "good" 2 times and every other word once.
So I need the output in table form, containing each word and its repeat count.

The data may grow over time. Please suggest the best approach.

Solution

In addition to Solution 1:

- Picking up on your statement

Quote:

The data may grow over time

There is an article at sqlservercentral that discusses the performance of various string splitting alternatives.

- Picking up on Maciej spotting the full stop in the sentence, this post suggests a nice method of getting rid of all non-alphabetic characters.

- I frequently use the following function to split data as it is reasonably performant. Note that it's not my own work and unfortunately I can't find the original at the moment (if I do, or someone else does, then I will credit the work here).

CREATE FUNCTION [dbo].[fnSplitString]
(
    @string NVARCHAR(MAX),
    @delimiter CHAR(1)
)
RETURNS @output TABLE (splitdata NVARCHAR(MAX))
BEGIN
    -- @start is where the current token begins; @end is the position of the next delimiter
    DECLARE @start INT, @end INT
    SELECT @start = 1, @end = CHARINDEX(@delimiter, @string)
    WHILE @start < LEN(@string) + 1 BEGIN
        -- No further delimiter: treat the end of the string as the token boundary
        IF @end = 0
            SET @end = LEN(@string) + 1

        INSERT INTO @output (splitdata)
        VALUES (SUBSTRING(@string, @start, @end - @start))
        -- Step past the delimiter and look for the next one
        SET @start = @end + 1
        SET @end = CHARINDEX(@delimiter, @string, @start)
    END
    RETURN
END
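
As a quick check of how the function behaves, the call below splits one of the sample sentences on spaces. It is only a minimal sketch and assumes the function above has already been created in the current database.

```sql
-- Split a sample sentence on single spaces using the function above.
SELECT splitdata
FROM dbo.fnSplitString(N'This is good bike.', ' ');
-- Returns four rows: 'This', 'is', 'good', 'bike.'
```

Note that the full stop stays attached to the last word, which is why a clean-up step is still needed before counting.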



- I went down the route of using temporary tables, which might not be best for large tables. I'm including my efforts here only to demonstrate the calls to the functions I've suggested. (It also demonstrates an alternative to using a CURSOR, if you ever get tempted down that route.)

create table #temp
(
	rownum int,
	datacol varchar(max)
) ON [PRIMARY]

-- Number the source rows so they can be visited one at a time
insert into #temp
SELECT rownum = ROW_NUMBER() OVER (order by id), datacol
from topwords

DECLARE @wordlist table (word varchar(max))

declare @maxi int
SELECT @maxi = COUNT(*) FROM topwords

DECLARE @loopCount int = 1
DECLARE @txt varchar(max)
WHILE @loopCount <= @maxi
BEGIN
	SELECT @txt = datacol from #temp WHERE rownum = @loopCount
	-- fnRemoveNonAlpha is the function from the post referenced above; it is not defined here
	INSERT INTO @wordlist SELECT [dbo].fnRemoveNonAlpha(splitdata) FROM [dbo].fnSplitString(@txt,' ')
	SET @loopCount = @loopCount + 1
END

-- Count how often each cleaned word occurs, most frequent first
select word, count(*) from @wordlist
group by word
order by count(*) desc
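
The loop above calls dbo.fnRemoveNonAlpha, which comes from the linked post and is not reproduced here. The sketch below is only a guess at what such a function might look like (a common PATINDEX/STUFF pattern), followed by a set-based query that does the same work as the loop and limits the result to ten rows. It assumes the topwords table (with id and datacol columns) and dbo.fnSplitString from above. GO is the batch separator used by SSMS and sqlcmd.

```sql
-- Hypothetical stand-in for fnRemoveNonAlpha; the real one from the linked post may differ.
CREATE FUNCTION dbo.fnRemoveNonAlpha (@value VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- Find the first character that is not a letter and delete it, until none remain
    DECLARE @pos INT = PATINDEX('%[^a-zA-Z]%', @value);
    WHILE @pos > 0
    BEGIN
        SET @value = STUFF(@value, @pos, 1, '');
        SET @pos = PATINDEX('%[^a-zA-Z]%', @value);
    END
    RETURN @value;
END
GO

-- Set-based version of the loop: split every row, clean each token,
-- then keep the ten most frequent words.
SELECT TOP (10)
       c.word,
       COUNT(*) AS word_count
FROM topwords AS t
CROSS APPLY dbo.fnSplitString(t.datacol, ' ') AS s
CROSS APPLY (SELECT dbo.fnRemoveNonAlpha(s.splitdata) AS word) AS c
WHERE c.word <> ''          -- drop tokens that were pure punctuation or empty
GROUP BY c.word
ORDER BY word_count DESC, c.word;
```

This removes the temporary table and the row-by-row loop, although scalar and multi-statement functions can still be slow on large tables, so it is worth measuring against your own data.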


If you expect the data to grow and performance is an issue, you should consider adding a full-text index on the table.

Then you can query it using one of these dynamic management functions: sys.dm_fts_index_keywords, sys.dm_fts_index_keywords_by_document or sys.dm_fts_index_keywords_by_property.
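
As a rough illustration of the full-text route, the query below sums the per-row occurrence counts reported by sys.dm_fts_index_keywords_by_document. It is a sketch under assumptions: a full-text index must already exist on the table, and the table name dbo.topwords is borrowed from the other solution rather than from the question.

```sql
-- Assumes a full-text index already exists on dbo.topwords (hypothetical table name).
SELECT TOP (10)
       display_term,
       SUM(occurrence_count) AS occurrences
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID(N'dbo.topwords'))
WHERE display_term <> N'END OF FILE'   -- marker row, not a real word
GROUP BY display_term
ORDER BY occurrences DESC;
```

Keep in mind that a full-text index typically applies a stoplist, so common words such as "is" or "this" may not appear in the result at all.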


All you need to do is split each sentence into words, using either a CTE or a custom function.

Try this (read comments):

DECLARE @tmp TABLE (col1 VARCHAR(255), col2 VARCHAR(255))

INSERT INTO @tmp (col1, col2)
VALUES('This is good bike.', 'Only Harley can makes real good bike.'),
('Well, I like bike rides.', NULL)

;WITH CTE AS
(
	--initial part
	--get the first word from col1 as word and the rest as remainder
	SELECT LEFT(col1, CHARINDEX(' ', col1)-1) AS word, RIGHT(col1, LEN(col1) - CHARINDEX(' ', col1)) AS remainder
	FROM @tmp
	WHERE CHARINDEX(' ', col1)>0
	UNION ALL
	--get the first word from col2 as word and the rest as remainder
	SELECT LEFT(col2, CHARINDEX(' ', col2)-1) AS word, RIGHT(col2, LEN(col2) - CHARINDEX(' ', col2)) AS remainder
	FROM @tmp
	WHERE CHARINDEX(' ', col2)>0
	UNION ALL
	--recursive part
	--get the next word
	SELECT LEFT(remainder, CHARINDEX(' ', remainder)-1) AS word, RIGHT(remainder, LEN(remainder) - CHARINDEX(' ', remainder)) AS remainder
	FROM CTE
	WHERE CHARINDEX(' ', remainder)>0
	UNION ALL
	--last word: nothing left to split (the explicit cast keeps the column type identical to the anchor's)
	SELECT remainder AS word, CAST(NULL AS VARCHAR(255)) AS remainder
	FROM CTE
	WHERE CHARINDEX(' ', remainder)=0
)
--remove dot and count words ;)
SELECT REPLACE(word, '.', '') AS word, COUNT(word) AS CountOfWord
FROM CTE
GROUP BY REPLACE(word, '.', '')
ORDER BY COUNT(word) DESC
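
On SQL Server 2016 and later (database compatibility level 130 or higher), the built-in STRING_SPLIT function can stand in for the recursive CTE. The sketch below is a different technique from the one above, not a rewrite of it. It reuses the same sample data, strips both full stops and commas, and limits the result to ten rows. It also sidesteps the default recursion limit of 100, which the CTE would hit on sentences longer than about a hundred words unless OPTION (MAXRECURSION n) is added.

```sql
DECLARE @tmp TABLE (col1 VARCHAR(255), col2 VARCHAR(255));

INSERT INTO @tmp (col1, col2)
VALUES ('This is good bike.', 'Only Harley can makes real good bike.'),
       ('Well, I like bike rides.', NULL);

SELECT TOP (10)
       c.word,
       COUNT(*) AS CountOfWord
FROM @tmp AS t
-- Unpivot the two text columns into one column of sentences
CROSS APPLY (VALUES (t.col1), (t.col2)) AS v(sentence)
-- STRING_SPLIT returns no rows for a NULL sentence
CROSS APPLY STRING_SPLIT(v.sentence, ' ') AS s
-- Strip full stops and commas before counting
CROSS APPLY (SELECT REPLACE(REPLACE(s.value, '.', ''), ',', '') AS word) AS c
WHERE c.word <> ''
GROUP BY c.word
ORDER BY CountOfWord DESC, c.word;
```

With the sample rows, "bike" (3) and "good" (2) come first, followed by words that occur once.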



For further information, please see: Using Common Table Expressions

