How to get the top ten most frequently occurring words from SQL table columns
Problem description
I need to extract the top 10 most frequently repeated words from the columns of a SQL Server table.
Col1_Row1 = first column, first row
For example:
Col1_Row1 "This is good bike.",
Col1_Row2 "Well, I like bike rides.",
Col2_Row1 "Only Harley can makes real good bike."
Here "bike" is repeated 3 times, "good" 2 times, and every other word once.
So I need the output in table form, containing each word and its repeat count.
The data may grow over time. Please suggest the best approach.
In addition to Solution 1
- Picking up on your statement that the data will grow over time: there is an article at sqlservercentral that discusses the performance of various string-splitting alternatives.
- Picking up on Maciej spotting the full stop in the sentence: a separate post suggests a nice method of getting rid of all non-alphabetic characters.
- I frequently use the following function to split data, as it is reasonably performant. Note that it's not my own work, and unfortunately I can't find the original at the moment (if I do, or someone else does, I will credit the work here).
CREATE FUNCTION [dbo].[fnSplitString]
(
    @string NVARCHAR(MAX),
    @delimiter CHAR(1)
)
RETURNS @output TABLE (splitdata NVARCHAR(MAX))
BEGIN
    DECLARE @start INT, @end INT
    -- @start is where the current token begins; @end is the position of the next delimiter
    SELECT @start = 1, @end = CHARINDEX(@delimiter, @string)
    WHILE @start < LEN(@string) + 1
    BEGIN
        -- No further delimiter: the last token runs to the end of the string
        IF @end = 0
            SET @end = LEN(@string) + 1
        INSERT INTO @output (splitdata)
        VALUES (SUBSTRING(@string, @start, @end - @start))
        SET @start = @end + 1
        SET @end = CHARINDEX(@delimiter, @string, @start)
    END
    RETURN
END
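To see what the function returns, it can be called on one of the sample sentences. This is only a usage sketch and assumes the function above has already been created in the dbo schema:

```sql
-- Split one sample sentence on spaces; each word comes back as its own row.
SELECT splitdata
FROM dbo.fnSplitString(N'Well, I like bike rides.', ' ');
-- Punctuation stays attached to the words ("Well," and "rides."),
-- so a clean-up step is still needed before counting.
```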
- I went down the route of using a temporary table, which might not be the best choice for large tables. I'm including my effort here only to demonstrate calls to the functions I've suggested. (It also demonstrates an alternative to using a CURSOR, if you are ever tempted down that route.)
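-- Copy the source rows into a temp table with a sequential row number
CREATE TABLE #temp
(
    rownum INT,
    datacol VARCHAR(MAX)
) ON [PRIMARY]

INSERT INTO #temp
SELECT rownum = ROW_NUMBER() OVER (ORDER BY id), datacol
FROM topwords

DECLARE @wordlist TABLE (word VARCHAR(MAX))
DECLARE @maxi INT
SELECT @maxi = COUNT(*) FROM topwords
DECLARE @loopCount INT = 1
DECLARE @txt VARCHAR(MAX)

-- Walk the rows by row number instead of using a cursor
WHILE @loopCount <= @maxi
BEGIN
    -- Read from #temp (the original snippet referenced #temp1, which is never created)
    SELECT @txt = datacol FROM #temp WHERE rownum = @loopCount
    -- fnRemoveNonAlpha is the non-alpha clean-up function from the post mentioned above (not shown here)
    INSERT INTO @wordlist
    SELECT [dbo].fnRemoveNonAlpha(splitdata) FROM [dbo].fnSplitString(@txt, ' ')
    SET @loopCount = @loopCount + 1
END

-- Count the occurrences of each word, most frequent first
SELECT word, COUNT(*)
FROM @wordlist
GROUP BY word
ORDER BY COUNT(*) DESC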
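If the loop becomes a bottleneck, the same result can probably be reached in a single set-based statement. The sketch below assumes the topwords(id, datacol) table and the dbo.fnSplitString function from above. The nested REPLACE calls are a simplified stand-in for fnRemoveNonAlpha (they strip only full stops and commas), and TOP (10) limits the output to the ten words the question asks for:

```sql
-- Set-based sketch: no temp table and no loop.
-- Assumes topwords(id, datacol) and dbo.fnSplitString exist as described above.
SELECT TOP (10)
       w.word,
       COUNT(*) AS word_count
FROM topwords AS t
CROSS APPLY dbo.fnSplitString(t.datacol, ' ') AS s      -- one row per word
CROSS APPLY (SELECT REPLACE(REPLACE(s.splitdata, '.', ''), ',', '') AS word) AS w
WHERE w.word <> ''                                      -- skip empty tokens
GROUP BY w.word
ORDER BY COUNT(*) DESC, w.word;
```

Whether "Bike" and "bike" are counted together depends on the collation of the column; under a case-insensitive collation they fall into the same group.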
If you expect the data to grow and performance is an issue, you should consider adding a full-text index on the table.
Then you can query it using one of these dynamic management functions: sys.dm_fts_index_keywords, sys.dm_fts_index_keywords_by_document, or sys.dm_fts_index_keywords_by_property.
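A rough sketch of that approach follows. The catalog name ftWords and the key index name PK_topwords are hypothetical; the key index has to be an existing unique, single-column, non-nullable index on the table. The query sums occurrence_count from sys.dm_fts_index_keywords_by_document, because document_count in sys.dm_fts_index_keywords counts rows containing a term rather than total occurrences:

```sql
-- Sketch only: ftWords and PK_topwords are hypothetical names.
CREATE FULLTEXT CATALOG ftWords;
GO

CREATE FULLTEXT INDEX ON dbo.topwords (datacol)
    KEY INDEX PK_topwords
    ON ftWords;
GO

-- Population runs asynchronously, so this may return nothing
-- until the index has finished populating.
SELECT TOP (10)
       display_term,
       SUM(occurrence_count) AS total_occurrences
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID(N'dbo.topwords'))
WHERE display_term <> N'END OF FILE'   -- marker row, not a real word
GROUP BY display_term
ORDER BY total_occurrences DESC;
```

Keep in mind that a full-text index applies a stoplist by default, so very common words such as "is" or "this" would normally not appear in the results.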
All you need to do is split each sentence into words, using either a recursive CTE or a custom function.
Try this (read comments):
DECLARE @tmp TABLE (col1 VARCHAR(255), col2 VARCHAR(255))

INSERT INTO @tmp (col1, col2)
VALUES ('This is good bike.', 'Only Harley can makes real good bike.'),
       ('Well, I like bike rides.', NULL)

;WITH CTE AS
(
    -- initial part
    -- get the first word from col1 as word and the rest as remainder
    SELECT LEFT(col1, CHARINDEX(' ', col1) - 1) AS word,
           RIGHT(col1, LEN(col1) - CHARINDEX(' ', col1)) AS remainder
    FROM @tmp
    WHERE CHARINDEX(' ', col1) > 0
    UNION ALL
    -- get the first word from col2 as word and the rest as remainder
    SELECT LEFT(col2, CHARINDEX(' ', col2) - 1) AS word,
           RIGHT(col2, LEN(col2) - CHARINDEX(' ', col2)) AS remainder
    FROM @tmp
    WHERE CHARINDEX(' ', col2) > 0
    UNION ALL
    -- recursive part
    -- get the next word
    SELECT LEFT(remainder, CHARINDEX(' ', remainder) - 1) AS word,
           RIGHT(remainder, LEN(remainder) - CHARINDEX(' ', remainder)) AS remainder
    FROM CTE
    WHERE CHARINDEX(' ', remainder) > 0
    UNION ALL
    -- last word: no space left in the remainder
    -- (the typed NULL keeps the column type consistent across the CTE members)
    SELECT remainder AS word, CAST(NULL AS VARCHAR(255)) AS remainder
    FROM CTE
    WHERE CHARINDEX(' ', remainder) = 0
)
-- remove dots and count words
SELECT REPLACE(word, '.', '') AS word, COUNT(word) AS CountOfWord
FROM CTE
GROUP BY REPLACE(word, '.', '')
ORDER BY COUNT(word) DESC
For further information, please see: Using Common Table Expressions
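On SQL Server 2016 or later (database compatibility level 130 or higher), the built-in STRING_SPLIT function could take the place of the recursive CTE. The following is a sketch under that assumption, using the same sample data; it is not part of the original answer. It also strips commas and limits the output to ten rows:

```sql
-- Sketch for SQL Server 2016+ using the built-in STRING_SPLIT.
DECLARE @tmp TABLE (col1 VARCHAR(255), col2 VARCHAR(255));

INSERT INTO @tmp (col1, col2)
VALUES ('This is good bike.', 'Only Harley can makes real good bike.'),
       ('Well, I like bike rides.', NULL);

SELECT TOP (10)
       c.word,
       COUNT(*) AS CountOfWord
FROM @tmp AS t
CROSS APPLY (VALUES (t.col1), (t.col2)) AS v(txt)   -- turn both columns into rows
CROSS APPLY STRING_SPLIT(v.txt, ' ') AS s           -- one row per word; NULL input yields no rows
CROSS APPLY (SELECT REPLACE(REPLACE(s.value, '.', ''), ',', '') AS word) AS c
WHERE c.word <> ''
GROUP BY c.word
ORDER BY COUNT(*) DESC, c.word;
```

With the sample data, "bike" (3 occurrences) and "good" (2 occurrences) come out at the top. Unlike the recursive CTE, this form is not subject to the default recursion limit of 100, which long sentences could otherwise hit.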