SQL Server 2008中列中出现的所有单词的单词计数 [英] Word count for all the words appearing in a column in SQL Server 2008
问题描述
我有一个名为'ticket_diary_comment'
的表,其中有一个名为'comment_text'
的列.此列中填充了文本数据.我想知道整个专栏中出现的所有单词的出现频率.例如:
I have a table called 'ticket_diary_comment'
with a column called 'comment_text'
. This column is populated with text data. I would like to get the frequency of all the words occurring in this entire column. Ex:
Comment_Text
I am a good guy
I am a bad guy
I am not a guy
我想要什么:
Word Frequency
I 3
good 1
bad 1
not 1
guy 3
请注意,我还删除了输出中的停用词.我知道计算特定单词的频率并不困难,但我正在寻找一种方法,可以计算出出现在停用词中的所有单词.
Notice that I have also removed the stop words in the output. I know calculating the frequency of a particular word is not difficult but I am looking for something that counts all the words appearing in a column removing the stop words.
在此问题上的任何帮助,我将不胜感激. 我还要提及的是,我必须在大型数据集(大约1 TB)上应用此查询,因此性能令人担忧.
I would appreciate any kind of help on this issue. I would also like to mention that I have to apply this query on a big-ish dataset (about 1 TB), so performance is a concern.
推荐答案
我将使用表值函数拆分字符串,然后在查询中将它们分组.像这样:
I would use a table valued function to split the strings, and then group them in a query. Something like this:
SELECT item, count(1)
FROM ticket_diary_comment
CROSS APPLY dbo.fn_SplitString(comment_text, ' ')
GROUP BY item
以及fn_SplitString
的定义:
CREATE FUNCTION [dbo].[fn_SplitString]
(
@String VARCHAR(8000),
@Delimiter VARCHAR(255)
)
RETURNS
@Results TABLE
(
ID INT IDENTITY(1, 1),
Item VARCHAR(8000)
)
AS
BEGIN
INSERT INTO @Results (Item)
SELECT SUBSTRING(@String+@Delimiter, num,
CHARINDEX(@Delimiter, @String+@Delimiter, num) - num)
FROM Numbers
WHERE num <= LEN(REPLACE(@String,' ','|'))
AND SUBSTRING(@Delimiter + @String,
num,
LEN(REPLACE(@delimiter,' ','|'))) = @Delimiter
ORDER BY num RETURN
END
此功能需要数字表,其中基本上只是CREATE TABLE Numbers(Num int)
,并且包含从1到10,000(或更多/更少,取决于需要)的所有数字.如果您的数据库中已经有一个数字表,则可以用该表/列替换现有的表.
This function requires a numbers table, which is basically just CREATE TABLE Numbers(Num int)
and contains all the numbers from 1 to 10,000 (or more/less depending on needs). If you already have a numbers table in your DB you can substitute that table/column for what you already have.
这篇关于SQL Server 2008中列中出现的所有单词的单词计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!