将带有键和注释字段的表转换为列字段中每个单词的键和行 [英] Converting a table with a key and comment field into a key and row for every word in the column field
问题描述
我有一个包含非结构化数据的表,我正在尝试分析以尝试构建关系查找.我没有使用词云软件.
我真的不知道如何解决这个问题.寻找解决方案让我找到了可以为我完成这项工作的工具,但需要花钱,而不是编码解决方案.
基本上我的数据是这样的:
<前>CK1 CK2 评论--------------------------------------------------------------1 A 这是评论.2 A 这里的另一条评论.这就是我需要创建的:
<前>CK1 CK2 字--------------------------------------------------------------1 A 这1 A 是1 A1 评论.2 A 另一个2 评论2 A 在这里.您要做的是使用空格作为分隔符来标记字符串.在 SQL 世界中,人们经常将执行此操作的函数称为拆分器".对这种类型的事物使用拆分器的潜在缺陷是如何用多个空格、制表符、CHAR(10)、CHAR(13)、CHAR() 等分隔单词. 语法不佳,例如因为在一段时间后不添加空格会导致:
"句末,下一句"
sentence.Next 作为单词返回.
我喜欢标记人类文本的方式是:
- 用空格替换任何不是字符的文本
- 替换重复的空格
- 修剪字符串
- 使用空格作为分隔符拆分新转换的字符串.
下面是我的解决方案,然后是 DDL 来创建所使用的函数.
-- 示例数据DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));插入@yourtable(CK1、CK2、评论)价值观(1,'A','这是一个典型的评论......后面是另一个......'),(2,'A','此注释有双空格和制表符甚至回车返回!');- 解决方案选择 t.CK1、t.CK2、split.itemNumber、split.itemIndex、split.itemLength、split.itemFROM @yourtable AS t交叉应用 samd.patReplace(t.Comment,'[^a-zA-Z ]',' ') AS c1交叉应用 samd.removeDupChar8K(c1.newString,' ') AS c2CROSS APPLY samd.delimitedSplitAB8K(LTRIM(RTRIM(c2.NewString)),' ') AS split;
结果(为简洁起见被截断):
CK1 CK2 itemNumber itemIndex itemLength item----------- ---- -------------------- ----------- ----------- --------------1 A 1 1 4 这个1 A 2 6 2 是1 3 9 1 11 A 4 11 7 典型1 A 5 19 7 评论...2 A 1 1 4 这个2 A 2 6 7 评论2 A 3 14 3 有2 A 4 18 6 双...
请注意,我使用的拆分器基于 Jeff Moden 的 Delimited Split8K 和几个 tweeks.
使用的函数:
创建函数 dbo.rangeAB(@low bigint,@high bigint,@gap bigint,@row1 位)带有架构绑定的返回表作为返回带 L1(N) AS(选择 1发件人(值(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) T(N) -- 90 个值),L2(N) AS(从 L1 a CROSS JOIN L1 b CROSS JOIN L1 c 中选择 1),iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)选择 r.RN、r.OP、r.N1、r.N2从(选择RN = 0,OP = (@high-@low)/@gap,N1 = @低,N2 = @gap+@low哪里@row1 = 0UNION ALL -- 在下面的 TOP 语句中需要 COALESCE 用于错误处理目的选择顶部 (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))RN = i.rn,OP = (@high-@low)/@gap+(2*@row1)-i.rn,N1 = (i.rn-@row1)*@gap+@low,N2 = (i.rn-(@row1-1))*@gap+@low从 iTally AS i按 i.rn 订购) 为其中@high&@low&@gap&@row1 不为空且@high >= @low AND @gap >0;走创建函数 samd.NGrams8k(@string VARCHAR(8000), -- 输入字符串@N INT -- 请求的令牌大小)带有架构绑定的返回表作为返回选择位置 = r.RN,令牌 = SUBSTRING(@string, CHECKSUM(r.RN), @N)从 dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r哪里@N >0 AND @N <= LEN(@string);走创建函数 samd.patReplace8K(@string VARCHAR(8000),@pattern VARCHAR(50),@replace VARCHAR(20))带有架构绑定的返回表作为返回选择新字符串 =(SELECT CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))当@pattern+@replace+@string 不为空时当 PATINDEX(@pattern,token COLLATE Latin1_General_BIN)=0 时的情况THEN ng.token ELSE @replace END ENDFROM samd.NGrams8K(@string, 1) AS ng按位置排序FOR XML PATH(''),TYPE).value('text()[1]', 'VARCHAR(8000)');走创建函数 samd.delimitedSplitAB8K(@string VARCHAR(8000), -- 输入字符串@delimiter CHAR(1) -- 分隔符)带有架构绑定的返回表作为返回选择itemNumber = ROW_NUMBER() OVER (ORDER BY d.p),itemIndex = CHECKSUM(ISNULL(NULLIF(d.p+1, 0),1)),itemLength = CHECKSUM(item.ln),item = SUBSTRING(@string, d.p+1, item.ln)FROM (VALUES (DATALENGTH(@string))) AS l(s) -- 字符串的长度交叉申请(SELECT 0 UNION ALL -- 用于处理前导分隔符选择位置FROM samd.NGrams8K(@string, 1) AS ngWHERE 令牌 = @delimiter) AS d(p) -- delimiter.position交叉应用(值(--LEAD(d.p,1,l.s+l.d)超过(按d.p排序)-(d.p+l.d)ISNULL(NULLIF(CHARINDEX(@delimiter,@string,d.p+1),0)-(d.p+1), l.s-d.p))) AS item(ln);走创建函数 dbo.RemoveDupChar8K(@string varchar(8000), @char char(1))带有架构绑定的返回表作为返回选择新字符串 =替换(替换(替换(替换(替换(替换)(替换(替换)@string 整理 LATIN1_GENERAL_BIN,复制(@char,33),@char),--33复制(@char,17),@char),--17复制(@char,9),@char),-- 9复制(@char,5),@char), -- 5复制(@char,3),@char),-- 3复制(@char,2),@char), -- 2复制(@char,2),@char);-- 2走
I have a table with unstructured data I am trying to analyze to try to build a relational lookup. I do not have use of word cloud software.
I really have no idea how to solve this problem. Searching for solutions has lead me to tools that might do this for me that cost money, not coded solutions.
Basically my data looks like this:
CK1 CK2 Comment -------------------------------------------------------------- 1 A This is a comment. 2 A Another comment here.
And this is what I need to create:
CK1 CK2 Words -------------------------------------------------------------- 1 A This 1 A is 1 A a 1 A comment. 2 A Another 2 A comment 2 A here.
What you are trying to do is tokenize a string using a space as a Delimiter. In the SQL world people often refer to functions that do this as a "Splitter". The potential pitfall of using a splitter for this type of thing is how words can be separated by multiple spaces, tabs, CHAR(10)'s, CHAR(13)'s, CHAR()'s, etc. Poor grammar, such as not adding a space after a period results in this:
" End of sentence.Next sentence"
sentence.Next is returned as a word.
The way I like to tokenize human text is to:
- Replace any text that isn't a character with a space
- Replace duplicate spaces
- Trim the string
- Split the newly transformed string using a space as the delimiter.
Below is my solution followed by the DDL to create the functions used.
-- Sample Data
DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));
INSERT @yourtable (CK1, CK2, Comment)
VALUES
(1,'A','This is a typical comment...Follewed by another...'),
(2,'A','This comment has double spaces and tabs and even carriage
returns!');
-- Solution
SELECT t.CK1, t.CK2, split.itemNumber, split.itemIndex, split.itemLength, split.item
FROM @yourtable AS t
CROSS APPLY samd.patReplace(t.Comment,'[^a-zA-Z ]',' ') AS c1
CROSS APPLY samd.removeDupChar8K(c1.newString,' ') AS c2
CROSS APPLY samd.delimitedSplitAB8K(LTRIM(RTRIM(c2.NewString)),' ') AS split;
Results (truncated for brevity):
CK1 CK2 itemNumber itemIndex itemLength item
----------- ---- -------------------- ----------- ----------- --------------
1 A 1 1 4 This
1 A 2 6 2 is
1 A 3 9 1 a
1 A 4 11 7 typical
1 A 5 19 7 comment
...
2 A 1 1 4 This
2 A 2 6 7 comment
2 A 3 14 3 has
2 A 4 18 6 double
...
Note that the splitter I'm using is based of Jeff Moden's Delimited Split8K with a couple tweeks.
Functions used:
CREATE FUNCTION dbo.rangeAB
(
@low bigint,
@high bigint,
@gap bigint,
@row1 bit
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS
(
SELECT 1
FROM (VALUES
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0)) T(N) -- 90 values
),
L2(N) AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
SELECT
RN = 0,
OP = (@high-@low)/@gap,
N1 = @low,
N2 = @gap+@low
WHERE @row1 = 0
UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
RN = i.rn,
OP = (@high-@low)/@gap+(2*@row1)-i.rn,
N1 = (i.rn-@row1)*@gap+@low,
N2 = (i.rn-(@row1-1))*@gap+@low
FROM iTally AS i
ORDER BY i.rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;
GO
CREATE FUNCTION samd.NGrams8k
(
@string VARCHAR(8000), -- Input string
@N INT -- requested token size
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
position = r.RN,
token = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO
CREATE FUNCTION samd.patReplace8K
(
@string VARCHAR(8000),
@pattern VARCHAR(50),
@replace VARCHAR(20)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))
WHEN @pattern+@replace+@string IS NOT NULL THEN
CASE WHEN PATINDEX(@pattern,token COLLATE Latin1_General_BIN)=0
THEN ng.token ELSE @replace END END
FROM samd.NGrams8K(@string, 1) AS ng
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('text()[1]', 'VARCHAR(8000)');
GO
CREATE FUNCTION samd.delimitedSplitAB8K
(
@string VARCHAR(8000), -- input string
@delimiter CHAR(1) -- delimiter
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
itemNumber = ROW_NUMBER() OVER (ORDER BY d.p),
itemIndex = CHECKSUM(ISNULL(NULLIF(d.p+1, 0),1)),
itemLength = CHECKSUM(item.ln),
item = SUBSTRING(@string, d.p+1, item.ln)
FROM (VALUES (DATALENGTH(@string))) AS l(s) -- length of the string
CROSS APPLY
(
SELECT 0 UNION ALL -- for handling leading delimiters
SELECT ng.position
FROM samd.NGrams8K(@string, 1) AS ng
WHERE token = @delimiter
) AS d(p) -- delimiter.position
CROSS APPLY (VALUES( --LEAD(d.p, 1, l.s+l.d) OVER (ORDER BY d.p) - (d.p+l.d)
ISNULL(NULLIF(CHARINDEX(@delimiter,@string,d.p+1),0)-(d.p+1), l.s-d.p))) AS item(ln);
GO
CREATE FUNCTION dbo.RemoveDupChar8K(@string varchar(8000), @char char(1))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT NewString =
replace(replace(replace(replace(replace(replace(replace(
@string COLLATE LATIN1_GENERAL_BIN,
replicate(@char,33), @char), --33
replicate(@char,17), @char), --17
replicate(@char,9 ), @char), -- 9
replicate(@char,5 ), @char), -- 5
replicate(@char,3 ), @char), -- 3
replicate(@char,2 ), @char), -- 2
replicate(@char,2 ), @char); -- 2
GO
这篇关于将带有键和注释字段的表转换为列字段中每个单词的键和行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!