将带有键和注释字段的表转换为列字段中每个单词的键和行 [英] Converting a table with a key and comment field into a key and row for every word in the column field

查看:24
本文介绍了将带有键和注释字段的表转换为列字段中每个单词的键和行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含非结构化数据的表,我正在尝试分析以尝试构建关系查找.我没有使用词云软件.

我真的不知道如何解决这个问题.寻找解决方案让我找到了可以为我完成这项工作的工具,但需要花钱,而不是编码解决方案.

基本上我的数据是这样的:

<前>CK1 CK2 评论--------------------------------------------------------------1 A 这是评论.2 A 这里的另一条评论.

这就是我需要创建的:

<前>CK1 CK2 字--------------------------------------------------------------1 A 这1 A 是1 A1 评论.2 A 另一个2 评论2 A 在这里.

解决方案

您要做的是使用空格作为分隔符来标记字符串.在 SQL 世界中,人们经常将执行此操作的函数称为拆分器".对这种类型的事物使用拆分器的潜在缺陷是如何用多个空格、制表符、CHAR(10)、CHAR(13)、CHAR() 等分隔单词. 语法不佳,例如因为在一段时间后不添加空格会导致:

"句末,下一句"

sentence.Next 作为单词返回.

我喜欢标记人类文本的方式是:

  1. 用空格替换任何不是字符的文本
  2. 替换重复的空格
  3. 修剪字符串
  4. 使用空格作为分隔符拆分新转换的字符串.

下面是我的解决方案,然后是 DDL 来创建所使用的函数.

-- 示例数据DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));插入@yourtable(CK1、CK2、评论)价值观(1,'A','这是一个典型的评论......后面是另一个......'),(2,'A','此注释有双空格和制表符甚至回车返回!');-  解决方案选择 t.CK1、t.CK2、split.itemNumber、split.itemIndex、split.itemLength、split.itemFROM @yourtable AS t交叉应用 samd.patReplace(t.Comment,'[^a-zA-Z ]',' ') AS c1交叉应用 samd.removeDupChar8K(c1.newString,' ') AS c2CROSS APPLY samd.delimitedSplitAB8K(LTRIM(RTRIM(c2.NewString)),' ') AS split;

结果(为简洁起见被截断):

CK1 CK2 itemNumber itemIndex itemLength item----------- ---- -------------------- ----------- ----------- --------------1 A 1 1 4 这个1 A 2 6 2 是1 3 9 1 11 A 4 11 7 典型1 A 5 19 7 评论...2 A 1 1 4 这个2 A 2 6 7 评论2 A 3 14 3 有2 A 4 18 6 双...

请注意,我使用的拆分器基于 Jeff Moden 的 Delimited Split8K 和几个 tweeks.

使用的函数:

创建函数 dbo.rangeAB(@low bigint,@high bigint,@gap bigint,@row1 位)带有架构绑定的返回表作为返回带 L1(N) AS(选择 1发件人(值(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) T(N) -- 90 个值),L2(N) AS(从 L1 a CROSS JOIN L1 b CROSS JOIN L1 c 中选择 1),iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)选择 r.RN、r.OP、r.N1、r.N2从(选择RN = 0,OP = (@high-@low)/@gap,N1 = @低,N2 = @gap+@low哪里@row1 = 0UNION ALL -- 在下面的 TOP 语句中需要 COALESCE 用于错误处理目的选择顶部 (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))RN = i.rn,OP = (@high-@low)/@gap+(2*@row1)-i.rn,N1 = (i.rn-@row1)*@gap+@low,N2 = (i.rn-(@row1-1))*@gap+@low从 iTally AS i按 i.rn 订购) 为其中@high&@low&@gap&@row1 不为空且@high >= @low AND @gap >0;走创建函数 samd.NGrams8k(@string VARCHAR(8000), -- 输入字符串@N INT -- 请求的令牌大小)带有架构绑定的返回表作为返回选择位置 = r.RN,令牌 = SUBSTRING(@string, CHECKSUM(r.RN), @N)从 dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r哪里@N >0 AND @N <= LEN(@string);走创建函数 samd.patReplace8K(@string VARCHAR(8000),@pattern VARCHAR(50),@replace VARCHAR(20))带有架构绑定的返回表作为返回选择新字符串 =(SELECT CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))当@pattern+@replace+@string 不为空时当 PATINDEX(@pattern,token COLLATE Latin1_General_BIN)=0 时的情况THEN ng.token ELSE @replace END ENDFROM samd.NGrams8K(@string, 1) AS ng按位置排序FOR XML PATH(''),TYPE).value('text()[1]', 'VARCHAR(8000)');走创建函数 samd.delimitedSplitAB8K(@string VARCHAR(8000), -- 输入字符串@delimiter CHAR(1) -- 分隔符)带有架构绑定的返回表作为返回选择itemNumber = ROW_NUMBER() OVER (ORDER BY d.p),itemIndex = CHECKSUM(ISNULL(NULLIF(d.p+1, 0),1)),itemLength = CHECKSUM(item.ln),item = SUBSTRING(@string, d.p+1, item.ln)FROM (VALUES (DATALENGTH(@string))) AS l(s) -- 字符串的长度交叉申请(SELECT 0 UNION ALL -- 用于处理前导分隔符选择位置FROM samd.NGrams8K(@string, 1) AS ngWHERE 令牌 = @delimiter) AS d(p) -- delimiter.position交叉应用(值(--LEAD(d.p,1,l.s+l.d)超过(按d.p排序)-(d.p+l.d)ISNULL(NULLIF(CHARINDEX(@delimiter,@string,d.p+1),0)-(d.p+1), l.s-d.p))) AS item(ln);走创建函数 dbo.RemoveDupChar8K(@string varchar(8000), @char char(1))带有架构绑定的返回表作为返回选择新字符串 =替换(替换(替换(替换(替换(替换)(替换(替换)@string 整理 LATIN1_GENERAL_BIN,复制(@char,33),@char),--33复制(@char,17),@char),--17复制(@char,9),@char),-- 9复制(@char,5),@char), -- 5复制(@char,3),@char),-- 3复制(@char,2),@char), -- 2复制(@char,2),@char);-- 2走

I have a table with unstructured data I am trying to analyze to try to build a relational lookup. I do not have use of word cloud software.

I really have no idea how to solve this problem. Searching for solutions has lead me to tools that might do this for me that cost money, not coded solutions.

Basically my data looks like this:

CK1          CK2          Comment
--------------------------------------------------------------
 1            A           This is a comment.
 2            A           Another comment here.

And this is what I need to create:

CK1          CK2          Words
--------------------------------------------------------------
 1            A           This
 1            A           is
 1            A           a
 1            A           comment.
 2            A           Another
 2            A           comment
 2            A           here.

解决方案

What you are trying to do is tokenize a string using a space as a Delimiter. In the SQL world people often refer to functions that do this as a "Splitter". The potential pitfall of using a splitter for this type of thing is how words can be separated by multiple spaces, tabs, CHAR(10)'s, CHAR(13)'s, CHAR()'s, etc. Poor grammar, such as not adding a space after a period results in this:

" End of sentence.Next sentence" 

sentence.Next is returned as a word.

The way I like to tokenize human text is to:

  1. Replace any text that isn't a character with a space
  2. Replace duplicate spaces
  3. Trim the string
  4. Split the newly transformed string using a space as the delimiter.

Below is my solution followed by the DDL to create the functions used.

-- Sample Data
DECLARE @yourtable TABLE (CK1 INT, CK2 CHAR(1), Comment VARCHAR(8000));
INSERT @yourtable (CK1, CK2, Comment)
VALUES
(1,'A','This is a typical comment...Follewed by another...'),
(2,'A','This comment has  double  spaces  and       tabs and even carriage
returns!');

-- Solution
SELECT      t.CK1, t.CK2, split.itemNumber, split.itemIndex, split.itemLength, split.item
FROM        @yourtable                                              AS t
CROSS APPLY samd.patReplace(t.Comment,'[^a-zA-Z ]',' ')             AS c1
CROSS APPLY samd.removeDupChar8K(c1.newString,' ')                  AS c2 
CROSS APPLY samd.delimitedSplitAB8K(LTRIM(RTRIM(c2.NewString)),' ') AS split;

Results (truncated for brevity):

CK1         CK2  itemNumber           itemIndex   itemLength  item
----------- ---- -------------------- ----------- ----------- --------------
1           A    1                    1           4           This
1           A    2                    6           2           is
1           A    3                    9           1           a
1           A    4                    11          7           typical
1           A    5                    19          7           comment
...
2           A    1                    1           4           This
2           A    2                    6           7           comment
2           A    3                    14          3           has
2           A    4                    18          6           double
... 

Note that the splitter I'm using is based of Jeff Moden's Delimited Split8K with a couple tweeks.

Functions used:

CREATE FUNCTION dbo.rangeAB
(
  @low  bigint, 
  @high bigint, 
  @gap  bigint,
  @row1 bit
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS 
(
  SELECT 1
  FROM (VALUES
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0)) T(N) -- 90 values 
),
L2(N)  AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
  SELECT
    RN = 0,
    OP = (@high-@low)/@gap,
    N1 = @low,
    N2 = @gap+@low
  WHERE @row1 = 0
  UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
  SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
    RN = i.rn,
    OP = (@high-@low)/@gap+(2*@row1)-i.rn,
    N1 = (i.rn-@row1)*@gap+@low,
    N2 = (i.rn-(@row1-1))*@gap+@low
  FROM iTally AS i
  ORDER BY i.rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;
    GO

CREATE FUNCTION samd.NGrams8k
(
  @string VARCHAR(8000), -- Input string
  @N      INT            -- requested token size
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  position   = r.RN,
  token      = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM  dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO

    CREATE FUNCTION samd.patReplace8K
(
  @string  VARCHAR(8000),
  @pattern VARCHAR(50),
  @replace VARCHAR(20)
) 
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString = 
  (
    SELECT   CASE WHEN @string = CAST('' AS VARCHAR(8000)) THEN CAST('' AS VARCHAR(8000))
                  WHEN @pattern+@replace+@string IS NOT NULL THEN 
                    CASE WHEN PATINDEX(@pattern,token COLLATE Latin1_General_BIN)=0
                         THEN ng.token ELSE @replace END END
    FROM     samd.NGrams8K(@string, 1) AS ng
    ORDER BY ng.position
    FOR XML PATH(''),TYPE
  ).value('text()[1]', 'VARCHAR(8000)');
GO

    CREATE FUNCTION samd.delimitedSplitAB8K
(
  @string    VARCHAR(8000), -- input string
  @delimiter CHAR(1)        -- delimiter
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  itemNumber   = ROW_NUMBER() OVER (ORDER BY d.p),
  itemIndex    = CHECKSUM(ISNULL(NULLIF(d.p+1, 0),1)),
  itemLength   = CHECKSUM(item.ln),
  item         = SUBSTRING(@string, d.p+1, item.ln)
FROM (VALUES (DATALENGTH(@string))) AS l(s) -- length of the string
CROSS APPLY
(
  SELECT 0 UNION ALL -- for handling leading delimiters
  SELECT ng.position
  FROM   samd.NGrams8K(@string, 1) AS ng
  WHERE  token = @delimiter
) AS d(p) -- delimiter.position
CROSS APPLY (VALUES(  --LEAD(d.p, 1, l.s+l.d) OVER (ORDER BY d.p) - (d.p+l.d)
  ISNULL(NULLIF(CHARINDEX(@delimiter,@string,d.p+1),0)-(d.p+1), l.s-d.p))) AS item(ln);
GO

CREATE FUNCTION dbo.RemoveDupChar8K(@string varchar(8000), @char char(1))
RETURNS TABLE WITH SCHEMABINDING AS RETURN

SELECT NewString = 
 replace(replace(replace(replace(replace(replace(replace(
 @string COLLATE LATIN1_GENERAL_BIN,
 replicate(@char,33), @char), --33
 replicate(@char,17), @char), --17
 replicate(@char,9 ), @char), -- 9
 replicate(@char,5 ), @char), -- 5
 replicate(@char,3 ), @char), -- 3 
 replicate(@char,2 ), @char), -- 2
 replicate(@char,2 ), @char); -- 2
GO

这篇关于将带有键和注释字段的表转换为列字段中每个单词的键和行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆