Efficient Cleaning of Strings in a Table

Problem description

I'm currently working on a problem where certain characters need to be removed from strings stored in a table. Normally I'd do a simple UPDATE with a REPLACE, but in this case there are 32 different characters that need to be removed.

I've looked around and can't find any good solutions for quickly cleaning strings that already exist in a table.

Things I've looked into:

  1. Doing a series of nested replaces

This solution is doable, but for 32 different replacements it would require either some ugly code or hacky dynamic SQL to build a huge series of nested REPLACE calls.

  2. PATINDEX and WHILE loops

As seen in this answer, it is possible to mimic a kind of regex replace, but I'm working with a lot of data, so I'm hesitant to trust even the improved solution to run in a reasonable amount of time when the data volume is large.

  3. Recursive CTEs

I tried a CTE approach to this problem, but it didn't run terribly fast once the number of rows got large.

For reference:

CREATE TABLE #BadChar(
    id int IDENTITY(1,1),
    badString nvarchar(10),
    replaceString nvarchar(10)

);

INSERT INTO #BadChar(badString, replaceString) SELECT 'A', '^';
INSERT INTO #BadChar(badString, replaceString) SELECT 'B', '}';
INSERT INTO #BadChar(badString, replaceString) SELECT 's', '5';
INSERT INTO #BadChar(badString, replaceString) SELECT '-', ' ';

CREATE TABLE #CleanMe(
    clean_id int IDENTITY(1,1),
    DirtyString nvarchar(20)
);

DECLARE @i int;
SET @i = 0;
WHILE @i < 100000 BEGIN
    INSERT INTO #CleanMe(DirtyString) SELECT 'AAAAA';
    INSERT INTO #CleanMe(DirtyString) SELECT 'BBBBB';
    INSERT INTO #CleanMe(DirtyString) SELECT 'AB-String-BA';
    SET @i = @i + 1
END;


WITH FixedString (Step, String, cid) AS (
    SELECT 1 AS Step, REPLACE(DirtyString, badString, replaceString), clean_id
    FROM #BadChar, #CleanMe
    WHERE id = 1

    UNION ALL

    SELECT Step + 1, REPLACE(String, badString, replaceString), cid
    FROM FixedString AS T1
    JOIN #BadChar AS T2 ON T1.Step + 1 = T2.id
    JOIN #CleanMe AS T3 ON T1.cid = T3.clean_id
)
SELECT String FROM FixedString WHERE Step = (SELECT MAX(Step) FROM FixedString);

DROP TABLE #BadChar;
DROP TABLE #CleanMe;

  4. Using the CLR

This seems to be a common solution that many people use, but the environment I'm in doesn't make it an easy one to embark on.

Are there any other ways to go about this that I've overlooked? Or any improvements to the methods I've already looked into?

Recommended answer

Leveraging the idea from Alan Burstein's solution, you could do something like this if you wanted to hard-code the bad/replace strings. This would also work for bad/replace strings longer than a single character.

CREATE FUNCTION [dbo].[CleanStringV1]
(
  @String nvarchar(4000)
)
RETURNS nvarchar(4000) WITH SCHEMABINDING AS
BEGIN
  -- Each row of the VALUES list reassigns @String, applying one replacement per row.
  -- The binary collation makes the match case sensitive.
  SELECT @String = REPLACE
  (
    @String COLLATE Latin1_General_BIN,
    badString,
    replaceString
  )
  FROM
  (VALUES
      ('A', '^')
    , ('B', '}')
    , ('s', '5')
    , ('-', ' ')
  ) t(badString, replaceString);
  RETURN @String;
END;

Or, if you have a table containing the bad/replace strings, then:

CREATE FUNCTION [dbo].[CleanStringV2]
(
  @String nvarchar(4000)
)
RETURNS nvarchar(4000) AS
BEGIN
  -- Same pattern as V1, but the bad/replace pairs come from a permanent table.
  SELECT @String = REPLACE
  (
    @String COLLATE Latin1_General_BIN,
    badString,
    replaceString
  )
  FROM BadChar;
  RETURN @String;
END;

These are case sensitive. You can remove the COLLATE clause if you want case-insensitive matching (assuming the default collation in effect is case insensitive). I did a few small tests, and these were not much slower than nested REPLACE. The first one, with the hardcoded strings, was the faster of the two and was nearly as fast as nested REPLACE.
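
As a rough illustration of how the comparison above might be set up, the sketch below applies CleanStringV1 in an UPDATE and shows the nested REPLACE version it was measured against. The table dbo.CleanMe is hypothetical: it is assumed to be a permanent copy of the question's #CleanMe table with the same DirtyString column, and only the four sample replacements are used. Run one option or the other, not both.

```sql
-- Assumption: dbo.CleanMe is a hypothetical permanent table
-- with the same DirtyString nvarchar(20) column as #CleanMe.

-- Option A: clean every row with the scalar function defined above.
UPDATE dbo.CleanMe
SET DirtyString = dbo.CleanStringV1(DirtyString);

-- Option B: the nested REPLACE baseline, applying the same four
-- replacements in the same order, with the same binary collation.
UPDATE dbo.CleanMe
SET DirtyString =
    REPLACE(REPLACE(REPLACE(REPLACE(
        DirtyString COLLATE Latin1_General_BIN,
        'A', '^'),
        'B', '}'),
        's', '5'),
        '-', ' ');
```

With 32 replacements, Option B grows to 32 nested calls, which is the readability cost the question describes. The function keeps the list of pairs in one place at the cost of a scalar function call per row.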
