REGEX for special character (Decode Unicode)
Problem description
I tried the code below to decode Unicode escape sequences. It works fine in general, but if the comment starts with a special character, symbol, space, etc., it raises an error.
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
IF(s NOT LIKE '%\\u%', s,
(SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''))
);
SELECT
original,
DecodeUnicode(original) AS decoded
FROM (
SELECT trim(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
SELECT trim(r'abcd')
);
Solution
The following is for BigQuery Standard SQL.
#standardSQL
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
(SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''
)
);
WITH `yourTable` AS (
SELECT r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01' AS original UNION ALL
SELECT r'abcd'
), uchars AS (
SELECT DISTINCT
c,
DecodeUnicode(c) uchar
FROM `yourTable`,
UNNEST(REGEXP_EXTRACT_ALL(original, r'(\\u[abcdef0-9]{4})')) c
)
SELECT
original,
STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) decoded
FROM (
SELECT
original,
pos,
SUBSTR(original,
SUM(CASE char WHEN '' THEN 1 ELSE 6 END)
OVER(PARTITION BY original ORDER BY pos) - CASE char WHEN '' THEN 0 ELSE 5 END,
CASE char WHEN '' THEN 1 ELSE 6 END) x,
uchar
FROM `yourTable`,
UNNEST(REGEXP_EXTRACT_ALL(original, r'(\\u[abcdef0-9]{4})|.')) char WITH OFFSET AS pos
LEFT JOIN uchars u ON u.c = char
)
GROUP BY original
-- ORDER BY original
What it does: it extracts all \uXXXX escape sequences, decodes them, and substitutes the decoded characters back into the original string, leaving every other character as it is. The output is as follows:
original decoded
$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01 $-料金が高すぎ!!!
abcd abcd
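For context, the error in the question appears to come from splitting the whole string on \u: the leading $- survives as its own piece, and '0x$-' cannot be cast to INT64. The sketch below is only an illustration, not part of the answer's query. It uses a shortened sample literal and SAFE_CAST (which returns NULL instead of raising) to make the bad piece visible, then shows that extracting the \uXXXX tokens first leaves only valid hex to decode. The aliases piece, token, and code_point are names chosen for this example.

```sql
#standardSQL
-- Illustration only: the literal is a shortened version of the sample string.

-- 1) Splitting the whole string on '\u' keeps the leading '$-' as a piece.
--    '0x$-' is not a valid hex literal, so a plain CAST would raise an error;
--    SAFE_CAST returns NULL for that piece instead.
SELECT
  pos,
  piece,
  SAFE_CAST(CONCAT('0x', piece) AS INT64) AS code_point
FROM UNNEST(SPLIT(r'$-\u6599\u91d1', '\\u')) AS piece WITH OFFSET AS pos
ORDER BY pos;

-- 2) Extracting only the \uXXXX tokens first means every piece is valid hex.
--    SUBSTR(token, 3) drops the leading backslash and 'u'.
SELECT
  token,
  CODE_POINTS_TO_STRING([CAST(CONCAT('0x', SUBSTR(token, 3)) AS INT64)]) AS decoded
FROM UNNEST(REGEXP_EXTRACT_ALL(r'$-\u6599\u91d1', r'\\u[abcdef0-9]{4}')) AS token;
```

This is why the DecodeUnicode function in the answer can drop the IF guard: it is only ever called on tokens that already matched the \uXXXX pattern.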