Decode Unicode escape sequences to local language in BigQuery
Question
We receive survey webhook data in BigQuery. Comments written in a local language are captured as Unicode escape sequences, and the comments can also contain special characters.
Example
- Survey comment: "别老是晚点，现场补行李费太贵"
- Comment as stored in BigQuery: "\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35"
We found a solution that decodes an individual comment:
CREATE TEMPORARY FUNCTION utf8convert(s STRING)
RETURNS STRING
LANGUAGE js AS """
return unescape( ( s ) );
""";
WITH sample AS (SELECT '\u522b\u8001\u662f\u665a' AS s)
SELECT utf8convert(s) FROM sample
When we apply this code to the Comment field, which holds thousands of comments in different languages, it does not work.
CREATE TEMPORARY FUNCTION utf8convert(s STRING)
RETURNS STRING
LANGUAGE js AS """
return unescape( ( s ) );
""";
SELECT Comment, utf8convert(Comment) as Convert
FROM `airasia-nps.nps_production.NPSDashboard_Webhook_Data1`
where Comment is not null
The query runs without error, but the Unicode escapes are not converted to the local language. The result column still shows the escape sequences.
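A plausible explanation, offered as a sketch rather than a confirmed diagnosis: JavaScript's `unescape` only decodes `%XX` and `%uXXXX` sequences, so a stored value containing a literal backslash followed by `u` passes through unchanged. The single-literal test may appear to work only because BigQuery itself interprets `\uXXXX` inside a non-raw string literal before the UDF ever sees it. The helper below, `decodeUnicodeEscapes`, is a hypothetical name and not part of the original post; it decodes `\uXXXX` sequences with a regular expression and leaves all other characters alone.

```javascript
// Hypothetical helper (not from the original post): decode literal \uXXXX
// sequences and leave every other character untouched.
function decodeUnicodeEscapes(s) {
  if (s === null || s === undefined) return s;
  // Each match is a backslash, 'u', and exactly four hex digits.
  return s.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

// In JavaScript source, '\\u522b' is the six characters \ u 5 2 2 b,
// which is how the value is assumed to be stored in the table.
const stored = '\\u522b\\u8001\\u662f\\u665a';

// unescape() returns the input unchanged because there is no % sequence in it.
console.log(unescape(stored) === stored);

// The regex-based decoder converts the escapes; prefixes such as "$-" survive.
console.log(decodeUnicodeEscapes(stored));
console.log(decodeUnicodeEscapes('$-\\u6599\\u91d1 abcd'));
```

If this body is pasted into a `LANGUAGE js` UDF, writing the body as a raw string (`AS r""" ... """`) should keep the backslashes in the regular expression intact.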
I have also tried this code:
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
  IF(s NOT LIKE '%\\u%', s,
     (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
      WHERE x != ''))
);
SELECT
  original,
  DecodeUnicode(original) AS decoded
FROM (
  SELECT TRIM(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
  SELECT TRIM(r'abcd')
);
This raises an error. I think it is because the comment starts with a special character?
Recommended answer
See if this works. It does the "manual" decoding for strings that have \u in them by converting the hex values to Unicode code points and then to a string. It should be faster than using JavaScript, too.
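CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
  -- Strings without a \u escape are returned unchanged.
  IF(s NOT LIKE '%\\u%', s,
     -- Split on \u, read each piece as a hex code point, rebuild the string.
     (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
      WHERE x != ''))
);
SELECT
  original,
  DecodeUnicode(original) AS decoded
FROM (
  -- Raw strings keep the backslashes literal, as they would be in stored data.
  SELECT r'\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35' AS original UNION ALL
  SELECT r'abcd'
);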
As output, this returns 别老是晚点，现场补行李费太贵 and abcd.
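To make the steps of the SQL function easier to follow, here is a rough JavaScript emulation of the same split-and-convert logic. It is only an illustration: `decodeBySplit` is a hypothetical name, and the explicit hex check stands in for the point where the SQL `CAST` to INT64 would presumably fail. It also suggests why a comment such as the `$-` example from the question errors out: the text before the first escape becomes a fragment that is not a hex number.

```javascript
// Illustrative emulation of the SQL logic above; decodeBySplit is a hypothetical name.
function decodeBySplit(s) {
  // Mirrors the intent of: IF(s NOT LIKE '%\\u%', s, ...)
  if (!s.includes('\\u')) return s;
  const codePoints = s
    .split('\\u')                 // SPLIT(s, '\\u')
    .filter((x) => x !== '')      // WHERE x != ''
    .map((x) => {
      // Stand-in for CAST(CONCAT('0x', x) AS INT64), which rejects non-hex text.
      if (!/^[0-9a-fA-F]+$/.test(x)) {
        throw new Error(`not a hex value: ${x}`);
      }
      return parseInt(x, 16);
    });
  // CODE_POINTS_TO_STRING(ARRAY_AGG(...))
  return String.fromCodePoint(...codePoints);
}

// A string made only of escapes decodes cleanly.
console.log(decodeBySplit('\\u522b\\u8001'));

// A prefix before the first escape yields the fragment "$-", which is not hex.
try {
  decodeBySplit('$-\\u6599\\u91d1');
} catch (e) {
  console.log(e.message);
}
```

Under this reading, the split approach assumes every comment consists entirely of `\uXXXX` escapes. Comments that mix plain text with escapes would need a decoder that replaces each escape in place instead.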