Decode Unicode escapes to the local language in BigQuery
Problem description
We receive survey webhook data in BigQuery. Comments written in local languages are captured as Unicode escape sequences, and some comments also contain special characters.
Example
- Comment in the survey: "别老是晚点,现场补行李费太贵"
- Comment in the BigQuery data: "\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35"
We found a solution that decodes an individual comment:
CREATE TEMPORARY FUNCTION utf8convert(s STRING)
RETURNS STRING
LANGUAGE js AS """
return unescape( ( s ) );
""";
WITH sample AS (SELECT '\u522b\u8001\u662f\u665a' AS s)
SELECT utf8convert(s) FROM sample;
When we apply this code to the comment field, which holds thousands of comments in different languages, it does not work.
CREATE TEMPORARY FUNCTION utf8convert(s STRING)
RETURNS STRING
LANGUAGE js AS """
return unescape( ( s ) );
""";
SELECT Comment, utf8convert(Comment) AS Convert
FROM `airasia-nps.nps_production.NPSDashboard_Webhook_Data1`
WHERE Comment IS NOT NULL;
The query runs without error, but the result is unchanged: the comments still appear as Unicode escape sequences instead of the local language.
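One likely explanation, offered as a hedged note rather than a confirmed diagnosis of this table: JavaScript's legacy `unescape` decodes percent-style sequences such as `%u522B`, not backslash-style `\u522b` text. BigQuery also decodes `\u` escapes itself in a regular quoted string literal, so the single-comment test passes characters that are already decoded. The sketch below assumes the stored comments contain a literal backslash followed by `u`; the variable names are illustrative.

```javascript
// Assumption: the stored value holds a literal backslash + "u" + 4 hex digits.
const stored = String.raw`\u522b\u8001`;

// Legacy unescape() only understands %XX and %uXXXX, so the text comes back unchanged.
const viaUnescape = unescape(stored);
console.log(viaUnescape === stored); // true

// The percent-style spelling of the same two characters is decoded.
const percentStyle = unescape("%u522B%u8001");
console.log(percentStyle === "\u522b\u8001"); // true
```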
I also tried this code:
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
IF(s NOT LIKE '%\\u%', s,
(SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''))
);
SELECT
original,
DecodeUnicode(original) AS decoded
FROM (
SELECT TRIM(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
SELECT TRIM(r'abcd')
);
It raises an error. I think this is because the comment starts with special characters?
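A plausible reading of the error, sketched in JavaScript as an analogy to the SQL above rather than as BigQuery's actual execution: splitting `$-\u6599...` on `\u` yields `$-` as the first piece, and `0x$-` is not a valid hexadecimal number, so the `CAST` to `INT64` fails. The function name `splitPieces` is hypothetical.

```javascript
// Mimics SPLIT(s, '\\u') plus the WHERE x != '' filter from the SQL function.
function splitPieces(s) {
  return s.split("\\u").filter((x) => x !== "");
}

const pieces = splitPieces(String.raw`$-\u6599\u91d1`);
console.log(pieces); // [ '$-', '6599', '91d1' ]

// Pieces that are not pure hex are the ones the SQL CAST would reject.
const notHex = pieces.filter((x) => !/^[0-9a-fA-F]+$/.test(x));
console.log(notHex); // [ '$-' ]
```

The same reasoning suggests that any comment mixing plain text with escapes would fail, not only comments that start with a special character.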
See if this works. It does the "manual" decoding for strings that have \u in them by converting to Unicode code points and then to a string. It should be faster than using JavaScript, too.
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
IF(s NOT LIKE '%\\u%', s,
(SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''))
);
SELECT
original,
DecodeUnicode(original) AS decoded
FROM (
SELECT r'\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35' AS original UNION ALL
SELECT r'abcd'
);
As output, this returns 别老是晚点,现场补行李费太贵 and abcd.
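For comments that mix plain characters with escapes, such as the `$-` example in the question, one option is to replace each `\uXXXX` match in place instead of splitting the whole string. The following is a minimal sketch, not the answer author's method; `decodeUnicodeEscapes` is a hypothetical name, and it assumes every escape has exactly four hex digits.

```javascript
// Hypothetical helper: replaces every \uXXXX escape and leaves other text untouched.
function decodeUnicodeEscapes(s) {
  if (s === null || s === undefined) return s;
  // fromCharCode works on UTF-16 code units, so escaped surrogate pairs recombine.
  return s.replace(/\\u([0-9a-fA-F]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const mixed = decodeUnicodeEscapes(
  String.raw`$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01`
);
console.log(mixed);
console.log(decodeUnicodeEscapes("abcd")); // abcd
```

Inside a BigQuery `LANGUAGE js` function like the one in the question, the body would reduce to the `return s.replace(...)` statement. This gives up the speed advantage the answer attributes to the pure SQL version.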