Decode Unicode escapes to local language in BigQuery


Problem description


We receive survey webhook data in BigQuery. Comments written in the respondent's local language are captured as Unicode escape sequences, and some comments also contain special characters.

  • Example

    • Comment in survey: "别老是晚点,现场补行李费太贵"
    • Comment in BigQuery data: "\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35"

We found a solution that decodes an individual comment:

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
    with sample AS (SELECT '\u522b\u8001\u662f\u665a' AS S)
    SELECT utf8convert(s) from sample

When we apply this code to the Comment field, which holds thousands of comments in different languages, it does not work.

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
    SELECT Comment, utf8convert(Comment) AS Convert
    FROM `airasia-nps.nps_production.NPSDashboard_Webhook_Data1`
    WHERE Comment IS NOT NULL

The query runs without error, but the result is unchanged: the comments still appear as Unicode escape sequences instead of the local language.
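A likely explanation, stated here as an assumption because the thread does not confirm it: in the single-comment test, the non-raw SQL literal `'\u522b...'` is already decoded by BigQuery's string-literal parser, so `unescape` receives real characters and returns them as they are. In the table, the backslash sequences are literal text, and JavaScript's `unescape` only understands the `%XX` and `%uXXXX` forms, so it leaves `\uXXXX` alone. The sketch below reproduces both situations in plain JavaScript; the variable names are illustrative only.

```javascript
// Literal backslash-u text, as assumed to be stored in the Comment column.
const stored = String.raw`\u522b\u8001\u662f\u665a`;

// The same characters already decoded, as a non-raw SQL literal would deliver them.
const alreadyDecoded = '\u522b\u8001\u662f\u665a';

// unescape() only understands %XX and %uXXXX, so both inputs come back unchanged.
const fromStored = unescape(stored);
const fromDecoded = unescape(alreadyDecoded);

console.log(fromStored === stored);          // escaped text stays escaped
console.log(fromDecoded === alreadyDecoded); // decoded text stays decoded
```

Both comparisons print `true`, which matches the behaviour described: the sample only looked decoded because it never contained escape sequences by the time the function saw it.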

  • I have also tried this code:

      CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
      IF(s NOT LIKE '%\\u%', s,
      (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
       WHERE x != ''))
      );
    
      SELECT
      original,
      DecodeUnicode(original) AS decoded
      FROM (
      SELECT trim(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
      SELECT trim(r'abcd')
      );
    

This raises an error. I think it is because the comment starts with a special character?

Solution

See if this works. It does the "manual" decoding for strings that have \u in them by converting to Unicode code points and then to a string. It should be faster than using JavaScript, too.

    CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
      IF(s NOT LIKE '%\\u%', s,
         (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
          FROM UNNEST(SPLIT(s, '\\u')) AS x
          WHERE x != ''))
    );

    SELECT
      original,
      DecodeUnicode(original) AS decoded
    FROM (
      SELECT r'\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35' AS original UNION ALL
      SELECT r'abcd'
    );

As output, this returns 别老是晚点,现场补行李费太贵 and abcd.
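The splitting approach assumes that a string containing `\u` consists only of `\uXXXX` escapes. A value such as `$-\u6599\u91d1...` from the question produces a first fragment `$-`, and `'0x$-'` cannot be cast to INT64; that appears to be the cause of the reported error, though this is an inference. A minimal sketch of an alternative, written as plain JavaScript, replaces only the escape sequences and leaves all other text untouched. The name `decodeUnicodeEscapes` is hypothetical, and wiring it into a `LANGUAGE js` UDF is not shown or tested here.

```javascript
// Hypothetical helper: decodes every \uXXXX escape and leaves other text as is.
function decodeUnicodeEscapes(s) {
  // Pass NULL-like inputs through unchanged.
  if (s === null || s === undefined) return s;
  // Each match is one UTF-16 code unit. Adjacent surrogate halves end up
  // next to each other in the result, so they still form a valid pair.
  return s.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Input taken from the question: ordinary characters followed by escapes.
const mixed = decodeUnicodeEscapes(
  String.raw`$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01`);

// Text without escapes is returned unchanged.
const plain = decodeUnicodeEscapes('abcd');

console.log(mixed);
console.log(plain);
```

The regular expression requires exactly four hexadecimal digits after `\u`, so a stray `\u` followed by other text is left in place rather than raising an error.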
