Decode Unicode escapes to local language in BigQuery


Question

We receive survey webhook data in BigQuery. Comments written in a local language are captured as Unicode escape sequences, and some comments also contain special characters.

  • Example

  • Survey comment: "别老是晚点，现场补行李费太贵"
  • Comment as stored in BigQuery: "\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35"

We found a solution that decodes an individual comment:

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape(s);
    """;
    WITH sample AS (SELECT '\u522b\u8001\u662f\u665a' AS s)
    SELECT utf8convert(s) FROM sample

When we apply this code to the comment field, which holds thousands of comments in different languages, it does not work.

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape(s);
    """;
    SELECT Comment, utf8convert(Comment) AS Convert
    FROM `airasia-nps.nps_production.NPSDashboard_Webhook_Data1`
    WHERE Comment IS NOT NULL

The query runs without errors, but the result is unchanged: the comments still appear as Unicode escapes instead of the local language.
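
One possible explanation, assuming the stored comments contain a literal backslash followed by `uXXXX`: JavaScript's `unescape` only decodes `%XX` and `%uXXXX` sequences, so such values pass through untouched. The single-row sample may have worked only because BigQuery itself interprets `\u522b` inside a regular (non-raw) string literal before the UDF sees it. Below is a minimal sketch in plain JavaScript; `decodeUnicodeEscapes` is a hypothetical name, and the `replace` call could serve as the body of the JS UDF in place of `unescape`.

```javascript
// Hypothetical helper: replaces every literal \uXXXX sequence (backslash, 'u',
// four hex digits) with the character it encodes and leaves all other text as-is.
function decodeUnicodeEscapes(s) {
  if (s === null || s === undefined) {
    return s;
  }
  return s.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// unescape() does not recognise the backslash format, so the input is unchanged.
const stored = "\\u522b\\u8001\\u662f\\u665a";
console.log(unescape(stored) === stored); // true

// The regex-based helper decodes the escapes and tolerates surrounding text.
console.log(decodeUnicodeEscapes(stored));
console.log(decodeUnicodeEscapes("$-\\u6599\\u91d1\\u304c\\u9ad8\\u3059\\u304e\\uff01"));
```

Because only the matched sequences are replaced, leading characters such as `$-` and comments without any escapes are returned unchanged.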

  • I also tried this code:

    CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
      IF(s NOT LIKE r'%\\u%', s,
         (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
          FROM UNNEST(SPLIT(s, r'\u')) AS x
          WHERE x != ''))
    );

    SELECT
      original,
      DecodeUnicode(original) AS decoded
    FROM (
      SELECT TRIM(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
      SELECT TRIM(r'abcd')
    );

This shows an error. I think it is because the comment starts with a special character?
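
That guess looks plausible if the comment contains literal `\u` escapes: splitting on `\u` returns the text before the first escape as its own segment, and `$-` cannot be read as a hexadecimal number, so the cast to `INT64` fails. The JavaScript sketch below only mimics the split step to make the segments visible; `findNonHexSegments` is a hypothetical helper, not a BigQuery function.

```javascript
// Hypothetical helper mimicking SPLIT(s, '\u') followed by the x != '' filter.
// It returns the segments that are not pure hexadecimal, i.e. the ones that
// a hex-to-integer cast would reject.
function findNonHexSegments(s) {
  return s
    .split("\\u")
    .filter(function (segment) { return segment !== ""; })
    .filter(function (segment) { return !/^[0-9a-fA-F]+$/.test(segment); });
}

console.log(findNonHexSegments("$-\\u6599\\u91d1\\u304c\\u9ad8")); // [ '$-' ]
console.log(findNonHexSegments("\\u522b\\u8001"));                 // []
```

Any plain text before, between, or after the escapes ends up in such a segment, so a split-based decoder only suits values made up entirely of escape sequences.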

Recommended answer

See if this works. It does the "manual" decoding for strings that have \u in them by converting to Unicode code points and then to a string. It should be faster than using JavaScript, too.

CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
  -- Raw strings keep the backslash literal; in a LIKE pattern, \\ matches one backslash.
  IF(s NOT LIKE r'%\\u%', s,
     (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, r'\u')) AS x
      WHERE x != ''))
);

SELECT
  original,
  DecodeUnicode(original) AS decoded
FROM (
  SELECT r'\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35' AS original UNION ALL
  SELECT r'abcd'
);

As output, this returns 别老是晚点，现场补行李费太贵 for the first row and abcd for the second.
