Get the top patent countries, codes in a BQ public dataset


Question

I am trying to use an analytic function to get the top 2 countries with patent applications, and within those top 2 countries, get the top 2 application kinds. For example, the answer will look something like this:

country  -   code 
US           P
US           A
GB           X
GB           P

Here is the query I am using to get this:

SELECT
  country_code,
  MIN(count_country_code) count_country_code,
  application_kind
FROM (
  WITH
    A AS (
    SELECT
      country_code,
      COUNT(country_code) OVER (PARTITION BY country_code) AS count_country_code,
      application_kind
    FROM
      `patents-public-data.patents.publications`),
    B AS (
    SELECT
      country_code,
      count_country_code,
      DENSE_RANK() OVER(ORDER BY count_country_code DESC) AS country_code_num,
      application_kind,
      DENSE_RANK() OVER(PARTITION BY country_code ORDER BY count_country_code DESC) AS application_kind_num
    FROM
      A)
  SELECT
    country_code,
    count_country_code,
    application_kind
  FROM
    B
  WHERE
    country_code_num <= 2
    AND application_kind_num <= 2) x
GROUP BY
  country_code,
  application_kind
ORDER BY
  count_country_code DESC

Unfortunately, I get a "memory exceeded" error because of the OVER / ORDER BY / PARTITION BY clauses. Here is the message:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 112% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%

How would I go about doing the above query (or other similar queries) without running into these memory errors? This can be tested on the public dataset patents-public-data.patents.publications.

One crude way to do it (which only works if the fields have fairly low cardinality) would be to run a straightforward aggregation and sort the results in memory outside the database.
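
A minimal sketch of that crude approach in Python, assuming the grouped counts (one row per country_code and application_kind, for instance from a plain GROUP BY query) have already been fetched into a list of tuples. The helper name `top_kinds_per_top_country` and the sample counts are hypothetical and only for illustration; they are not real figures from the dataset.

```python
from collections import defaultdict


def top_kinds_per_top_country(rows, n_countries=2, n_kinds=2):
    """Pick the top countries by total count, then the top kinds within each.

    rows: iterable of (country_code, application_kind, count) tuples,
    i.e. the output of a simple GROUP BY aggregation.
    """
    totals = defaultdict(int)        # country_code -> total publications
    per_country = defaultdict(list)  # country_code -> [(kind, count), ...]
    for country, kind, count in rows:
        totals[country] += count
        per_country[country].append((kind, count))

    # Sort countries by their total, largest first, and keep the first n.
    top_countries = sorted(totals, key=totals.get, reverse=True)[:n_countries]

    result = []
    for country in top_countries:
        # Within each kept country, sort kinds by count, largest first.
        kinds = sorted(per_country[country], key=lambda kc: kc[1], reverse=True)
        result.extend((country, kind, count) for kind, count in kinds[:n_kinds])
    return result


# Hypothetical counts, made up for this example.
sample_rows = [
    ("US", "P", 50), ("US", "A", 30), ("US", "X", 5),
    ("GB", "X", 20), ("GB", "P", 10), ("GB", "A", 1),
    ("FR", "A", 8),
]
top = top_kinds_per_top_country(sample_rows)
```

This only stays cheap while the number of distinct (country_code, application_kind) pairs is small enough to hold in memory on the client.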

Answer

Below is for BigQuery Standard SQL

#standardSQL
WITH A AS (
  SELECT country_code
  FROM `patents-public-data.patents.publications`
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    country_code,
    application_kind,
    COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications`
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (
  SELECT
    country_code,
    application_kind,
    application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
SELECT
  country_code,
  application_kind,
  application_kind_count
FROM C
WHERE application_kind_rank <= 2  
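
The ranking in C runs over the already-aggregated rows of B (one row per country and application kind) rather than over every publication row, so the sort behind the OVER() clause stays small. Note that DENSE_RANK gives tied counts the same rank, so a country can return more than two rows when counts tie. A small Python sketch of that ranking behavior, using hypothetical counts and a hypothetical helper `dense_rank_desc`:

```python
def dense_rank_desc(counts):
    """Return the dense rank of each value, with the largest value ranked 1.

    Equal values share a rank and no rank numbers are skipped,
    which mirrors DENSE_RANK() OVER (ORDER BY value DESC).
    """
    distinct = sorted(set(counts), reverse=True)
    rank_of = {value: position + 1 for position, value in enumerate(distinct)}
    return [rank_of[c] for c in counts]


# Hypothetical per-kind counts for a single country; "A" and "X" tie.
kind_counts = {"P": 50, "A": 30, "X": 30, "B": 5}
ranks = dict(zip(kind_counts, dense_rank_desc(list(kind_counts.values()))))

# Same filter as WHERE application_kind_rank <= 2.
kept = [kind for kind, rank in ranks.items() if rank <= 2]
```

Here the filter keeps three kinds because of the tie. If exactly two rows per country are required, ROW_NUMBER() could be used in place of DENSE_RANK().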

