在BQ公共数据集中获取顶级专利国家/地区的代码 [英] Get the top patent countries, codes in a BQ public dataset

查看:138
本文介绍了在BQ公共数据集中获取顶级专利国家/地区的代码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用一种分析功能来获取专利申请量排名前2的国家,而在那些排名前2位的国家中,则申请专利的前2个国家.例如,答案将如下所示:

I am trying to use an analytic function to get the top 2 countries with patent applications, and within those top 2 countries, get the top 2 application kinds. For example, the answer will look something like this:

country  -   code 
US           P
US           A
GB           X
GB           P

这是我正在使用的查询:

Here is the query I am using to get this:

SELECT
  country_code,
  MIN(count_country_code) count_country_code,
  application_kind
FROM (
  WITH
    A AS (
    SELECT
      country_code,
      COUNT(country_code) OVER (PARTITION BY country_code) AS count_country_code,
      application_kind
    FROM
      `patents-public-data.patents.publications`),
    B AS (
    SELECT
      country_code,
      count_country_code,
      DENSE_RANK() OVER(ORDER BY count_country_code DESC) AS country_code_num,
      application_kind,
      DENSE_RANK() OVER(PARTITION BY country_code ORDER BY count_country_code DESC) AS application_kind_num
    FROM
      A)
  SELECT
    country_code,
    count_country_code,
    application_kind
  FROM
    B
  WHERE
    country_code_num <= 2
    AND application_kind_num <= 2) x
GROUP BY
  country_code,
  application_kind
ORDER BY
  count_country_code DESC

但是,不幸的是,由于超量/订单/分区,我收到了内存超出"错误.这是消息:

However, unfortunately, I get a "memory exceeded" error due to the over/order/partition. Here is the message:

查询执行期间超出了资源:无法在分配的内存中执行查询.高峰使用:限制的112%.内存使用大户:用于分析OVER()子句的排序操作:98%其他/未分配:2%

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 112% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%

如何进行上述查询(或其他类似的查询)而又不会遇到这些内存错误?可以在公共数据集

How would I go about doing the above query (or other similar queries) without running into these memory errors? This can be tested on the public dataset here.

一种简单的方法(仅当字段的基数为半低时才有效)是将其作为简单的聚合操作并将结果存储在数据库外部的内存中.例如:

One crude way to do it (which only works if the fields have a semi-low cardinality), would be to do it as a straightforward aggregation operation and sort the results in-memory outside the DB. For example:

推荐答案

以下是BigQuery标准SQL

Below is for BigQuery Standard SQL

#standardSQL
WITH A AS (
  SELECT country_code
  FROM `patents-public-data.patents.publications`
  GROUP BY country_code
  ORDER BY COUNT(1) DESC
  LIMIT 2
), B AS (
  SELECT
    country_code,
    application_kind,
    COUNT(1) application_kind_count
  FROM `patents-public-data.patents.publications`
  WHERE country_code IN (SELECT country_code FROM A)
  GROUP BY country_code, application_kind
), C AS (
  SELECT
    country_code,
    application_kind,
    application_kind_count,
    DENSE_RANK() OVER(PARTITION BY country_code ORDER BY application_kind_count DESC) AS application_kind_rank
  FROM B
)
SELECT
  country_code,
  application_kind,
  application_kind_count
FROM C
WHERE application_kind_rank <= 2  

有结果

这篇关于在BQ公共数据集中获取顶级专利国家/地区的代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆