Google使用SQL的大查询:当有多个受让人时,将受让人姓名和统一受理人姓名关联起来 [英] Google's Big Query using SQL: Associate the assignee name and harmonized assignee name when there are multiple assignees

查看:211
本文介绍了Google使用SQL的大查询:当有多个受让人时,将受让人姓名和统一受理人姓名关联起来的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的目标是用Google的Big Query patents-patents-patents.patents.publications_201710表格创建一个表格,该表格使用标准SQL,其中有一行用于publication_number,assignee和assignee_harmonized.name,其中publication_number重复用于具有多个记录的记录受让人。以下是我想要的输出的一个例子:
$ b

publication_number | assignee | assignee_harm



US-6044964-A | Sony株式会社| SONY CORP

US-6044964-A |数字音频光盘公司| DIGITAL AUDIO DISC CORP



美国-8746747-B2 | IPS Corporation-Weld-On Division | IPS CORPORATION焊接部门



US-8746747-B2 | null | MCPHERSON TERRY R



我尝试了以下基于UNNEST建议的查询,该建议发现于本文

  #standard SQL 
SELECT
p.publication_number,
p.assignee,
a.name AS assignee_harm
FROM
` patents-public-data.patents.publications_201710` AS p ,
UNNEST(assignee_harmonized)AS
WHERE
p.publication_number IN('US-6044964-A',
'US-8746747-B2')

Howeve r,输出显示如下:

行| publication_number | assignee | assignee_harm


1 | US-6044964 -A | Sony Corporation | SONY CORP

|| Digital Audio Disc Corporation |

2 | US-6044964 -A |索尼公司| DIGITAL AUDIO DISC CORP

||数字音频光盘公司|||

3 | US -8746747-B2 | IPS Corporation-Weld-On Division | MCPHERSON TERRY R

4 | US-8746747-B2 | IPS Corporation-Weld-On Division | IPS CORPORATION- WELD ON DIVISION



您可以看到Sony Corporation受让人与第2行中的DIGITAL AUDIO DISC CORP统一名称不适当地关联,出现类似问题第3行。另外,第1行和第2行每行都包含两行,但不重复publication_number标识符。我没有看到一个简单的方法来做到这一点,因为受让人的数量并不总是等于assignee_harmonized.name的数量,并且它们并不总是以相同的顺序出现(否则我可以尝试创建两个表并以某种方式合并它们)。另一方面,必须有一种方法将受让人变量与其统一价值assignee_harmonized.name相关联,否则失去了统一价值的目的。你可以请建议一个查询(或一组查询),当有多个受让人或多个assignee_harmonized.name或两者都会产生所需的输出?

解决方案

您正在查询一个字符串和两个数组 - 整个事情基本如下所示:

  {
publication_number:US-8746747-B2,
受让人:[
IPS Corporation-Weld-On分部
],
assignee_harm:[
MCPHERSON TERRY R,
IPS CORPORATION-分部焊接
]
}

所以这就是数据,你需要决定如何处理它们的组合......要么交叉连接所有的东西:

 #标准SQL 
SELECT
p.publication_number,
受让人,
assignee_harmonized.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710`由于p
('US-6044964-A','US-8746747-B2')
<美国专利号为US-6044964-A,US-8746747-B2'的美国专利申请人, / code>

..它给你关系数据..或者把它作为两个单独的数组:

 #标准SQL 
SELECT
p.publication_number,
受让人,
ARRAY((SELECT name FROM p。 ('US-6044964-A','Assignee_harmonized))AS assignee_harm
FROM
` patents-public-data.patents.publications_201710` AS p
where
p.publication_number IN )

您也可以将此嵌套结果另存为bq中的表格。

My goal is to create a table from Google's Big Query patents-public-data.patents.publications_201710 table using standard SQL that has one row for the publication_number, assignee and assignee_harmonized.name where the publication_number is repeated for records that have multiple assignees. Here's an example of my desired output:

publication_number|assignee|assignee_harm

US-6044964-A|Sony Corporation|SONY CORP

US-6044964-A|Digital Audio Disc Corporation|DIGITAL AUDIO DISC CORP

US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION

US-8746747-B2|null|MCPHERSON TERRY R

I've tried the following query based off of the UNNEST suggestion found in this post

#standard SQL
SELECT
  p.publication_number,
  p.assignee,
  a.name AS assignee_harm
FROM
  `patents-public-data.patents.publications_201710` AS p,
  UNNEST(assignee_harmonized) AS a
WHERE
  p.publication_number IN ('US-6044964-A',
    'US-8746747-B2')

However, the output appears as follows:

row|publication_number|assignee|assignee_harm

1|US-6044964-A|Sony Corporation|SONY CORP

||Digital Audio Disc Corporation|

2|US-6044964-A|Sony Corporation|DIGITAL AUDIO DISC CORP

||Digital Audio Disc Corporation|

3|US-8746747-B2|IPS Corporation—Weld-On Division|MCPHERSON TERRY R

4|US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION

You can see that the "Sony Corporation" assignee is inappropriately associated with the "DIGITAL AUDIO DISC CORP" harmonized name in row 2 with a similar issue appearing in row 3. Also, rows 1 and 2 contain two lines each but don't repeat the publication_number identifier. I don't see a straightforward way to do this because the number of "assignee" doesn't always equal the number of "assignee_harmonized.name" and they don't always appear in the same order (otherwise I could try creating two tables and merging them somehow). On the other hand, there has to be a way to associate the "assignee" variable with its harmonized value "assignee_harmonized.name", otherwise the purpose of having a harmonized value is lost. Could you please suggest a query (or set of queries) that will produce the desired output when there are either multiple "assignee" or multiple "assignee_harmonized.name" or both?

解决方案

You're querying for a string and two arrays - the whole thing basically looks like this:

  {
    "publication_number": "US-8746747-B2",
    "assignee": [
      "IPS Corporation—Weld-On Division"
    ],
    "assignee_harm": [
      "MCPHERSON TERRY R",
      "IPS CORPORATION—WELD ON DIVISION"
    ]
  }

So that's the data and you somehow need to decide how to treat the combination of them ... either you cross join everything:

#standard SQL
SELECT
  p.publication_number,
  assignee,
  assignee_harmonized.name  AS assignee_harm
FROM
  `patents-public-data.patents.publications_201710` AS p
  ,p.assignee assignee
  ,p.assignee_harmonized AS assignee_harmonized
WHERE
  p.publication_number IN ('US-6044964-A','US-8746747-B2')

.. which gives you relational data .. or you leave it as two separate arrays:

#standard SQL
SELECT
  p.publication_number,
  assignee,
  ARRAY( (SELECT name FROM p.assignee_harmonized)) AS assignee_harm
FROM
  `patents-public-data.patents.publications_201710` AS p
WHERE
  p.publication_number IN ('US-6044964-A','US-8746747-B2')

You can save this nested result as a table in bq as well.

这篇关于Google使用SQL的大查询:当有多个受让人时,将受让人姓名和统一受理人姓名关联起来的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆