Google使用SQL的大查询:当有多个受让人时,将受让人姓名和统一受理人姓名关联起来 [英] Google's Big Query using SQL: Associate the assignee name and harmonized assignee name when there are multiple assignees
问题描述
我的目标是用Google的Big Query patents-patents-patents.patents.publications_201710表格创建一个表格,该表格使用标准SQL,其中有一行用于publication_number,assignee和assignee_harmonized.name,其中publication_number重复用于具有多个记录的记录受让人。以下是我想要的输出的一个例子:
$ b
publication_number | assignee | assignee_harm
US-6044964-A | Sony株式会社| SONY CORP
US-6044964-A |数字音频光盘公司| DIGITAL AUDIO DISC CORP
美国-8746747-B2 | IPS Corporation-Weld-On Division | IPS CORPORATION焊接部门
US-8746747-B2 | null | MCPHERSON TERRY R
我尝试了以下基于UNNEST建议的查询,该建议发现于本文
#standard SQL
SELECT
p.publication_number,
p.assignee,
a.name AS assignee_harm
FROM
` patents-public-data.patents.publications_201710` AS p ,
UNNEST(assignee_harmonized)AS
WHERE
p.publication_number IN('US-6044964-A',
'US-8746747-B2')
Howeve r,输出显示如下:
行| publication_number | assignee | assignee_harm
1 | US-6044964 -A | Sony Corporation | SONY CORP
|| Digital Audio Disc Corporation |
2 | US-6044964 -A |索尼公司| DIGITAL AUDIO DISC CORP
||数字音频光盘公司|||
3 | US -8746747-B2 | IPS Corporation-Weld-On Division | MCPHERSON TERRY R
4 | US-8746747-B2 | IPS Corporation-Weld-On Division | IPS CORPORATION- WELD ON DIVISION
您可以看到Sony Corporation受让人与第2行中的DIGITAL AUDIO DISC CORP统一名称不适当地关联,出现类似问题第3行。另外,第1行和第2行每行都包含两行,但不重复publication_number标识符。我没有看到一个简单的方法来做到这一点,因为受让人的数量并不总是等于assignee_harmonized.name的数量,并且它们并不总是以相同的顺序出现(否则我可以尝试创建两个表并以某种方式合并它们)。另一方面,必须有一种方法将受让人变量与其统一价值assignee_harmonized.name相关联,否则失去了统一价值的目的。你可以请建议一个查询(或一组查询),当有多个受让人或多个assignee_harmonized.name或两者都会产生所需的输出?
您正在查询一个字符串和两个数组 - 整个事情基本如下所示:
{
publication_number:US-8746747-B2,
受让人:[
IPS Corporation-Weld-On分部
],
assignee_harm:[
MCPHERSON TERRY R,
IPS CORPORATION-分部焊接
]
}
所以这就是数据,你需要决定如何处理它们的组合......要么交叉连接所有的东西:
#标准SQL
SELECT
p.publication_number,
受让人,
assignee_harmonized.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710`由于p
('US-6044964-A','US-8746747-B2')
<美国专利号为US-6044964-A,US-8746747-B2'的美国专利申请人, / code>
..它给你关系数据..或者把它作为两个单独的数组:
#标准SQL
SELECT
p.publication_number,
受让人,
ARRAY((SELECT name FROM p。 ('US-6044964-A','Assignee_harmonized))AS assignee_harm
FROM
` patents-public-data.patents.publications_201710` AS p
where
p.publication_number IN )
您也可以将此嵌套结果另存为bq中的表格。
My goal is to create a table from Google's Big Query patents-public-data.patents.publications_201710 table using standard SQL that has one row for the publication_number, assignee and assignee_harmonized.name where the publication_number is repeated for records that have multiple assignees. Here's an example of my desired output:
publication_number|assignee|assignee_harm
US-6044964-A|Sony Corporation|SONY CORP
US-6044964-A|Digital Audio Disc Corporation|DIGITAL AUDIO DISC CORP
US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION
US-8746747-B2|null|MCPHERSON TERRY R
I've tried the following query based off of the UNNEST suggestion found in this post
#standard SQL
SELECT
p.publication_number,
p.assignee,
a.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p,
UNNEST(assignee_harmonized) AS a
WHERE
p.publication_number IN ('US-6044964-A',
'US-8746747-B2')
However, the output appears as follows:
row|publication_number|assignee|assignee_harm
1|US-6044964-A|Sony Corporation|SONY CORP
||Digital Audio Disc Corporation|
2|US-6044964-A|Sony Corporation|DIGITAL AUDIO DISC CORP
||Digital Audio Disc Corporation|
3|US-8746747-B2|IPS Corporation—Weld-On Division|MCPHERSON TERRY R
4|US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION
You can see that the "Sony Corporation" assignee is inappropriately associated with the "DIGITAL AUDIO DISC CORP" harmonized name in row 2 with a similar issue appearing in row 3. Also, rows 1 and 2 contain two lines each but don't repeat the publication_number identifier. I don't see a straightforward way to do this because the number of "assignee" doesn't always equal the number of "assignee_harmonized.name" and they don't always appear in the same order (otherwise I could try creating two tables and merging them somehow). On the other hand, there has to be a way to associate the "assignee" variable with its harmonized value "assignee_harmonized.name", otherwise the purpose of having a harmonized value is lost. Could you please suggest a query (or set of queries) that will produce the desired output when there are either multiple "assignee" or multiple "assignee_harmonized.name" or both?
You're querying for a string and two arrays - the whole thing basically looks like this:
{
"publication_number": "US-8746747-B2",
"assignee": [
"IPS Corporation—Weld-On Division"
],
"assignee_harm": [
"MCPHERSON TERRY R",
"IPS CORPORATION—WELD ON DIVISION"
]
}
So that's the data and you somehow need to decide how to treat the combination of them ... either you cross join everything:
#standard SQL
SELECT
p.publication_number,
assignee,
assignee_harmonized.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p
,p.assignee assignee
,p.assignee_harmonized AS assignee_harmonized
WHERE
p.publication_number IN ('US-6044964-A','US-8746747-B2')
.. which gives you relational data .. or you leave it as two separate arrays:
#standard SQL
SELECT
p.publication_number,
assignee,
ARRAY( (SELECT name FROM p.assignee_harmonized)) AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p
WHERE
p.publication_number IN ('US-6044964-A','US-8746747-B2')
You can save this nested result as a table in bq as well.
这篇关于Google使用SQL的大查询:当有多个受让人时,将受让人姓名和统一受理人姓名关联起来的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!