Find all two-word phrases that appear in more than one row in a dataset


Problem description

We would like to run a query that returns the two-word phrases that appear in more than one row. For example, take the string "Data Ninja": since it appears in more than one row of our dataset, the query should return it. The query should find all such phrases by examining every pair of adjacent words (each pair forming a phrase) in every row of the dataset we loaded into BigQuery.

How can we write this query in Google BigQuery?

The dataset is simply a long list of English sentences.

Recommended answer

Good news: BigQuery now supports SPLIT(). See https://stackoverflow.com/a/24172995/132438.

This is a hack, but a hack I happen to like :).

In its current form, it only works for sentences with more than 2 words, and it only extracts the first 6 pairs (one per subquery). You can extend and test from here.

Try it on your data, and please report back.

SELECT pairs, COUNT(*) c FROM
(
  SELECT REGEXP_REPLACE(title, '([^\s]+ ){0}([^\s]* [^\s]+).*', '\2') pairs, title
  FROM [bigquery-samples:reddit.full]
),
(
  SELECT REGEXP_REPLACE(title, '([^\s]+ ){1}([^\s]* [^\s]+).*', '\2') pairs, title
  FROM [bigquery-samples:reddit.full]
),
(
  SELECT REGEXP_REPLACE(title, '([^\s]+ ){2}([^\s]* [^\s]+).*', '\2') pairs, title
  FROM [bigquery-samples:reddit.full]
),
(
  SELECT REGEXP_REPLACE(title, '([^\s]+ ){3}([^\s]* [^\s]+).*', '\2') pairs, title
  FROM [bigquery-samples:reddit.full]
),
(
  SELECT REGEXP_REPLACE(title, '([^\s]+ ){4}([^\s]* [^\s]+).*', '\2') pairs, title
  FROM [bigquery-samples:reddit.full]
),
(
  SELECT REGEXP_REPLACE(title, '([^\s]+ ){5}([^\s]* [^\s]+).*', '\2') pairs, title
  FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000
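
To see why the query only handles sentences with more than 2 words, it can help to try the same pattern locally. The following is a minimal Python sketch, not part of the original answer: `pair_at` is a hypothetical helper name, and it assumes Python's `re` module treats this particular pattern the same way BigQuery's REGEXP_REPLACE does. It shows that each subquery skips k leading words and keeps the next two, and that a title with no pair at that position comes back unchanged, which is what `WHERE pairs != title` filters out.

```python
import re


def pair_at(title, k):
    """Mimic one subquery: skip k leading words, keep the next two.

    Hypothetical helper for illustration; the pattern is copied from the
    query above with {k} substituted in.
    """
    pattern = r'([^\s]+ ){%d}([^\s]* [^\s]+).*' % k
    # When the pattern does not match, re.sub returns the title unchanged,
    # just as the query relies on for its WHERE pairs != title filter.
    return re.sub(pattern, r'\2', title)


if __name__ == "__main__":
    for title in ["data ninja rocks", "data ninja"]:
        for k in range(3):
            result = pair_at(title, k)
            kept = result != title  # same test as WHERE pairs != title
            print(f"{title!r} k={k} -> {result!r} kept={kept}")
```

For a two-word title the only candidate pair equals the whole title, so the filter drops it; that appears to be the source of the "more than 2 words" limitation.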

The query's results might contain NSFW words. The sample dataset comes from an online community and has not been "cleaned up". Do not run the query if you are sensitive to such words.
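
To sanity-check the query's output on a small sample of your own rows, a plain Python reference can be useful. This is only a sketch under stated assumptions: `repeated_pairs` is a hypothetical name, words are split on whitespace, and a phrase is counted once per row (matching the question's "more than one row"), whereas the query above counts every occurrence it extracts.

```python
from collections import defaultdict


def repeated_pairs(rows):
    """Return {phrase: number_of_rows} for adjacent word pairs that occur
    in more than one row. Hypothetical helper for illustration."""
    rows_by_pair = defaultdict(set)
    for row_id, text in enumerate(rows):
        words = text.split()
        # zip(words, words[1:]) yields every pair of adjacent words.
        for left, right in zip(words, words[1:]):
            rows_by_pair[f"{left} {right}"].add(row_id)
    return {pair: len(ids) for pair, ids in rows_by_pair.items() if len(ids) > 1}


if __name__ == "__main__":
    sample = [
        "the Data Ninja strikes",
        "hire a Data Ninja",
        "nothing shared here",
    ]
    print(repeated_pairs(sample))
```

Unlike the regex version, this handles any number of pairs per row, so it can also show which phrases the 6-pair limit would miss.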
