Find all two-word phrases that appear in more than one row in a dataset


Problem description

We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja": since it appears in more than one row of our dataset, the query should return it. The query should find all such phrases by looking at every pair of adjacent words in each row of the dataset we loaded into BigQuery.

How can we write this query in Google BigQuery?

The dataset is simply a long list of English sentences.

Solution

Good news: BigQuery now supports SPLIT(). Check https://stackoverflow.com/a/24172995/132438.


This is a hack, but a hack I happen to like :).

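In its current form, it only works for sentences with more than 2 words, and it only extracts the first 6 pairs. The first limit comes from the WHERE pairs != title filter: REGEXP_REPLACE returns the title unchanged when the pattern does not match, so that filter is what discards non-matches, and it also discards titles that consist of exactly one pair. You can extend and test from here.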

Try it on your data, and please report back.

SELECT pairs, COUNT(*) c FROM
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000

Results might contain NSFW words. The sample dataset comes from an online community and has not been "cleaned up". Abstain from running the query if you are sensitive to some words.
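To see what the query above computes without running it on BigQuery, here is a minimal Python sketch of the same idea. It is an illustration only, not part of the original answer: the function names (pair_at, repeated_pairs_regex, repeated_pairs_split) are made up for this example, and Python's re module stands in for BigQuery's regex engine, which may differ on edge cases such as repeated spaces. The second function shows what a SPLIT()-style approach might look like, without the limit of 6 pairs.

```python
import re
from collections import Counter


def pair_at(title, k):
    """Mimic one subquery: skip k words, keep the next two, drop the rest.

    Returns the title unchanged when there is no pair at position k,
    which is what the query's WHERE pairs != title filter relies on.
    """
    pattern = r'([^\s]+ ){' + str(k) + r'}([^\s]* [^\s]+).*'
    return re.sub(pattern, r'\2', title)


def repeated_pairs_regex(titles, max_pairs=6):
    """Mimic the full query: union the subqueries, filter, group, count."""
    counts = Counter()
    for title in titles:
        for k in range(max_pairs):
            pair = pair_at(title, k)
            if pair != title:  # same role as WHERE pairs != title
                counts[pair] += 1
    # same role as HAVING c > 1
    return {pair: c for pair, c in counts.items() if c > 1}


def repeated_pairs_split(titles):
    """Split-based variant (assumed analogue of using SPLIT()): no 6-pair limit."""
    counts = Counter()
    for title in titles:
        words = title.split()
        counts.update(' '.join(p) for p in zip(words, words[1:]))
    return {pair: c for pair, c in counts.items() if c > 1}


titles = [
    "Data Ninja rocks hard",
    "Be a Data Ninja today",
    "hello world",
]
print(repeated_pairs_regex(titles))
print(repeated_pairs_split(titles))
```

Note that both versions, like the query, count occurrences of a pair rather than distinct rows, so a pair that appears twice inside a single title would also be reported. The split-based variant additionally keeps titles that are exactly two words long, which the regex version filters out.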
