How to identify stopwords with BigQuery?

Views: 122 Published: 2018/5/7 17:43:43 Tags: sql google-bigquery text-analysis



Problem description

I'm looking at Reddit comments. I'm using some common stopword lists, but I want to create a custom one for this dataset. How can I do this with SQL?

Solution
One approach to identifying stopwords is to look for the words that show up in the most documents.

Steps in this query:

Filter comments for relevance and quality (choose your subreddits, a minimum score, and a minimum length).

Unescape Reddit's HTML-encoded values.

Decide what counts as a word (in this case r'[a-z]{1,20}\'?[a-z]+').

Count each word only once per document (comment), regardless of how many times it is repeated in that comment.

Get the top x words, ranked by the number of documents they show up in.
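The counting steps above can be tried locally on a handful of strings before running anything in BigQuery. The following is a minimal Python sketch, not the answer's own code: the function name `document_frequency`, its `min_words` and `top` parameters, and the sample comments are all assumptions made for illustration. It uses the same word pattern as the answer and counts each word at most once per document.

```python
import re
from collections import Counter

# Same word pattern as in the answer: 1-20 letters, an optional apostrophe,
# then one or more letters (so a word needs at least two letters).
WORD_RE = re.compile(r"[a-z]{1,20}'?[a-z]+")

def document_frequency(docs, min_words=0, top=100):
    """Return (word, number of documents containing it) pairs, most common first."""
    counts = Counter()
    for text in docs:
        words = WORD_RE.findall(text.lower())
        if len(words) <= min_words:   # plays the role of the minimum-length filter
            continue
        counts.update(set(words))     # set(): each word counts once per document
    return counts.most_common(top)

# Hypothetical sample comments, for illustration only.
sample = [
    "the cat sat on the mat and the dog",
    "the dog barked",
    "a bird sang",
]
print(document_frequency(sample, top=3))
```

Here "the" appears three times in the first comment but contributes only one document to its count. Words tied on the same count may come back in any order, so only the counts themselves should be relied on.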

Query:

#standardSQL
WITH words_by_post AS (
  SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
    REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&amp;', '&'), r'&[a-z]{2,4};', '*')
      , r'[a-z]{1,20}\'?[a-z]+') words
  FROM `fh-bigquery.reddit_comments.2017_07`
  WHERE body NOT IN ('[deleted]', '[removed]')
  AND subreddit IN ('AskReddit', 'funny', 'movies')
  AND score > 100
), words_per_doc AS (
  SELECT id, word
  FROM words_by_post, UNNEST(words) word
  WHERE ARRAY_LENGTH(words) > 20
  GROUP BY id, word
)
SELECT word, COUNT(*) docs_with_word
FROM words_per_doc
GROUP BY 1
ORDER BY docs_with_word DESC
LIMIT 100
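To see what the nested REGEXP_REPLACE and REGEXP_EXTRACT_ALL calls do to a single comment, here is a small Python sketch of the same cleaning chain. It is an approximation under assumptions, not BigQuery itself: the helper name `extract_words` and the sample sentence are hypothetical, and the first replacement is assumed to be meant to turn a double-escaped `&amp;` back into `&` (replacing `&` with itself would do nothing).

```python
import re

def extract_words(body):
    """Mirror LOWER, the two REGEXP_REPLACE calls, and REGEXP_EXTRACT_ALL."""
    text = body.lower()
    text = text.replace("&amp;", "&")           # assumed intent: undo double escaping
    text = re.sub(r"&[a-z]{2,4};", "*", text)   # replace entities such as &gt; with '*'
    return re.findall(r"[a-z]{1,20}'?[a-z]+", text)

print(extract_words("I don't think A &amp;gt; B, but &lt;maybe&gt; it's fine"))
# prints ["don't", 'think', 'but', 'maybe', "it's", 'fine']
```

Two things stand out in the output. Contractions such as "don't" survive as single words because of the optional apostrophe, and one-letter words such as "I" and "a" are never extracted because the pattern requires at least two letters.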
