How to identify stopwords with BigQuery?


Problem description


I'm looking at reddit comments. I'm using some common stopword lists, but I want to create a custom one for this dataset. How can I do this with SQL?

Solution

One way to identify stopwords is to find the words that appear in the largest number of documents.

Steps in this query:

  1. Filter comments for relevance and quality (choose your subreddits, a minimum score, and a minimum length).
  2. Unescape reddit's HTML-encoded values.
  3. Decide what counts as a word (in this case r'[a-z]{1,20}\'?[a-z]+').
  4. Count each word only once per document (comment), regardless of how many times it is repeated in that comment.
  5. Get the top x words, ranked by the number of documents they appear in.

Query:

#standardSQL
WITH words_by_post AS (
  -- Steps 1-3: filter comments, unescape HTML entities, extract lowercase words.
  SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
    REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&amp;', '&'), r'&[a-z]{2,4};', '*')
      , r'[a-z]{1,20}\'?[a-z]+') words
  FROM `fh-bigquery.reddit_comments.2017_07`
  WHERE body NOT IN ('[deleted]', '[removed]')
  AND subreddit IN ('AskReddit', 'funny', 'movies')
  AND score > 100
), words_per_doc AS (
  -- Step 4: keep comments with more than 20 words; GROUP BY leaves one row per (comment, word).
  SELECT id, word
  FROM words_by_post, UNNEST(words) word
  WHERE ARRAY_LENGTH(words) > 20
  GROUP BY id, word
)

-- Step 5: count the comments each word appears in and keep the 100 most common.
SELECT word, COUNT(*) docs_with_word
FROM words_per_doc
GROUP BY 1
ORDER BY docs_with_word DESC
LIMIT 100
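
To check the logic on a handful of comments before running the query, the same five steps can be sketched locally. This is only an illustrative Python sketch, not part of the original answer: the function names `extract_words` and `top_document_frequencies`, the `min_words` and `top_n` parameters, and the sample comments are all assumptions made for the example. It covers only the text handling, so the subreddit and score filters are left out.

```python
import re
from collections import Counter

# Same word pattern as the query: 1-20 letters, an optional apostrophe, then 1+ letters.
WORD_RE = re.compile(r"[a-z]{1,20}'?[a-z]+")
# Any remaining short HTML entity such as &gt; or &lt; is replaced with '*'.
ENTITY_RE = re.compile(r"&[a-z]{2,4};")


def extract_words(body):
    """Lowercase the text, unescape '&amp;', mask other entities, extract words."""
    text = body.lower().replace("&amp;", "&")
    text = ENTITY_RE.sub("*", text)
    return WORD_RE.findall(text)


def top_document_frequencies(comments, min_words=20, top_n=100):
    """Return the top_n (word, number_of_comments_containing_it) pairs."""
    doc_counts = Counter()
    for body in comments:
        if body in ("[deleted]", "[removed]"):
            continue
        words = extract_words(body)
        if len(words) <= min_words:  # mirrors ARRAY_LENGTH(words) > 20
            continue
        doc_counts.update(set(words))  # each word counts once per comment
    return doc_counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical sample data; min_words is lowered so the short comments qualify.
    sample = [
        "The cat &amp; the dog",
        "the bird isn't here &gt; there",
        "[deleted]",
        "A dog",
    ]
    print(top_document_frequencies(sample, min_words=2, top_n=1))  # [('the', 2)]
```

Note that the word pattern needs at least two letters, so single-letter words such as "a" never show up in the result, either here or in the query.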
