How can I compute TF/IDF with SQL (BigQuery)


Question

I'm doing text analysis over reddit comments, and I want to calculate the TF-IDF within BigQuery.

Solution

This query works in 5 stages:

1. Obtain all the reddit posts I'm interested in. Normalize the words (LOWER, keep only letters and ', unescape some HTML). Split those words into an array.
2. Calculate the tf (term frequency) for each word in each doc: count how many times it shows up in each doc, relative to the number of words in that doc.
3. For each word, calculate the number of docs that contain it.
4. From (3.), obtain the idf (inverse document frequency): "inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient".
5. Multiply tf*idf to obtain tf-idf.

This query manages to do all of this in one pass, by passing the obtained values up the chain of subqueries.

    #standardSQL
    WITH words_by_post AS (
      -- Stage 1: pick the posts, normalize the text, and split it into an array of words.
      SELECT CONCAT(link_id, '/', id) AS id
        , REGEXP_EXTRACT_ALL(
            REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&amp;', '&'), r'&[a-z]{2,4};', '*')
            , r'[a-z]{2,20}\'?[a-z]+') AS words
        , COUNT(*) OVER() AS docs_n  -- total number of selected posts
      FROM `fh-bigquery.reddit_comments.2017_07`
      WHERE body NOT IN ('[deleted]', '[removed]')
        AND subreddit = 'movies'
        AND score > 100
    ), words_tf AS (
      -- Stage 2: tf = occurrences of the word in the post / number of words in the post.
      SELECT id, word
        , COUNT(*) / ARRAY_LENGTH(ANY_VALUE(words)) AS tf
        , ARRAY_LENGTH(ANY_VALUE(words)) AS words_in_doc
        , ANY_VALUE(docs_n) AS docs_n
      FROM words_by_post, UNNEST(words) AS word
      GROUP BY id, word
      HAVING words_in_doc > 30  -- keep only posts with more than 30 words
    ), docs_idf AS (
      -- Stages 3 and 4: count the posts containing each word, then idf = LOG(docs_n / docs_with_word).
      SELECT tf.id, word, tf.tf
        , ARRAY_LENGTH(tfs) AS docs_with_word
        , LOG(docs_n / ARRAY_LENGTH(tfs)) AS idf  -- LOG is the natural logarithm
      FROM (
        SELECT word, ARRAY_AGG(STRUCT(tf, id, words_in_doc)) AS tfs, ANY_VALUE(docs_n) AS docs_n
        FROM words_tf
        GROUP BY 1
      ), UNNEST(tfs) AS tf
    )
    -- Stage 5: tf-idf = tf * idf.
    SELECT *, tf * idf AS tfidf
    FROM docs_idf
    WHERE docs_with_word > 1
    ORDER BY tfidf DESC
    LIMIT 1000
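
As a sanity check on the arithmetic, here is a minimal Python sketch of the same five stages on a toy in-memory corpus. It is not part of the original answer: the function name `tf_idf` and the sample documents are made up for illustration, and it leaves out the HTML unescaping and the `words_in_doc > 30` and `docs_with_word > 1` filters that the query applies.

```python
import math
import re
from collections import Counter


def tf_idf(docs):
    """Hypothetical helper: docs maps doc id -> raw text.

    Returns a dict mapping (doc_id, word) -> tf-idf score.
    """
    # Stage 1: lowercase the text and split it into words,
    # using the same word pattern as the query.
    words_by_doc = {
        doc_id: re.findall(r"[a-z]{2,20}'?[a-z]+", text.lower())
        for doc_id, text in docs.items()
    }
    docs_n = len(words_by_doc)

    # Stage 2: tf = occurrences of the word in the doc / words in the doc.
    tf = {}
    for doc_id, words in words_by_doc.items():
        for word, count in Counter(words).items():
            tf[(doc_id, word)] = count / len(words)

    # Stage 3: number of docs that contain each word
    # (each (doc_id, word) key appears once, so this counts docs).
    docs_with_word = Counter(word for (_, word) in tf)

    # Stage 4: idf = natural log of (total docs / docs containing the word).
    # Stage 5: tf-idf = tf * idf.
    return {
        (doc_id, word): value * math.log(docs_n / docs_with_word[word])
        for (doc_id, word), value in tf.items()
    }


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    sample = {
        "d1": "the cat sat on the mat",
        "d2": "the dog sat",
        "d3": "the cat and dog",
    }
    scores = tf_idf(sample)
    for key, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(key, round(score, 4))
```

In this sample, "the" appears in all three documents, so its idf is the logarithm of 3/3, which is 0, and its tf-idf is 0 everywhere. "mat" appears only in d1, which has five matched words ("on" is too short for the pattern), so its score is 0.2 times the natural log of 3.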
    
