BigQuery / DataPrep: Efficient way to extract word counts; to convert HTML to plaintext


Question


I have a table of ~4.7M documents stored in BigQuery. Some are plaintext, some HTML. They're around 2k tokens each, with wide variation. I'm mainly using DataPrep to do my processing.

I want to extract those tokens and calculate TF-IDF values.

Token counting

One of the more time-intensive steps is taking this:

id, document
1, "foo bar foo baz"
2, "foo bar bar qux"

And turning it into this:

id, word, count
1, foo, 2
1, bar, 1
1, baz, 1
2, foo, 1
2, bar, 2
2, qux, 1

One way to do it is this:

  1. extractlist on document by {alphanum-underscore}+

     id, wordlist
     1, ["foo", "bar", "foo", "baz"]
     2, ["foo", "bar", "bar", "qux"]

  2. flatten wordlist

     id, word
     1, foo
     1, bar
     1, foo
     1, baz
     2, foo
     2, bar
     2, bar
     2, qux

  3. aggregate by group: id, word, value: count()

     id, word, count
     1, foo, 2
     1, bar, 1
     1, baz, 1
     2, foo, 1
     2, bar, 2
     2, qux, 1

However, steps 2 & 3 are very slow, especially with large documents.

Ideally, I'd be able to have a function that converts ["foo", "bar", "foo", "baz"] into {"foo":2, "bar":1, "baz":1}. That wouldn't require the flatten-then-group operation to extract the count, and the subsequent flatten would be smaller (since it's operating on unique terms rather than each term).

I've not figured out any way to do that in DataPrep, however. :-/
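
To illustrate the shape of the transform I'm after, here it is in BigQuery standard SQL (purely illustrative; I don't know of a DataPrep equivalent):

#standardSQL
-- Illustrative only: turn one hard-coded wordlist into per-word counts,
-- without a flatten-then-group pass over the whole table.
SELECT ARRAY(
  SELECT AS STRUCT word, COUNT(1) AS cnt
  FROM UNNEST(["foo", "bar", "foo", "baz"]) AS word
  GROUP BY word
) AS word_counts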

What's a more efficient way to do this?

HTML to plaintext

My source data is a combination of plaintext and HTML. Only about 800k of the 3.7M documents have plaintext available.

I'd like to convert the HTML to plaintext in some reasonable way (e.g. the equivalent of Nokogiri's #content) that would work at this scale, so that I can then do token extraction on the result.

I could spin up a cluster that does a bq query, ingests the HTML, processes it with Nokogiri, and outputs it to a processed table. But that's kinda complicated and requires a lot of I/O.

Is there an easier / more efficient way to do this?

Solution

I think you can do it all within BigQuery.
The query below should give you a good start: it gives you word frequency within each document and across the whole corpus, with the HTML stripped out along with words that are just digits.
You can then add any extra processing on top, including TF-IDF (a sketch of that follows the query).

#standardSQL
WITH removed_html AS (
  -- strip HTML by replacing anything between < and > with a space
  SELECT id, REGEXP_REPLACE(document, r'<[^>]*>', ' ') AS document
  FROM `yourTable`
),
words_in_documents AS (
  -- per-document word counts, dropping tokens that are only digits
  SELECT id,
    ARRAY(
      SELECT AS STRUCT word, COUNT(1) AS cnt
      FROM UNNEST(REGEXP_EXTRACT_ALL(document, r'[\w_]+')) AS word
      GROUP BY word
      HAVING NOT REGEXP_CONTAINS(word, r'^\d+$')
    ) AS words
  FROM removed_html
),
words_in_corpus AS (
  -- corpus-wide counts: sum the per-document counts for each word
  SELECT word, SUM(cnt) AS cnt
  FROM words_in_documents, UNNEST(words) AS words
  GROUP BY word
)
SELECT * 
FROM words_in_corpus
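
As for the TF-IDF part, here is a minimal sketch of how it could be layered on top, reusing the same removed_html / words_in_documents CTEs. The doc_totals, doc_freq and corpus names and the LN-based IDF formula are illustrative choices, not the only way to do it:

#standardSQL
WITH removed_html AS (
  SELECT id, REGEXP_REPLACE(document, r'<[^>]*>', ' ') AS document
  FROM `yourTable`
),
words_in_documents AS (
  SELECT id,
    ARRAY(
      SELECT AS STRUCT word, COUNT(1) AS cnt
      FROM UNNEST(REGEXP_EXTRACT_ALL(document, r'[\w_]+')) AS word
      GROUP BY word
      HAVING NOT REGEXP_CONTAINS(word, r'^\d+$')
    ) AS words
  FROM removed_html
),
doc_totals AS (
  -- total token count per document (the TF denominator)
  SELECT id, SUM(w.cnt) AS total_words
  FROM words_in_documents, UNNEST(words) AS w
  GROUP BY id
),
doc_freq AS (
  -- number of documents containing each word (the IDF denominator)
  SELECT w.word, COUNT(DISTINCT id) AS docs_with_word
  FROM words_in_documents, UNNEST(words) AS w
  GROUP BY w.word
),
corpus AS (
  SELECT COUNT(*) AS total_docs
  FROM words_in_documents
)
SELECT
  d.id,
  w.word,
  w.cnt / t.total_words AS tf,
  LN(c.total_docs / f.docs_with_word) AS idf,
  w.cnt / t.total_words * LN(c.total_docs / f.docs_with_word) AS tf_idf
FROM words_in_documents d, UNNEST(d.words) AS w
JOIN doc_totals t ON t.id = d.id
JOIN doc_freq f ON f.word = w.word
CROSS JOIN corpus c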

You can test / play with the word-counting query using the dummy data from your question:

#standardSQL
WITH `yourTable` AS (
  SELECT 1 AS id, "foo bar, foo baz" AS document UNION ALL
  SELECT 2, "foo bar bar qux" UNION ALL
  SELECT 3, '''
<h5 id="last_value">LAST_VALUE</h5>
<pre class="codehilite"><code>LAST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])</code></pre>
  '''
),
removed_html AS (
  SELECT id, REGEXP_REPLACE(document, r'<[^>]*>', ' ') AS document
  FROM `yourTable`
),
words_in_documents AS (
  SELECT id, 
    ARRAY(
      SELECT AS STRUCT word, COUNT(1) AS cnt 
      FROM UNNEST(REGEXP_EXTRACT_ALL(document, r'[\w_]+')) AS word 
      GROUP BY word
      HAVING NOT REGEXP_CONTAINS(word, r'^\d+$')
    ) AS words
  FROM removed_html
),
words_in_corpus AS (
  SELECT word, SUM(cnt) AS cnt
  FROM words_in_documents, UNNEST(words) AS words
  GROUP BY word
)
SELECT * 
FROM words_in_corpus
ORDER BY cnt DESC
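
For reference, with that dummy data the corpus-level counts should come out as foo: 3, bar: 3, LAST_VALUE: 2, and 1 each for baz, qux, value_expression, RESPECT, IGNORE and NULLS (order among tied counts is arbitrary). Note that attribute values like "last_value" and "codehilite" don't appear at all, since the tag-stripping regex removes everything between < and >.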
