Select rows where array contains one of several values in BigQuery (ideally with dbplyr)

Question

I have a large set of tweets on BigQuery and now want to keep only those that contain at least one of a list of hashtags. The hashtags are stored in an array column (uploaded from a list column in R). How can I select rows whose array contains any one of several values, at any position in the array?

Below is the code that I would use for the analysis in R. Unsurprisingly, dbplyr cannot translate the purrr part. I am happy to learn to write the SQL myself, but have not yet found a good starting point. Thanks for any pointers.

PS: I have not yet uploaded the tweets to BigQuery; they currently live in 80 GB of RDS files. If a simple data transformation would make this easier, I could still apply it while uploading.

library(dplyr)

tweets_sample <- tibble::tribble(
  ~text, ~hashtags,
  "Hello", list("World", "You"),
  "Goodbye", list("Friend", "You"),
  "Not", list("interested")
)

hashtag_list <- c("World", "interested")

# keep tweets with at least one hashtag from hashtag_list
tweets_sample %>%
  filter(purrr::map_lgl(hashtags, ~ any(.x %in% hashtag_list)))

Recommended answer

The difficult part here is that your hashtags column is of type list or array. As per this question, dbplyr translation for more advanced data types such as arrays does not appear to be well established.

Two alternatives:

  1. Convert your hashtags to a character string and use text search (grep).

  2. Write a BigQuery query as a character string in R and attach it to an existing connection. Here is an example:

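db_connection <- DBI::dbConnect( ... ) # connect to database
remote_tbl <- dplyr::tbl(db_connection, from = "remote_table_name")

# build SQL query around the existing table; a WHERE clause that filters
# on the array column can be appended after the alias
sql_query <- glue::glue("SELECT *\n",
                        "FROM (\n",
                        "{dbplyr::sql_render(remote_tbl)}\n",
                        ") alias\n")

# attach the query to the existing connection as a new remote table
new_remote_table <- dplyr::tbl(db_connection, dbplyr::sql(sql_query))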
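
The query string above stops at the alias, so the filter itself still has to be written. A minimal sketch of one way to do it, assuming the hashtags column is an ARRAY<STRING> in BigQuery standard SQL: unnest the array inside an EXISTS subquery and compare each element against the list. The helper name build_array_filter and the inner query "SELECT * FROM tweets" are hypothetical and only serve the illustration.

```r
# Hypothetical helper: wraps an existing query and keeps rows whose array
# column shares at least one element with `values`.
build_array_filter <- function(inner_sql, values, column = "hashtags") {
  # Assumption: hashtags consist of word characters only. Anything else is
  # refused rather than escaped inside a SQL string literal.
  stopifnot(length(values) > 0, all(grepl("^[[:alnum:]_]+$", values)))
  in_list <- paste0("'", values, "'", collapse = ", ")
  paste0(
    "SELECT *\n",
    "FROM (\n", inner_sql, "\n) alias\n",
    "WHERE EXISTS (\n",
    "  SELECT 1 FROM UNNEST(", column, ") AS tag\n",
    "  WHERE tag IN (", in_list, ")\n",
    ")"
  )
}

hashtag_list <- c("World", "interested")
sql_query <- build_array_filter("SELECT * FROM tweets", hashtag_list)
cat(sql_query)
```

In the setting of the answer, the inner query would come from dbplyr::sql_render(remote_tbl), and the resulting string would be passed to dplyr::tbl(db_connection, dbplyr::sql(sql_query)) as before.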

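The first alternative (text search) can be prepared in R before uploading. The sketch below is an illustration rather than part of the original answer: it collapses each tweet's hashtags into one delimited string and matches it with a regular expression. It assumes no hashtag contains the delimiter "|" or regex metacharacters; the column name hashtags_chr is made up for the example.

```r
tweets_sample <- data.frame(
  text = c("Hello", "Goodbye", "Not"),
  stringsAsFactors = FALSE
)
tweets_sample$hashtags <- list(
  list("World", "You"), list("Friend", "You"), list("interested")
)

# Collapse each tweet's hashtags into one delimited string, e.g. "|World|You|".
tweets_sample$hashtags_chr <- vapply(
  tweets_sample$hashtags,
  function(tags) paste0("|", paste(unlist(tags), collapse = "|"), "|"),
  character(1)
)

hashtag_list <- c("World", "interested")

# The pattern matches any listed hashtag as a whole delimited token,
# so "World" does not match a longer tag such as "Worldwide".
pattern <- paste0("\\|(", paste(hashtag_list, collapse = "|"), ")\\|")

matched <- tweets_sample[grepl(pattern, tweets_sample$hashtags_chr), "text"]
print(matched)
```

Once the string column is uploaded, the same pattern could be used on the BigQuery side with a regular-expression function such as REGEXP_CONTAINS. Whether dbplyr translates grepl to that function for a given backend version is an assumption worth checking with show_query().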