如何在R中的任何匹配子句上保留JOIN? [英] How to LEFT JOIN on ANY of the matching clauses in R?

查看:43
本文介绍了如何在R中的任何匹配子句上保留JOIN?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

您能帮我吗?

我有一个数据框( df1 ),其中包含网站CMS中发布的所有文章的索引.在当前的 URL 中有一列,在发布后被更改的原始URL中有一列(列名 Origin ):

I have a dataframe (df1) that has index of all articles published in the website's CMS. There's a column for current URL and a column of original URLs in case they were changed after publication (column name Origin):

<身体>
URL 来源 ArticleID 作者类别费用
https://example.com/article1 https://example.com/article 001 AuthorName 政治 120美元
https://example.com/article2 https://example.com/article2 002 AuthorName 财务 68美元
URL Origin ArticleID Author Category Cost
https://example.com/article1 https://example.com/article 001 AuthorName Politics 120 USD
https://example.com/article2 https://example.com/article2 002 AuthorName Finance 68 USD

接下来,我有一个巨大的数据框( df2 ),其中包含一个时间范围内的网络分析导出.它有一个日期,URL和浏览量只有1列.

Next I have an huge dataframe (df2)with web analytics export for a timeframe. It has a date, just 1 column for URL and number of pageviews.

<身体>
PageviewDate URL 浏览量
2019-01-01 https://example.com/article 224544
2019-01-01 https://example.com/article1 656565

我该如何与第一个数据框保持连接,但在 URL = URL Origin = URL

How do I left join this with first dataframe but matching on either URL = URL OR Origin = URL

因此最终结果将如下所示:

So that end result would look like this:

<身体>
PageviewDate 浏览量 ArticleID 作者类别
2019-01-01 881109 001 AuthorName 政治

881109 是将与同一文章相关的 224544 656565 相加的结果

i.e 881109 is the result of adding up 224544 and 656565 that both related to the same article

我想我正在寻找的等效于SQL语法,例如:

I guess what I'm looking for is the equivalent of SQL syntax like:

LEFT JOIN ...`enter code here`
ON URL = URL
OR Origin = URL```

推荐答案

您可以获取长格式的数据帧1( df1 ),以便 Origin URL 在同一列中,然后与第二个数据帧( df2 )执行联接.

You could get dataframe 1 (df1) in long format so that both Origin and URL are in the same column and then perform the join with second dataframe (df2).

library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(cols = c(URL, Origin), values_to = 'URL') %>%
  inner_join(df2, by = 'URL') %>%
  select(-name)

#  ArticleID Author     Category name   URL                          PageviewDate Pageviews
#      <int> <chr>      <chr>    <chr>  <chr>                        <chr>            <int>
#1         1 AuthorName Politics URL    https://example.com/article1 2019-01-01      656565
#2         1 AuthorName Politics Origin https://example.com/article  2019-01-01      224544

数据

df1 <- structure(list(URL = c("https://example.com/article1", "https://example.com/article2"
), Origin = c("https://example.com/article", "https://example.com/article2"
), ArticleID = 1:2, Author = c("AuthorName", "AuthorName"), 
Category = c("Politics", "Finance")), class = "data.frame",row.names =c(NA, -2L))


df2 <- structure(list(PageviewDate = c("2019-01-01", "2019-01-01"), 
    URL = c("https://example.com/article", "https://example.com/article1"), 
Pageviews = c(224544L, 656565L)), class = "data.frame", row.names = c(NA, -2L))

这篇关于如何在R中的任何匹配子句上保留JOIN?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆