如何在R中的任何匹配子句上保留JOIN? [英] How to LEFT JOIN on ANY of the matching clauses in R?
问题描述
您能帮我吗?
我有一个数据框( df1
),其中包含网站CMS中发布的所有文章的索引.在当前的 URL
中有一列,在发布后被更改的原始URL中有一列(列名 Origin
):
I have a dataframe (df1
) that has index of all articles published in the website's CMS. There's a column for current URL
and a column of original URLs in case they were changed after publication (column name Origin
):
URL | 来源 | ArticleID | 作者 | 类别 | 费用 |
---|---|---|---|---|---|
https://example.com/article1 | https://example.com/article | 001 | AuthorName | 政治 | 120美元 |
https://example.com/article2 | https://example.com/article2 | 002 | AuthorName | 财务 | 68美元 |
URL | Origin | ArticleID | Author | Category | Cost |
---|---|---|---|---|---|
https://example.com/article1 | https://example.com/article | 001 | AuthorName | Politics | 120 USD |
https://example.com/article2 | https://example.com/article2 | 002 | AuthorName | Finance | 68 USD |
接下来,我有一个巨大的数据框( df2
),其中包含一个时间范围内的网络分析导出.它有一个日期,URL和浏览量只有1列.
Next I have an huge dataframe (df2
)with web analytics export for a timeframe. It has a date, just 1 column for URL and number of pageviews.
PageviewDate | URL | 浏览量 |
---|---|---|
2019-01-01 | https://example.com/article | 224544 |
2019-01-01 | https://example.com/article1 | 656565 |
我该如何与第一个数据框保持连接,但在 URL
= URL
或 Origin
= URL
How do I left join this with first dataframe but matching on either URL
= URL
OR Origin
= URL
因此最终结果将如下所示:
So that end result would look like this:
PageviewDate | 浏览量 | ArticleID | 作者 | 类别 |
---|---|---|---|---|
2019-01-01 | 881109 | 001 | AuthorName | 政治 |
即 881109
是将与同一文章相关的 224544
和 656565
相加的结果
i.e 881109
is the result of adding up 224544
and 656565
that both related to the same article
我想我正在寻找的等效于SQL语法,例如:
I guess what I'm looking for is the equivalent of SQL syntax like:
LEFT JOIN ...`enter code here`
ON URL = URL
OR Origin = URL```
推荐答案
您可以获取长格式的数据帧1( df1
),以便 Origin
和URL
在同一列中,然后与第二个数据帧( df2
)执行联接.
You could get dataframe 1 (df1
) in long format so that both Origin
and URL
are in the same column and then perform the join with second dataframe (df2
).
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = c(URL, Origin), values_to = 'URL') %>%
inner_join(df2, by = 'URL') %>%
select(-name)
# ArticleID Author Category name URL PageviewDate Pageviews
# <int> <chr> <chr> <chr> <chr> <chr> <int>
#1 1 AuthorName Politics URL https://example.com/article1 2019-01-01 656565
#2 1 AuthorName Politics Origin https://example.com/article 2019-01-01 224544
数据
df1 <- structure(list(URL = c("https://example.com/article1", "https://example.com/article2"
), Origin = c("https://example.com/article", "https://example.com/article2"
), ArticleID = 1:2, Author = c("AuthorName", "AuthorName"),
Category = c("Politics", "Finance")), class = "data.frame",row.names =c(NA, -2L))
df2 <- structure(list(PageviewDate = c("2019-01-01", "2019-01-01"),
URL = c("https://example.com/article", "https://example.com/article1"),
Pageviews = c(224544L, 656565L)), class = "data.frame", row.names = c(NA, -2L))
这篇关于如何在R中的任何匹配子句上保留JOIN?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!