Ordering a dataframe by its subsegments

Problem description

My team and I are dealing with many thousands of URLs that have similar segments. Some URLs have one segment ("seg", plural "segs") in a position of interest to us. Other similar URLs have a different seg in that position. We need to sort a dataframe consisting of URLs and the associated unique segs in the position of interest, showing the frequency of each unique seg.

Here is a simplified example:

 url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
 seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
 df <- data.frame(url,seg)

We are looking for the following:

 url freq seg
  1   3    a   in other words: url #1 appears three times, each with seg = "a",
  2   2    b   in other words: url #2 appears twice, each with seg = "b",
  3   3    c   in other words: url #3 appears three times with seg = "c",
  3   2    x                                  two times with seg = "x", and
  3   1    y                                  once with seg = "y"
  4   1    d   etc.

I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:

Create a placeholder dataframe with a single row of zeros and three columns (url, Freq, seg)

 result <- data.frame(url=0, Freq=0, seg=0)

Determine the unique URLs

 unique.df.url <- unique(df$url)

Loop over the dataframe

 for (xx in unique.df.url) {
   url.seg <- df[df$url == xx, ]  # rows for this URL (xx is the URL value itself, not an index)
   freq.df.url <- data.frame(table(url.seg))  # summarize the frequency distribution of the segs by url
   result <- rbind(result, freq.df.url)  # append a new data.frame onto the last one
 }

Eliminate rows in the dataframe where Frequency = 0

 result.freq <- result[which(result$Freq != 0), ]

Sort the dataframe by URL

 result.order <- result.freq[order(result.freq$url), ]

This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?

Recommended answer

In base R you can do this:

aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)

#   seg url freq
# 1   a   1    3
# 2   b   2    2
# 3   c   3    3
# 4   x   3    2
# 5   y   3    1
# 6   d   4    1

The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
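To make the "without changing your source table" point concrete, here is a small sketch using the sample data from the question. The name with_freq is only an illustrative choice and does not appear in the original answer.

```r
# Sample data from the question
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# Calling `$<-` as an ordinary function returns a modified copy of df;
# with_freq is a hypothetical name used only for this illustration
with_freq <- `$<-`(df, freq, 1)

print(names(with_freq))  # url, seg and the new freq column
print(names(df))         # still only url and seg: df itself is unchanged

# The copy is what aggregate() sums over, one 1 per original row
res <- aggregate(freq ~ seg + url, with_freq, sum)
print(res)
```

Because every row carries freq = 1, summing freq within each seg/url combination is the same as counting the rows in that combination.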

Another possibility:

subset(as.data.frame(table(df[2:1])),Freq!=0)
#    seg url Freq
# 1    a   1    3
# 8    b   2    2
# 15   c   3    3
# 17   x   3    2
# 18   y   3    1
# 22   d   4    1

Here I use [2:1] to switch the order of columns so table orders the results in the required way.
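The effect of the column order can be checked by building the table both ways. This is a sketch on the question's sample data; by_url and by_seg are illustrative names, not part of the original answer.

```r
# Sample data from the question
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# as.data.frame() on a table varies the FIRST dimension fastest,
# so the rows end up grouped by the LAST dimension.

# seg first, url second: rows grouped by url (the order asked for)
by_url <- subset(as.data.frame(table(df[2:1])), Freq != 0)

# url first, seg second: rows grouped by seg instead
by_seg <- subset(as.data.frame(table(df[1:2])), Freq != 0)

print(by_url)
print(by_seg)
```

Note that in both results url comes back as a factor rather than a number, since table() treats its inputs as categories; convert it with as.numeric(as.character(...)) if numeric URLs are needed downstream.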
