dplyr: maximum value in a group, excluding the value in each row?

Z时代
2024-01-10
Category: Q&A

I have a data frame that looks like this:

> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4)) 
> df 
Source: local data frame [6 x 2] 
    g x 
1 A 7 
2 A 3 
3 B 5 
4 B 9 
5 B 2 
6 C 4

I know how to add a column holding the maximum x value for each group g:

> df %>% group_by(g) %>% mutate(x_max = max(x)) 
Source: local data frame [6 x 3] 
Groups: g 
    g x x_max 
1 A 7  7 
2 A 3  7 
3 B 5  9 
4 B 9  9 
5 B 2  9 
6 C 4  4

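What I would like instead is, for each group g, the maximum x value excluding the x value of the current row.

For the given example, the desired output would look like this: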

Source: local data frame [6 x 3] 
Groups: g 
    g x x_max x_max_exclude 
1 A 7  7   3 
2 A 3  7   7 
3 B 5  9   9 
4 B 9  9   5 
5 B 2  9   9 
6 C 4  4  NA

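I thought I might be able to use row_number() to drop the current element and take the max of what remained, but I hit warning messages and got incorrect -Inf output: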

> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r])) 
Source: local data frame [6 x 5] 
Groups: g 
    g x x_max r x_max_exclude 
1 A 7  7 1   -Inf 
2 A 3  7 2   -Inf 
3 B 5  9 1   -Inf 
4 B 9  9 2   -Inf 
5 B 2  9 3   -Inf 
6 C 4  4 1   -Inf 
Warning messages: 
1: In max(c(4, 9, 2)[-1:3]) : 
    no non-missing arguments to max; returning -Inf 
2: In max(c(4, 9, 2)[-1:3]) : 
    no non-missing arguments to max; returning -Inf 
3: In max(c(4, 9, 2)[-1:3]) : 
    no non-missing arguments to max; returning -Inf

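What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number() doesn't work would also be much appreciated. Thanks for your help.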
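A likely explanation for the -Inf output, sketched in base R: inside a grouped mutate() each expression is evaluated once per group on whole columns, so r is the full vector of positions rather than a single row index, and x[-r] drops every element. The vector x below is group B from the question; seq_along() stands in for row_number() here, which is an assumption made to keep the example free of package dependencies.

```r
# One group's x values, taken from group B in the question
x <- c(5, 9, 2)

# Stand-in for row_number() inside a grouped mutate(): all positions at once
r <- seq_along(x)

# Negative indexing with every position removes every element
remaining <- x[-r]

# max() of an empty vector warns and returns -Inf
all_dropped <- suppressWarnings(max(remaining))

# Dropping one position at a time gives the intended per-row result
per_row <- vapply(seq_along(x), function(i) max(x[-i]), numeric(1))
print(per_row)
```

The last line is the per-row exclusion that the row_number() attempt was aiming for; the answers below reach the same result without looping over positions.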

Answer:

You could try:

df %>% 
    group_by(g) %>% 
    arrange(desc(x)) %>% 
    mutate(max = ifelse(x == max(x), x[2], max(x)))

which gives:

#Source: local data frame [6 x 3]
#Groups: g
#
#  g x max
#1 A 7   3
#2 A 3   7
#3 B 9   5
#4 B 5   9
#5 B 2   9
#6 C 4  NA

Benchmark

I ran the solutions posted so far through a benchmark:

library(dplyr)
library(data.table)
library(microbenchmark)

# 10e5 (one million) rows, 26 groups
df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                 x = sample(1:10, 10e5, replace = TRUE))

mbm <- microbenchmark(
    steven = df %>%
        group_by(g) %>%
        arrange(desc(x)) %>%
        mutate(max = ifelse(x == max(x), x[2], max(x))),
    eric = df %>%
        group_by(g) %>%
        mutate(x_max = max(x),
               x_max2 = sort(x, decreasing = TRUE)[2],
               x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
        select(-x_max2),
    arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
    times = 50
)

@Arun's data.table solution is the fastest:

# Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval cld
#  steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50   b
#    eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
#    arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50   a

Answer:

Interesting question. Here's one way using data.table:

require(data.table) 
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]

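The idea is to order by column x and, on those ordered indices, group by g. Because the indices are ordered, the maximum for the first .N-1 rows of each group is the value at row .N. For row .N itself, it is the value at row .N-1.

.N is a special variable that holds the number of observations in each group.

I'll leave it to you and/or the dplyr experts to translate this (or to answer with another approach).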
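For readers who want to trace the ordering idea without data.table, here is a minimal base R sketch of the same logic. The helper name max_excl_sorted is hypothetical, and returning NA for single-row groups is an assumption chosen to match the desired output in the question; it is not taken from the data.table code above.

```r
# Hypothetical helper: order the group once, then fill by position
max_excl_sorted <- function(v) {
  n <- length(v)
  if (n < 2) return(rep(NA_real_, n))   # assumption: nothing left to compare
  o <- order(v)                         # positions sorted by increasing value
  out <- numeric(n)
  out[o[-n]] <- v[o[n]]                 # all rows but the largest get the group max
  out[o[n]]  <- v[o[n - 1]]             # the largest row gets the runner-up
  out
}

df <- data.frame(g = c("A", "A", "B", "B", "B", "C"),
                 x = c(7, 3, 5, 9, 2, 4))

# ave() applies the helper within each level of g and keeps the row order
df$x_max_exclude <- ave(df$x, df$g, FUN = max_excl_sorted)
print(df)
```

Sorting once per group keeps the work close to O(n log n) per group, whereas recomputing max() for every excluded row is quadratic in the group size.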

Answer:

This is the best I've come up with so far. Not sure whether there is a better way.

df %>% 
    group_by(g) %>% 
    mutate(x_max = max(x), 
     x_max2 = sort(x, decreasing = TRUE)[2], 
     x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
    select(-x_max2)
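The per-group logic of this answer can be checked on plain vectors. The sketch below wraps it in a hypothetical helper, second_max_exclude, to show how ties and single-row groups behave; the test vectors other than group B are made up for illustration.

```r
# Hypothetical wrapper around the max / second-max logic from the answer
second_max_exclude <- function(x) {
  x_max  <- max(x)
  x_max2 <- sort(x, decreasing = TRUE)[2]  # NA when the group has one row
  ifelse(x == x_max, x_max2, x_max)
}

group_b <- second_max_exclude(c(5, 9, 2))  # group B from the question
tied    <- second_max_exclude(c(9, 9, 5))  # two rows share the maximum
single  <- second_max_exclude(4)           # single-row group, like group C

print(group_b)
print(tied)
print(single)
```

With a tied maximum, each tied row still sees the other copy of the maximum, which is the correct "max excluding this row". Indexing past the end of a length-one vector yields NA, which is where the NA for group C comes from.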

Answer:

Another way, using a function: we write a function called max_exclude that performs the operation you describe.

df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x)) 
Source: local data frame [6 x 3] 
Groups: g 
    g x x_max_exclude 
1 A 7    3 
2 A 3    7 
3 B 5    9 
4 B 9    5 
5 B 2    9 
6 C 4   NA

The function is defined as follows:

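max_exclude <- function(v) { 
    res <- numeric(length(v)) 
    for (i in seq_along(v)) { 
        # max() of an empty vector warns and returns -Inf (single-row groups) 
        res[i] <- suppressWarnings(max(v[-i])) 
    } 
    # replace the -Inf produced for single-row groups with NA 
    res <- ifelse(!is.finite(res), NA, res) 
    as.numeric(res) 
}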

It works with base R too:

df$x_max_exclude <- with(df, ave(x, g, FUN=max_exclude)) 
Source: local data frame [6 x 3] 
    g x x_max_exclude 
1 A 7    3 
2 A 3    7 
3 B 5    9 
4 B 9    5 
5 B 2    9 
6 C 4   NA

Benchmark

A lesson here, kids: beware of for loops!

big.df <- data.frame(g = rep(LETTERS[1:4], each = 1e3), x = sample(10, 4e3, replace = TRUE)) 
# Note: the Eric and Arun entries run on df rather than big.df, as posted 
microbenchmark(
    plafort_dplyr = big.df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x)), 
    plafort_ave = big.df$x_max_exclude <- with(big.df, ave(x, g, FUN = max_exclude)), 
    StevenB = (big.df %>% 
        group_by(g) %>% 
        mutate(max = ifelse(row_number(desc(x)) == 1, x[row_number(desc(x)) == 2], max(x))) 
    ), 
    Eric = df %>% 
        group_by(g) %>% 
        mutate(x_max = max(x), 
               x_max2 = sort(x, decreasing = TRUE)[2], 
               x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
        select(-x_max2), 
    Arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g] 
) 
Unit: milliseconds 
          expr       min        lq      mean    median        uq        max neval 
 plafort_dplyr 75.219042 85.207442 89.247409 88.203225 90.627663 179.553166   100 
   plafort_ave 75.907798 84.604180 87.136122 86.961251 89.431884 104.884294   100 
       StevenB  4.436973  4.699226  5.207548  4.931484  5.364242  11.893306   100 
          Eric  7.233057  8.034092  8.921904  8.414720  9.060488  15.946281   100 
          Arun  1.789097  2.037235  2.410915  2.226988  2.423638   9.326272   100

That is the full content of "dplyr: maximum value in a group, excluding the value in each row?". Source link: utcz.com/qa/257098.html
