dplyr: mutate/replace several columns on a subset of rows

Question 1

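I'm trying out a dplyr-based workflow (rather than relying mostly on data.table, which I'm used to), and I've run into a problem for which I can't find an equivalent dplyr solution. I commonly face the scenario where I need to conditionally update/replace several columns based on a single condition. Here is some example code, along with my data.table solution: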

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

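Is there a simple dplyr solution to this same problem? I'd like to avoid ifelse because I don't want to type the condition multiple times. This is a simplified example, but sometimes there are many assignments based on a single condition.

Thanks in advance for the help!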

Question 2

These solutions (1) maintain the pipeline, (2) do not overwrite the input, and (3) only require that the condition be specified once:

1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
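To make the behaviour concrete, here is a minimal sketch on a hypothetical four-row data frame (`toy` is made up for illustration and is not the DF from Note 1). The function is repeated so the block runs on its own.

```r
library(dplyr)

# mutate_cond as defined above, repeated so this sketch is self-contained
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical toy data (an assumption for illustration only)
toy <- data.frame(measure = c("cfl", "exit", "led", "exit"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  cf = c(0.1, 0.2, 0.3, 0.4),
                  stringsAsFactors = FALSE)

out <- toy %>% mutate_cond(measure == "exit", qty.exit = qty, cf = 0)

out$qty.exit  # only the "exit" rows take their qty value
out$cf        # only the "exit" rows are set to 0
toy$cf        # the input itself is left untouched
```

Only rows 2 and 4 change, and `toy` keeps its original values because the function works on a copy.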

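1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate, but it is only used within group_by (as in the example below) and only operates on the last group rather than on every group. Note that TRUE > FALSE, so if group_by specifies a condition, mutate_last will only operate on the rows satisfying that condition.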

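mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  # Row positions of the last group. The "indices" attribute is 0-based
  # (hence the + 1) and is only present on grouped data frames created by
  # older dplyr releases (before 0.8).
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}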


DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)

2) factor out condition Factor out the condition by making it an extra column that is removed later. Then use ifelse, replace, or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

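3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not for data tables unless we convert them; this may represent a bug in dplyr, see dplyr issue 1579). Because of the update, it may look as if this code undesirably modifies its input, but in fact the update acts on a copy of the input in the temporarily generated database and not on the actual input.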

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

4) row_case_when Also check out row_case_when, defined in "Returning a tibble: how to vectorize with case_when?". It uses a syntax similar to case_when but applies to rows.

library(dplyr)

DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF:

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Note 2: The problem of how to easily specify an update to a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573, with 631 being the main thread and 1573 being a review of the answers here.

Question 3

You can do this with magrittr's two-way pipe %<>%:

library(dplyr)
library(magrittr)

dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                    cf = 0,  
                                    delta.watts = 13)

This reduces the amount of typing, but it is still slower than data.table.

Question 4

Here is a solution I like:

mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

It lets you write things like:

mtcars %>% mutate_when(
  mpg > 22,    list(cyl = 100),
  disp == 160, list(cyl = 200)
)

This is quite readable, although it may not be as performant as it could be.
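As a rough illustration of how the condition/mutation pairs are processed, the sketch below runs mutate_when on a hypothetical toy data frame (the data is invented for this example). Pairs are applied in order, so a later pair can overwrite what an earlier one set, and a mutation may refer to other columns of the matching rows.

```r
# mutate_when as defined above, repeated so this sketch is self-contained
mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

# Hypothetical toy data (an assumption for illustration only)
toy <- data.frame(measure = c("cfl", "exit", "led", "exit"),
                  qty = c(5, 7, 9, 11),
                  qty.exit = 0,
                  cf = c(0.1, 0.2, 0.3, 0.4),
                  stringsAsFactors = FALSE)

out <- mutate_when(toy,
  measure == "exit", list(qty.exit = qty, cf = 0),  # first pair
  qty > 10,          list(cf = 1)                   # applied afterwards
)

out$qty.exit  # "exit" rows take their own qty
out$cf        # row 4 matches both pairs, so the second one wins
```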

Question 5

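As eipi10 shows above, there is no simple way to do a subset replacement in dplyr, because DT uses pass-by-reference semantics whereas dplyr uses pass-by-value. dplyr requires ifelse() on the whole vector, whereas DT subsets the rows and updates them by reference (returning the whole DT). So, for this exercise, DT will be substantially faster.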
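The difference in semantics can be seen in a small sketch (the two-row `DT` and `DF` below are hypothetical toy data): `:=` changes the data.table in place without any assignment, while mutate() returns a modified copy and leaves its input alone.

```r
library(data.table)
library(dplyr)

# Hypothetical toy data (an assumption for illustration only)
DT <- data.table(measure = c("cfl", "exit"), cf = c(0.5, 0.7))
DF <- data.frame(measure = c("cfl", "exit"), cf = c(0.5, 0.7))

# data.table: := updates only the selected rows, by reference
DT[measure == "exit", cf := 0]
DT$cf    # the second value is now 0, although DT was never reassigned

# dplyr: mutate() builds a new data frame; ifelse() scans the whole column
DF2 <- DF %>% mutate(cf = ifelse(measure == "exit", 0, cf))
DF$cf    # unchanged
DF2$cf   # the second value is 0 in the copy
```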

Alternatively, you could subset first, then update, and finally recombine:

dt.sub <- dt[dt$measure == "exit",] %>%
  mutate(qty.exit= qty, cf= 0, delta.watts= 13)

dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])

But DT is going to be substantially faster (edited to use eipi10's new answer):

library(data.table)
library(dplyr)
library(magrittr)  # provides %<>%
library(microbenchmark)
microbenchmark(dt= {dt <- dt[measure == 'exit', 
                            `:=`(qty.exit = qty,
                                 cf = 0,
                                 delta.watts = 13)]},
               eipi10= {dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                cf = 0,  
                                delta.watts = 13)},
               alex= {dt.sub <- dt[dt$measure == "exit",] %>%
                 mutate(qty.exit= qty, cf= 0, delta.watts= 13)

               dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])})


Unit: microseconds
   expr      min        lq      mean   median       uq      max neval cld
     dt  591.480  672.2565  747.0771  743.341  780.973 1837.539   100  a 
 eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509   100   b
   alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427   100   b

Question 6

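I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it would be handy if it also handled new variables. So, below there are two additions:

Unrelated: the second-to-last line is made a bit more dplyr by using filter().

Three new lines at the beginning get the variable names for use in mutate() and initialize any new variables in the data frame before mutate() runs. New variables are initialized for the remainder of the data.frame using new_init, which defaults to missing (NA).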

library(dplyr)
library(magrittr)  # provides %<>%

mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}

Here are some examples using the iris data:

Change Petal.Length to 88 where Species == "setosa". This works in the original function as well as in this new version.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)

Same as above, but also create a new variable x (NA in rows not included in the condition). This was not possible before.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)

Same as above, but rows not included in the condition are set to FALSE for x.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)

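This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created, and the excluded rows are initialized with different values (x initialized as FALSE, y as NA).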

iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
                  x = TRUE, y = Sepal.Length ^ 2,
                  new_init = list(FALSE, NA))

Question 7

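mutate_cond is a great function, but it throws an error if there is an NA in the column(s) used to build the condition. I feel that a conditional mutate should simply leave such rows alone. This matches the behavior of filter(), which returns rows where the condition is TRUE but omits both the FALSE rows and the NA rows.

With this small change the function works like a charm: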

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
    condition <- eval(substitute(condition), .data, envir)
    condition[is.na(condition)] = FALSE
    .data[condition, ] <- .data[condition, ] %>% mutate(...)
    .data
}
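A quick check of the patched version, sketched on hypothetical toy data with a missing value in the condition column: the NA row is treated like a FALSE row and is left as it was.

```r
library(dplyr)

# Patched mutate_cond from above, repeated so this sketch is self-contained
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  condition[is.na(condition)] <- FALSE  # NA in the condition means "do not touch"
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical toy data with an NA in `measure` (illustration only)
toy <- data.frame(measure = c("exit", NA, "cfl"),
                  cf = c(0.1, 0.2, 0.3),
                  stringsAsFactors = FALSE)

out <- toy %>% mutate_cond(measure == "exit", cf = 0)
out$cf  # only the first row is set to 0; the NA row keeps 0.2
```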

Question 8

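I don't actually see any changes to dplyr that would make this much easier. case_when is great when there are multiple different conditions and outcomes for one column, but it doesn't help when you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column, but it doesn't help with doing so in multiple columns at once. Finally, mutate_at and friends only apply conditions to the column names, not to the rows of the data frame. You could potentially write a function for mutate_at that would do it, but I can't figure out how you would make it behave differently for different columns.

That said, here is how I would approach it using nest from tidyr and map from purrr.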

library(data.table)
library(dplyr)
library(tidyr)
library(purrr)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

dt2 <- dt %>% 
  nest(-measure) %>% 
  mutate(data = if_else(
    measure == "exit", 
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()

Question 9

One concise solution is to do the mutation on the filtered subset and then add back the non-exit rows of the table:

library(dplyr)

dt %>% 
    filter(measure == 'exit') %>%
    mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
    rbind(dt %>% filter(measure != 'exit'))

Question 10

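With the creation of rlang, a slightly modified version of Grothendieck's 1a example is possible, eliminating the need for the envir argument, since enquo() automatically captures the environment in which .p was created.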

mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}

dt %>% mutate_rows(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)
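To illustrate why no envir argument is needed, the sketch below calls mutate_rows from inside a wrapper whose condition refers to a local variable. The wrapper `zero_cf`, its argument `target`, and the toy data are hypothetical names introduced only for this example.

```r
library(dplyr)

# mutate_rows as defined above, repeated so this sketch is self-contained
mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}

# Hypothetical wrapper: `target` exists only inside this function, yet the
# quosure captured by enquo() remembers where to look it up
zero_cf <- function(df, target) {
  df %>% mutate_rows(measure == target, cf = 0)
}

# Hypothetical toy data (illustration only)
toy <- data.frame(measure = c("cfl", "exit", "led"),
                  cf = c(0.1, 0.2, 0.3),
                  stringsAsFactors = FALSE)

out <- zero_cf(toy, "exit")
out$cf  # only the "exit" row is set to 0
```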

Question 11

You can split the dataset and do a regular mutate call on the TRUE part.

dplyr 0.8 features the function group_split, which splits by groups (and the groups can be defined directly in the call), so we'll use it here, but base::split works as well.

library(tidyverse)
df1 %>%
  group_split(measure == "exit", keep=FALSE) %>% # or `split(.$measure == "exit")`
  modify_at(2,~mutate(.,qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()

#    site space measure qty qty.exit delta.watts          cf
# 1     1     4     led   1        0        73.5 0.246240409
# 2     2     3     cfl  25        0        56.5 0.360315879
# 3     5     4     cfl   3        0        38.5 0.279966850
# 4     5     3  linear  19        0        40.5 0.281439486
# 5     2     3  linear  18        0        82.5 0.007898384
# 6     5     1  linear  29        0        33.5 0.392412729
# 7     5     3  linear   6        0        46.5 0.970848817
# 8     4     1     led  10        0        89.5 0.404447182
# 9     4     1     led  18        0        96.5 0.115594622
# 10    6     3  linear  18        0        15.5 0.017919745
# 11    4     3     led  22        0        54.5 0.901829577
# 12    3     3     led  17        0        79.5 0.063949974
# 13    1     3     led  16        0        86.5 0.551321441
# 14    6     4     cfl   5        0        65.5 0.256845013
# 15    4     2     led  12        0        29.5 0.340603733
# 16    5     3  linear  27        0        63.5 0.895166931
# 17    1     4     led   0        0        47.5 0.173088800
# 18    5     3  linear  20        0        89.5 0.438504370
# 19    2     4     cfl  18        0        45.5 0.031725246
# 20    2     3     led  24        0        94.5 0.456653397
# 21    3     3     cfl  24        0        73.5 0.161274319
# 22    5     3     led   9        0        62.5 0.252212124
# 23    5     1     led  15        0        40.5 0.115608182
# 24    3     3     cfl   3        0        89.5 0.066147321
# 25    6     4     cfl   2        0        35.5 0.007888337
# 26    5     1  linear   7        0        51.5 0.835458916
# 27    2     3  linear  28        0        36.5 0.691483644
# 28    5     4     led   6        0        43.5 0.604847889
# 29    6     1  linear  12        0        59.5 0.918838163
# 30    3     3  linear   7        0        73.5 0.471644760
# 31    4     2     led   5        0        34.5 0.972078100
# 32    1     3     cfl  17        0        80.5 0.457241602
# 33    5     4  linear   3        0        16.5 0.492500255
# 34    3     2     cfl  12        0        44.5 0.804236607
# 35    2     2     cfl  21        0        50.5 0.845094268
# 36    3     2  linear  10        0        23.5 0.637194873
# 37    4     3     led   6        0        69.5 0.161431896
# 38    3     2    exit  19       19        13.0 0.000000000
# 39    6     3    exit   7        7        13.0 0.000000000
# 40    6     2    exit  20       20        13.0 0.000000000
# 41    3     2    exit   1        1        13.0 0.000000000
# 42    2     4    exit  19       19        13.0 0.000000000
# 43    3     1    exit  24       24        13.0 0.000000000
# 44    3     3    exit  16       16        13.0 0.000000000
# 45    5     3    exit   9        9        13.0 0.000000000
# 46    2     3    exit   6        6        13.0 0.000000000
# 47    4     1    exit   1        1        13.0 0.000000000
# 48    1     1    exit  14       14        13.0 0.000000000
# 49    6     3    exit   7        7        13.0 0.000000000
# 50    2     4    exit   3        3        13.0 0.000000000

If row order matters, use tibble::rowid_to_column first, then dplyr::arrange on rowid, and select it out at the end.

data
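The row-order recipe might look roughly like the following sketch. It uses base::split rather than group_split, and the toy data frame is hypothetical; treat it as one possible arrangement of the steps, not the only one.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (illustration only)
toy <- data.frame(measure = c("exit", "cfl", "exit", "led"),
                  qty = c(3, 5, 7, 9),
                  qty.exit = 0,
                  stringsAsFactors = FALSE)

# 1. Remember the original position of every row, then split on the condition
parts <- toy %>%
  rowid_to_column() %>%            # adds an integer column named `rowid`
  split(.$measure == "exit")       # list with elements "FALSE" and "TRUE"

# 2. Mutate only the TRUE part
parts[["TRUE"]] <- parts[["TRUE"]] %>% mutate(qty.exit = qty)

# 3. Recombine, restore the original order, and drop the helper column
out <- bind_rows(parts) %>%
  arrange(rowid) %>%
  select(-rowid)

out$measure   # same order as in toy
out$qty.exit  # "exit" rows carry their qty, the others stay 0
```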

df1 <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50),
                 stringsAsFactors = F)

Question 12

I don't think this answer has been mentioned before. It runs almost as fast as the 'default' data.table solution.

Use base::replace()

df %>% mutate( qty.exit = replace( qty.exit, measure == 'exit', qty[ measure == 'exit'] ),
               cf = replace( cf, measure == 'exit', 0 ),
               delta.watts = replace( delta.watts, measure == 'exit', 13 ) )

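replace recycles the replacement value, so when you want the values of column qty entered into column qty.exit, you have to subset qty as well. Hence the qty[ measure == 'exit'] in the first replacement.

Now, you will probably not want to retype measure == 'exit' all the time, so you can create an index vector containing that selection and use it in the calls above.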
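A small base R sketch of what goes wrong without the subsetting (the vectors are hypothetical toy data): replace() fills the selected positions with the first elements of the replacement vector, not with the elements at the matching positions.

```r
# Hypothetical toy vectors (illustration only)
qty      <- c(10, 20, 30, 40)
measure  <- c("cfl", "exit", "led", "exit")
qty.exit <- c(0, 0, 0, 0)
is.exit  <- measure == "exit"

# Passing all of qty: positions 2 and 4 receive qty[1] and qty[2].
# R also warns about the length mismatch, which is silenced here.
wrong <- suppressWarnings(replace(qty.exit, is.exit, qty))

# Passing the matching subset: positions 2 and 4 receive qty[2] and qty[4].
right <- replace(qty.exit, is.exit, qty[is.exit])

wrong  # 0 10 0 20
right  # 0 20 0 40
```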

#build an index-vector matching the condition
index.v <- which( df$measure == 'exit' )

df %>% mutate( qty.exit = replace( qty.exit, index.v, qty[ index.v] ),
               cf = replace( cf, index.v, 0 ),
               delta.watts = replace( delta.watts, index.v, 13 ) )

benchmarks

# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
# data.table   1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
# wimpel       1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100

Question 13

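Instead of the usual dplyr syntax, you can use within from base:

# within() evaluates a single expression, so the assignments are wrapped in braces
dt %>% within({
  qty.exit[measure == 'exit'] <- qty[measure == 'exit']
  delta.watts[measure == 'exit'] <- 13
})

It seems to integrate well with the pipe, and you can do pretty much anything you want inside it.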