Filter rows which contain a certain string

188

I have to filter a data frame, using as the criterion those rows that contain the string RTB.

I'm using dplyr.

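# The question originally used `%.%`, dplyr's early chain operator,
# which has since been removed; `%>%` is the current equivalent.
d.del <- df %>%
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))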

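I've seen that I can use the function filter in dplyr, but I don't know exactly how to tell it to check the content of a string.

In particular, I want to check the content of the column TrackingPixel. If the string contains the label RTB, I want to remove the row from the result.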

r filter dplyr

— Gianluca

27

I've never used dplyr, but looking at the help in ?dplyr::filter I'd suggest something like filter(df, !grepl("RTB",TrackingPixel)) maybe?

— thelatemail

1

This is actually close to what I want to achieve. The only problem is that it keeps the strings that include the label RTB and drops the others.

— Gianluca

I just put in a stealth edit: the match is now inverted by adding the ! in front of grepl. Try again.

— thelatemail
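
A minimal sketch of the suggestion above, combined with the pipeline from the question. The question did not include any data, so the data frame below and its values are made up purely for illustration.

```r
library(dplyr)

# Hypothetical data: these values are assumptions, not the asker's data.
df <- data.frame(
  TrackingPixel = c("RTB_banner", "direct_mail", "search_RTB", "newsletter"),
  Revenue       = c(10.5, 20.2, 5.0, 7.9),
  stringsAsFactors = FALSE
)

# grepl() is TRUE where the pattern occurs; `!` inverts it, so rows
# whose TrackingPixel contains "RTB" are dropped.
no_rtb <- filter(df, !grepl("RTB", TrackingPixel))

# The summary from the question, applied to the filtered rows.
d.del <- no_rtb %>%
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))
```

Filtering before group_by means the RTB rows never enter the summary at all.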

4

Or use the invert and value arguments of grep. Regular expressions make working with text a thousand times easier.

— Rich Scriven
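
The grep route mentioned above can be sketched in base R as follows. The vector and data frame are made-up illustration data, not taken from the question.

```r
# Hypothetical values, for illustration only.
pixels <- c("RTB_banner", "direct_mail", "search_RTB", "newsletter")

# value = TRUE returns the elements themselves rather than their indices;
# invert = TRUE selects the elements that do NOT match the pattern.
kept <- grep("RTB", pixels, invert = TRUE, value = TRUE)

# Without value = TRUE, grep() returns indices, which can subset rows.
df <- data.frame(TrackingPixel = pixels, Revenue = c(10.5, 20.2, 5.0, 7.9))
df_no_rtb <- df[grep("RTB", df$TrackingPixel, invert = TRUE), ]
```

Using invert = TRUE is safer than negative indexing with -grep(...): when nothing matches, the negative-index form selects zero rows, whereas invert = TRUE returns every index.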

4

@thelatemail grepl doesn't work on postgres for me. Is this for MySQL?

— Statwonk

255

The answer to the question was already posted by @latemail in the comments above. You can use regular expressions in the second and subsequent arguments of filter, like this:

dplyr::filter(df, !grepl("RTB",TrackingPixel))

Since you have not provided the original data, I will add a toy example using the mtcars data set. Imagine you are only interested in cars produced by Mazda or Toyota.

mtcars$type <- rownames(mtcars)
dplyr::filter(mtcars, grepl('Toyota|Mazda', type))

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter command looks like this:

dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))

— Alex

What if the column name contains a space, like Tracking Pixel?

— MySchizoBuddy

3

Make sure you are using the filter function from the dplyr package, not the one from the stats package.

— JHowIX

2

@MySchizoBuddy: If the column name contains white space, you can select the variable by using backticks. Modifying the example above: mtcars$`my type` <- rownames(mtcars), followed by mtcars %>% filter(grepl('Toyota|Mazda', `my type`))

— alex23lemm

13

Note that this doesn't work when the object is a tbl_sql, as grepl doesn't translate to SQL.

— David LeBauer
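
One possible workaround for database-backed tables, sketched under assumptions: dbplyr passes infix operators it does not recognise, such as %like%, through to SQL, so a SQL LIKE pattern can stand in for the regular expression. The in-memory SQLite table below is hypothetical, and memdb_frame() needs the RSQLite package installed; other backends may behave differently.

```r
library(dplyr)
library(dbplyr)  # memdb_frame() additionally requires RSQLite

# Hypothetical remote table in an in-memory SQLite database.
remote <- memdb_frame(
  TrackingPixel = c("RTB_banner", "direct_mail", "search_RTB", "newsletter"),
  Revenue       = c(10.5, 20.2, 5.0, 7.9)
)

# %like% is not an R function; dbplyr translates it to SQL `LIKE`.
# The pattern therefore uses SQL wildcards (%), not regex syntax.
no_rtb <- remote %>%
  filter(!(TrackingPixel %like% "%RTB%")) %>%
  collect()
```

Calling show_query() on the pipeline before collect() prints the generated SQL, which is a quick way to check how a given backend translates the condition.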

Option 1 is to make sure that dplyr is loaded last. Option 2 is to use the prefixed form, dplyr::filter.

— userJT
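
A short sketch of why the namespace matters: stats::filter() is a linear filter for time series, unrelated to subsetting rows, so calling the wrong one tends to fail in confusing ways. The example uses mtcars and only illustrates the explicit-prefix option.

```r
library(dplyr)

# stats::filter() applies a linear (moving-average style) filter
# to a numeric series and returns a time-series object.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

# With the explicit prefix, this is always the dplyr verb,
# regardless of the order in which packages were attached.
toyotas <- dplyr::filter(mtcars, grepl("Toyota", rownames(mtcars)))
```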

157

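Solution

It is possible to use str_detect from the stringr package, which is included in the tidyverse. str_detect returns TRUE or FALSE depending on whether each element of the given vector contains a specific string. You can then filter on this boolean value. See Introduction to stringr for details about the stringr package.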

library(tidyverse)
# ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
# ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
# ✔ tibble  1.4.2     ✔ dplyr   0.7.4
# ✔ tidyr   0.7.2     ✔ stringr 1.2.0
# ✔ readr   1.1.1     ✔ forcats 0.3.0
# ─ Conflicts ───────────────────── tidyverse_conflicts() ─
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()

mtcars$type <- rownames(mtcars)
mtcars %>%
  filter(str_detect(type, 'Toyota|Mazda'))
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
# 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
# 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
# 3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
# 4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona
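
To exclude matching rows, as the question asks, the same call can be negated. This is a minimal sketch on made-up data; the negate argument assumes stringr 1.4.0 or later, and fixed() is used here to treat "RTB" as a literal string instead of a regular expression.

```r
library(dplyr)
library(stringr)

# Hypothetical data, for illustration only.
df <- data.frame(
  TrackingPixel = c("RTB_banner", "direct_mail", "search_RTB", "newsletter"),
  Revenue       = c(10.5, 20.2, 5.0, 7.9),
  stringsAsFactors = FALSE
)

# Option 1: invert the logical vector with `!`.
no_rtb <- df %>% filter(!str_detect(TrackingPixel, "RTB"))

# Option 2: let str_detect() do the inversion (assumes stringr >= 1.4.0).
no_rtb2 <- df %>% filter(str_detect(TrackingPixel, fixed("RTB"), negate = TRUE))
```

One difference from grepl worth knowing: str_detect returns NA for missing values, and filter drops rows where the condition is NA, whereas grepl returns FALSE for NA, so !grepl(...) keeps those rows.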

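The good things about stringr

We should prefer stringr::str_detect() over base::grepl(), for the following reasons:

The functions provided by the stringr package start with the prefix str_, which makes the code easier to read.
The first argument of stringr functions is always the object to operate on (the string or character vector), followed by the parameters. (Thank you Paolo)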

object <- "stringr"
# The functions with the same prefix `str_`.
# The first argument is an object.
stringr::str_count(object) # -> 7
stringr::str_sub(object, 1, 3) # -> "str"
stringr::str_detect(object, "str") # -> TRUE
stringr::str_replace(object, "str", "") # -> "ingr"
# The function names without common points.
# The position of the argument of the object also does not match.
base::nchar(object) # -> 7
base::substr(object, 1, 3) # -> "str"
base::grepl("str", object) # -> TRUE
base::sub("str", "", object) # -> "ingr"

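Benchmark

The results of the benchmark test are as follows. On this large data frame (about 7 million rows), str_detect was roughly 1.5 times faster than grepl (10.9 s versus 16.5 s elapsed over 10 replications).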

library(rbenchmark)
library(tidyverse)

# The data. Data expo 09. ASA Statistics Computing and Graphics 
# http://stat-computing.org/dataexpo/2009/the-data.html
df <- read_csv("Downloads/2008.csv")
print(dim(df))
# [1] 7009728      29

benchmark(
  "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},
  "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 2      grepl           10  16.480    1.513    16.195    0.248
# 1 str_detect           10  10.891    1.000     9.594    1.281
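
The benchmark above depends on a downloaded file. A self-contained variant on synthetic data could look like the sketch below; the airport codes and vector size are arbitrary assumptions, and timings will vary with the machine, the R version, and the stringr version, so no result is shown.

```r
library(stringr)

set.seed(1)
# Synthetic stand-in for the Dest column (values chosen arbitrarily).
codes <- c("MCO", "BWI", "JFK", "LAX", "ORD", "SFO")
dest  <- sample(codes, 1e5, replace = TRUE)

# Time each approach once; use more replications for stable numbers.
t_grepl <- system.time(hit_grepl <- grepl("MCO|BWI", dest))
t_str   <- system.time(hit_str   <- str_detect(dest, "MCO|BWI"))

# Both approaches must agree before the timings are worth comparing.
stopifnot(identical(hit_grepl, hit_str))

print(rbind(grepl = t_grepl, str_detect = t_str)[, "elapsed"])
```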

— Keiku

1

Why is stringr a better option than grep?

— CameronNemo

2

@CameronNemo The functions provided by the stringr package start with the prefix str_, which makes the code easier to read. In recent, modern R code, using stringr is recommended.

— Keiku

3

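I think this is very much a matter of personal preference, and I agree with @CameronNemo that base R is as good as stringr. If you could provide some 'hard facts' such as benchmarks, instead of only stating that it is "recommended" (who recommends it?), this would be highly appreciated. Thanks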

— Tjebo

2

Another reason is the consistency of the tidyverse framework: the first argument of a function is always the data.frame (or the value), followed by the parameters.

— Paolo

22

This is similar to the other answers, but uses the preferred stringr::str_detect together with rownames_to_column (from the tibble package, attached with the tidyverse).

library(tidyverse)

mtcars %>% 
  rownames_to_column("type") %>% 
  filter(stringr::str_detect(type, 'Toyota|Mazda') )

#>             type  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1      Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 4  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Created on 2018-06-26 by the reprex package (v0.2.0).

— Nettle

1

str_detect is in the stringr package

— jsta

3

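Edit: now includes the newer across() syntax

Here is another tidyverse solution, using filter(across()) or, previously, filter_at. The advantage is that it extends easily to more than one column.

Below is also a solution with filter_all that finds the string in any column, using diamonds as the example and looking for the string "V".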

library(tidyverse)

String in only one column

# for only one column... extendable to more than one creating a column list in `across` or `vars`!
mtcars %>% 
  rownames_to_column("type") %>% 
  filter(across(type, ~ !grepl('Toyota|Mazda', .))) %>%
  head()
#>                type  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1        Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 2    Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 3 Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 4           Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 5        Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 6         Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

The now superseded syntax for the same would be:

mtcars %>% 
  rownames_to_column("type") %>% 
  filter_at(.vars= vars(type), all_vars(!grepl('Toyota|Mazda',.)))

String in all columns:

# remove all rows where any column contains 'V'
diamonds %>%
  filter(across(everything(), ~ !grepl('V', .))) %>%
  head
#> # A tibble: 6 x 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
#> 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
#> 3  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
#> 4  0.3  Good    J     SI1      64      55   339  4.25  4.28  2.73
#> 5  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
#> 6  0.31 Ideal   J     SI2      62.2    54   344  4.35  4.37  2.71

The now superseded syntax for the same would be:

diamonds %>% 
  filter_all(all_vars(!grepl('V', .))) %>%
  head

I tried to find an across alternative for the following, but I did not immediately come up with a good solution:

    #get all rows where any column contains 'V'
    diamonds %>%
    filter_all(any_vars(grepl('V',.))) %>%
      head
    #> # A tibble: 6 x 10
    #>   carat cut       color clarity depth table price     x     y     z
    #>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    #> 1 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
    #> 2 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
    #> 3 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
    #> 4 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
    #> 5 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
    #> 6 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49

Update: thanks to user Petr Kajzar in this answer, here is an approach for the above as well:

diamonds %>%
   filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)
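
As a further option, newer dplyr versions (1.0.4 and later, as far as I know) provide if_any() and if_all(), which express "any column" and "all columns" conditions directly inside filter(). A minimal sketch on the same diamonds data:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep rows where at least one column contains "V".
any_v <- diamonds %>%
  filter(if_any(everything(), ~ grepl("V", .x)))

# Keep rows where no column contains "V".
no_v <- diamonds %>%
  filter(if_all(everything(), ~ !grepl("V", .x)))
```

The two results partition the data: every row lands in exactly one of them, which is a handy sanity check when switching between the any and all forms.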

— Tjebo