Regex group capture in R with multiple capture groups

Question 1

In R, is it possible to extract the captured groups from a regular expression match? As far as I can tell, none of grep, grepl, regexpr, gregexpr, sub, or gsub return the group captures.

I need to extract key-value pairs from strings that are encoded like this:

\((.*?) :: (0\.[0-9]+)\)

I can always run multiple full-match greps, or do some outside (non-R) processing, but I was hoping to do it all within R. Is there a function, or a package providing such a function, that does this?

Question 2

str_match(), from the stringr package, will do this. It returns a character matrix with one column for each group in the match (and one for the whole match):

> s = c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
> str_match(s, "\\((.*?) :: (0\\.[0-9]+)\\)")
     [,1]                         [,2]       [,3]          
[1,] "(sometext :: 0.1231313213)" "sometext" "0.1231313213"
[2,] "(moretext :: 0.111222)"     "moretext" "0.111222"

Question 3

gsub does this, using your example:

gsub("\\((.*?) :: (0\\.[0-9]+)\\)","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"

You need to double-escape the backslashes inside the quoted string so that they reach the regex engine as single backslashes.

Hope this helps.

Question 4

Try regmatches() and regexec():

regmatches("(sometext :: 0.1231313213)",regexec("\\((.*?) :: (0\\.[0-9]+)\\)","(sometext :: 0.1231313213)"))
[[1]]
[1] "(sometext :: 0.1231313213)" "sometext"                   "0.1231313213"
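Because both regexec() and regmatches() are vectorized, the same call also works on a whole character vector. The following is a minimal sketch of turning the result into a data frame; the variable names (s, pattern, pairs) and the column names are illustrative assumptions, not part of the answer above.

```r
# Example input: two strings in the "(key :: value)" format from the question
s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
pattern <- "\\((.*?) :: (0\\.[0-9]+)\\)"

# regexec() records the positions of the whole match and of each group;
# regmatches() turns them into one character vector per input string
m <- regmatches(s, regexec(pattern, s))

# Each vector is (whole match, group 1, group 2); stack them as matrix rows
mat <- do.call(rbind, m)

# Columns 2 and 3 hold the captured key and value
pairs <- data.frame(key = mat[, 2],
                    value = as.numeric(mat[, 3]),
                    stringsAsFactors = FALSE)
pairs
```

Note that an input string without a match yields character(0), which rbind() silently drops, so the rows line up with the input only when every string matches.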

Question 5

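gsub() can do this and return only the capture groups.

However, for this to work you must explicitly match the elements outside your capture groups as well, as mentioned in the gsub() help:

(...) Elements of character vectors 'x' which are not substituted will be returned unchanged.

So if the text you want to select lies in the middle of a string, adding .* before and after the capture groups lets you return only the captured text.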

gsub(".*\\((.*?) :: (0\\.[0-9]+)\\).*","\\1 \\2", "(sometext :: 0.1231313213)")
[1] "sometext 0.1231313213"
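The "returned unchanged" behaviour quoted above is easy to miss when working on a vector, because non-matching elements come back looking like ordinary results. Here is a small sketch of one way to guard against that with grepl(); the input vector x and the choice of NA as the marker are assumptions made for illustration.

```r
pattern <- ".*\\((.*?) :: (0\\.[0-9]+)\\).*"

# Example input: one string containing a pair, one without any pair
x <- c("prefix (sometext :: 0.1231313213) suffix", "no pair here")

# gsub() rewrites the first element and leaves the second one untouched
out <- gsub(pattern, "\\1 \\2", x)

# Flag the elements that never matched instead of keeping them as-is
matched <- grepl(pattern, x)
out[!matched] <- NA_character_
out
```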

Question 6

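I like Perl-compatible regular expressions. Probably someone else does too...

Here is a function that uses Perl-compatible regular expressions and mirrors the behaviour of the functions I am used to in other languages: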

regexpr_perl <- function(expr, str) {
  # Match once with PCRE; capture positions are returned as attributes
  match <- regexpr(expr, str, perl = TRUE)
  matches <- character(0)
  if (attr(match, "match.length") >= 0) {
    capture_start <- attr(match, "capture.start")
    capture_length <- attr(match, "capture.length")
    # Element 1 is the whole match; the remaining elements are the groups
    matches <- character(1 + length(capture_start))
    matches[1] <- substr(str, match, match + attr(match, "match.length") - 1)
    # seq_along() covers zero, one, or many capture groups
    for (i in seq_along(capture_start)) {
      matches[i + 1] <- substr(str, capture_start[[i]], capture_start[[i]] + capture_length[[i]] - 1)
    }
  }
  matches
}
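With perl = TRUE, regexpr() also reports the names of named groups in a "capture.names" attribute, so the captures can be returned as a named vector. This is a rough sketch of that idea for a single input string; the group names key and value are assumptions chosen for this example.

```r
x <- "(sometext :: 0.1231313213)"

# (?<name>...) is PCRE syntax for a named capture group
m <- regexpr("\\((?<key>.*?) :: (?<value>0\\.[0-9]+)\\)", x, perl = TRUE)

# For one input string these attributes are 1 x n matrices; flatten them
cap_start <- as.vector(attr(m, "capture.start"))
cap_len <- as.vector(attr(m, "capture.length"))

# Cut each group out of the string and label it with its group name
groups <- substring(x, cap_start, cap_start + cap_len - 1)
names(groups) <- attr(m, "capture.names")
groups
```

If the pattern does not match, m is -1 and the capture positions are -1 as well, so check m before using the result.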

Question 7

This is how I ended up working around the problem. I used two separate regexes, one for the first capture group and one for the second, ran two gregexpr calls, and then pulled out the matched substrings:

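# Example input; replace with your own string
str <- "(sometext :: 0.1231313213)"

# Lookarounds match the key and the number without consuming the delimiters
regex.string <- "(?<=\\().*?(?= :: )"
regex.number <- "(?<= :: )\\d\\.\\d+"

match.string <- gregexpr(regex.string, str, perl = TRUE)[[1]]
match.number <- gregexpr(regex.number, str, perl = TRUE)[[1]]

# Cut each match out of the string using its start position and length
strings <- mapply(function(start, len) substr(str, start, start + len - 1),
                  match.string,
                  attr(match.string, "match.length"))
numbers <- mapply(function(start, len) as.numeric(substr(str, start, start + len - 1)),
                  match.number,
                  attr(match.number, "match.length"))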
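The substr() bookkeeping can be delegated to regmatches(), which accepts the match data returned by gregexpr(). A shorter sketch of the same lookaround approach follows; the input string (named input here) with two pairs is an assumed example.

```r
# Example input containing two "(key :: value)" pairs
input <- "(sometext :: 0.1231313213) (moretext :: 0.111222)"

# Lookbehind/lookahead keep the parentheses and " :: " out of the match
m_key <- gregexpr("(?<=\\().*?(?= :: )", input, perl = TRUE)
m_num <- gregexpr("(?<= :: )\\d\\.\\d+", input, perl = TRUE)

# regmatches() extracts the matched substrings; [[1]] picks the first input
strings <- regmatches(input, m_key)[[1]]
numbers <- as.numeric(regmatches(input, m_num)[[1]])

strings
numbers
```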

Question 8

A solution with strcapture from the utils package:

x <- c("key1 :: 0.01",
       "key2 :: 0.02")
strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
           x = x,
           proto = list(key = character(), value = double()))
#>    key value
#> 1 key1  0.01
#> 2 key2  0.02
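One property worth knowing is how strcapture() treats elements that do not match the pattern. The sketch below assumes a made-up middle element ("not a pair") and passes a zero-row data frame as proto; as far as I know, non-matching elements come back as rows of NA rather than being dropped.

```r
# Example input: the middle element does not follow the "key :: value" format
x <- c("key1 :: 0.01", "not a pair", "key2 :: 0.02")

# proto defines the column names and the type each group is converted to
res <- utils::strcapture(pattern = "(.*) :: (0\\.[0-9]+)",
                         x = x,
                         proto = data.frame(key = character(),
                                            value = double(),
                                            stringsAsFactors = FALSE))

# One row per input element, so the result stays aligned with x
res
```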

Question 9

As suggested above, with the stringr package this can be done using either str_match() or str_extract().

Adapted from the manual:

library(stringr)

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", 
             "239 923 8115 and 842 566 4692",
             "Work: 579-499-7527", "$1000",
             "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

Extracting and combining the groups:

str_extract_all(strings, phone, simplify = TRUE)
#      [,1]           [,2]          
# [1,] "219 733 8965" ""            
# [2,] "329-293-8753" ""            
# [3,] ""             ""            
# [4,] "239 923 8115" "842 566 4692"
# [5,] "579-499-7527" ""            
# [6,] ""             ""            
# [7,] "543.355.3679" ""

Showing the groups as an output matrix (columns 2 and up are the ones of interest):

str_match_all(strings, phone)
# [[1]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "219 733 8965" "219" "733" "8965"
# 
# [[2]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "329-293-8753" "329" "293" "8753"
# 
# [[3]]
#      [,1] [,2] [,3] [,4]
# 
# [[4]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "239 923 8115" "239" "923" "8115"
# [2,] "842 566 4692" "842" "566" "4692"
# 
# [[5]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "579-499-7527" "579" "499" "7527"
# 
# [[6]]
#      [,1] [,2] [,3] [,4]
# 
# [[7]]
#      [,1]           [,2]  [,3]  [,4]  
# [1,] "543.355.3679" "543" "355" "3679"
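For readers who would rather stay in base R, a similar "all matches with their groups" result can be sketched with gregexec(). This assumes R 4.1.0 or later, where gregexec() was introduced; the single input string is taken from the example vector above.

```r
# Requires R >= 4.1.0 for gregexec()
x <- "239 923 8115 and 842 566 4692"
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# For each input string, regmatches() returns a matrix with one COLUMN
# per match: row 1 is the whole match, the following rows are the groups
m <- regmatches(x, gregexec(phone, x, perl = TRUE))[[1]]

# Transpose so that each match becomes a row, as in str_match_all()
res <- unname(t(m))
res
```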

Question 10

This can be done with the unglue package, using the example from the selected answer:

# install.packages("unglue")
library(unglue)

s <- c("(sometext :: 0.1231313213)", "(moretext :: 0.111222)")
unglue_data(s, "({x} :: {y})")
#>          x            y
#> 1 sometext 0.1231313213
#> 2 moretext     0.111222

Or starting from a data frame:

df <- data.frame(col = s)
unglue_unnest(df, col, "({x} :: {y})",remove = FALSE)
#>                          col        x            y
#> 1 (sometext :: 0.1231313213) sometext 0.1231313213
#> 2     (moretext :: 0.111222) moretext     0.111222

You can also get the raw regex from the unglue pattern, optionally with named capture groups:

unglue_regex("({x} :: {y})")
#>             ({x} :: {y}) 
#> "^\\((.*?) :: (.*?)\\)$"

unglue_regex("({x} :: {y})",named_capture = TRUE)
#>                     ({x} :: {y}) 
#> "^\\((?<x>.*?) :: (?<y>.*?)\\)$"

More info: https://github.com/moodymudskipper/unglue/blob/master/README.md