What are VectorSource and VCorpus in R's 'tm' (text mining) package?

9

I'm not sure exactly what VectorSource and VCorpus are in the tm package.

The documentation on these is not clear. Can anyone explain them to me in simple terms?

r text-mining

— me

12

A "corpus" is a collection of text documents.

VCorpus in tm stands for "volatile" corpus: the corpus is held in memory and is destroyed when the R object containing it is destroyed.

Contrast this with PCorpus, or permanent corpus, which is stored outside of memory, for example in a database.
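A minimal sketch of the contrast, assuming the filehash package is installed (tm relies on it to back a PCorpus). The sample documents and the database file name are made up for the demo.

```r
library(tm)

docs <- c("First document.", "Second document.")

# Volatile corpus: lives entirely in memory as an ordinary R object.
vc <- VCorpus(VectorSource(docs))
print(class(vc))

# Permanent corpus: documents are written to a filehash database on disk.
# Assumption: 'filehash' is installed; the file name below is hypothetical.
if (requireNamespace("filehash", quietly = TRUE)) {
  db_file <- file.path(tempdir(), "pcorpus_demo.db")
  pc <- PCorpus(VectorSource(docs),
                dbControl = list(dbName = db_file, dbType = "DB1"))
  print(class(pc))
}
```

Removing `vc` with `rm(vc)` discards the volatile corpus, whereas the database file behind `pc` stays on disk.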

To create a VCorpus with tm, you pass a "Source" object as the argument to the VCorpus function. You can list the available sources with:
getSources()

[1] "DataframeSource" "DirSource" "URISource" "VectorSource"
[5] "XMLSource" "ZipSource"

A Source abstracts the input location, such as a directory or a URI. VectorSource is for character vectors only.
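To make the abstraction concrete, here is a small sketch that builds the same two documents once from a character vector and once from files in a directory. The directory and file names are invented for the demo and live under tempdir().

```r
library(tm)

# VectorSource: each element of a character vector becomes one document.
vec_src <- VectorSource(c("alpha beta", "gamma delta"))

# DirSource: each file in a directory becomes one document.
# The directory and file names are hypothetical, created just for this demo.
demo_dir <- file.path(tempdir(), "tm_source_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("alpha beta", file.path(demo_dir, "a.txt"))
writeLines("gamma delta", file.path(demo_dir, "b.txt"))
dir_src <- DirSource(demo_dir)

# Either source is handed to VCorpus() in exactly the same way.
from_vector <- VCorpus(vec_src)
from_dir <- VCorpus(dir_src)

print(length(from_vector))
print(length(from_dir))
```

The corpus constructor does not care where the text came from; only the Source object changes.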

A simple example:

Suppose you have a character vector:

input <- c('This is line one.', 'This is the second one')

Create the source: vecSource <- VectorSource(input)

Then create the corpus: VCorpus(vecSource)
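Put together as one runnable snippet, with a couple of extra lines to look inside the result. The variable name `corpus` is my own choice for illustration.

```r
library(tm)

# A character vector: each element will become one document.
input <- c("This is line one.", "This is the second one")

# Wrap the vector in a Source, then build the volatile corpus from it.
vecSource <- VectorSource(input)
corpus <- VCorpus(vecSource)

print(corpus)                 # summary of the corpus
print(length(corpus))         # number of documents
print(content(corpus[[1]]))   # text of the first document
```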

Hope this helps. You can read more here: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

— Indi

5

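In practical terms, there is a big difference between Corpus and VCorpus.

Corpus uses SimpleCorpus as its default, which means some features of VCorpus will not be available. One that is immediately evident is that SimpleCorpus will not let you keep dashes, underscores, or other punctuation in the terms of a document-term matrix: with Corpus (that is, SimpleCorpus) they are stripped automatically, while with VCorpus they are not. There are other limitations of Corpus that you will find in the help under ?SimpleCorpus.

Here is an example: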

# Read a text file from internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)

# load the data as a corpus
C.mlk <- Corpus(VectorSource(text))
C.mlk
V.mlk <- VCorpus(VectorSource(text))
V.mlk

The output will be:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 46
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 46

If you inspect the objects:

# inspect the content of the document
inspect(C.mlk[1:2])
inspect(V.mlk[1:2])

You will notice that Corpus unpacks the text:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 2
[1]                                                                                                                                            
[2] And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.


<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2
[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 0
[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 139

While VCorpus keeps it together inside document objects.
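If you want to see the same thing on a small input, here is a rough sketch with two made-up documents (the first one empty, as in the speech file). It assumes a tm version in which Corpus() on a VectorSource returns a SimpleCorpus.

```r
library(tm)

# Two made-up documents; the first is empty on purpose.
x <- c("", "I still have a dream.")

c_corp <- Corpus(VectorSource(x))   # SimpleCorpus in recent tm versions
v_corp <- VCorpus(VectorSource(x))

# Corpus: content() hands back the texts as a plain character vector.
print(content(c_corp))

# VCorpus: each element is a PlainTextDocument carrying its own metadata.
print(class(v_corp[[2]]))
print(content(v_corp[[2]]))   # the text of document 2
print(meta(v_corp[[2]]))      # its document-level metadata
```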

Now suppose you convert both to a document-term matrix:

dtm.C.mlk <- DocumentTermMatrix(C.mlk)
length(dtm.C.mlk$dimnames$Terms)
# 168

dtm.V.mlk <- DocumentTermMatrix(V.mlk)
length(dtm.V.mlk$dimnames$Terms)
# 187
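To see where the two term counts diverge, you can diff the term sets directly. This is a sketch on two made-up sentences rather than the full speech, using Terms() from tm. I have left out expected output because the exact terms can vary between tm versions.

```r
library(tm)

# Two made-up sentences containing punctuation.
x <- c("I have a dream today!", "Let freedom ring, let freedom ring.")

dtm_c <- DocumentTermMatrix(Corpus(VectorSource(x)))
dtm_v <- DocumentTermMatrix(VCorpus(VectorSource(x)))

print(Terms(dtm_c))
print(Terms(dtm_v))

# Terms that appear only in the VCorpus-based matrix.
print(setdiff(Terms(dtm_v), Terms(dtm_c)))
```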

Finally, let's look at the content. This is from Corpus:

grep("[[:punct:]]", dtm.C.mlk$dimnames$Terms, value = TRUE)
# character(0)

And from VCorpus:

grep("[[:punct:]]", dtm.V.mlk$dimnames$Terms, value = TRUE)

 [1] "alabama,"       "almighty,"      "brotherhood."   "brothers."     
 [5] "california."    "catholics,"     "character."     "children,"     
 [9] "city,"          "colorado."      "creed:"         "day,"          
[13] "day."           "died,"          "dream."         "equal."        
[17] "exalted,"       "faith,"         "gentiles,"      "georgia,"      
[21] "georgia."       "hamlet,"        "hampshire."     "happens,"      
[25] "hope,"          "hope."          "injustice,"     "justice."      
[29] "last!"          "liberty,"       "low,"           "meaning:"      
[33] "men,"           "mississippi,"   "mississippi."   "mountainside," 
[37] "nation,"        "nullification," "oppression,"    "pennsylvania." 
[41] "plain,"         "pride,"         "racists,"       "ring!"         
[45] "ring,"          "ring."          "self-evident,"  "sing."         
[49] "snow-capped"    "spiritual:"     "straight;"      "tennessee."    
[53] "thee,"          "today!"         "together,"      "together."     
[57] "tomorrow,"      "true."          "york."

Take a look at the words with punctuation. That's a huge difference, isn't it?
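If you do want a VCorpus without the trailing punctuation, you have to ask for it. Here is a minimal sketch of two ways I know of, on made-up sentences: transform the corpus with tm_map() and removePunctuation, or pass removePunctuation = TRUE in the control list of DocumentTermMatrix().

```r
library(tm)

# Two made-up sentences containing punctuation.
x <- c("I have a dream today!", "Let freedom ring, let freedom ring.")
v_corp <- VCorpus(VectorSource(x))

# Option 1: strip punctuation from the documents, then build the matrix.
v_clean <- tm_map(v_corp, removePunctuation)
dtm_1 <- DocumentTermMatrix(v_clean)

# Option 2: leave the corpus alone and strip punctuation while counting terms.
dtm_2 <- DocumentTermMatrix(v_corp, control = list(removePunctuation = TRUE))

# Check whether any term still contains punctuation.
print(grep("[[:punct:]]", Terms(dtm_1), value = TRUE))
print(grep("[[:punct:]]", Terms(dtm_2), value = TRUE))
```

Note that removePunctuation also drops dashes inside words by default, so "self-evident" would become "selfevident" unless you set preserve_intra_word_dashes = TRUE.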

— f0nzie