Scraping HTML tables into R data frames using the XML package

153

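How can I scrape HTML tables using the XML package?

Take, for example, the Wikipedia page on the Brazilian soccer team. I would like to read it into R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data.frame. How can I do this?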

— Eduardo Leoni

11

To work out the XPath selectors, check out selectorgadget.com/ - it's awesome

— hadley

144

… or a shorter try:

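library(XML)
library(RCurl)
library(rlist)
# Download the page source (ssl.verifypeer = FALSE skips certificate verification)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
# Parse every table on the page into a list of data frames
tables <- readHTMLTable(theurl)
# Drop the NULL entries from the list
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# Count the rows of each remaining table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))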

The table picked is the longest one on the page:

tables[[which.max(n.rows)]]

— Jim G.

The readHTMLTable help also provides an example of reading a plain-text table out of an HTML PRE element using htmlParse(), getNodeSet(), textConnection() and read.table().

— Dave X
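
The comment above names four functions without showing how they fit together. The following is a minimal sketch of that combination, not the example from the help page itself: the inline HTML string and its contents are made up for illustration, and it assumes the text inside the PRE element is whitespace-separated with a header row.

```r
library(XML)

# Hypothetical page: a whitespace-separated table inside a <pre> element
html <- "<html><body><pre>
team played won
Argentina 94 36
Paraguay 72 44
</pre></body></html>"

# Parse the string as HTML and locate the first <pre> node
doc <- htmlParse(html, asText = TRUE)
pre <- getNodeSet(doc, "//pre")[[1]]

# Feed the node's text to read.table through a text connection
con <- textConnection(xmlValue(pre))
tbl <- read.table(con, header = TRUE, stringsAsFactors = FALSE)
close(con)

print(tbl)
```

For a real page, the string passed to htmlParse() would be replaced by the downloaded source, and the XPath expression may need narrowing if the page contains more than one PRE element.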

48

library(RCurl)
library(XML)

# Download page using RCurl
# You may need to set proxy details, etc.,  in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})

# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body 
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]

#In this case, the required table is the only one with class "wikitable sortable"  
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable  <- tables[which(tableclasses=="wikitable sortable")]$table

#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))

# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
   tablerow <- thetable$children[[i]]$children
   opponent <- tablerow[[1]]$children[[2]]$children$text$value
   others <- unname(sapply(tablerow[-1], function(x) x$children$text$value)) 
   content <- rbind(content, c(opponent, others))
}

# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)

Edited to add:

Sample output

                     Opponent Played Won Drawn Lost Goals for Goals against  % Won
    1               Argentina     94  36    24   34       148           150  38.3%
    2                Paraguay     72  44    17   11       160            61  61.1%
    3                 Uruguay     72  33    19   20       127            93  45.8%
    ...

— Richie Cotton

7

For anyone else lucky enough to find this post: this script will likely not run unless the user adds their "User-Agent" information, as described in this other helpful post: stackoverflow.com/questions/9056705/…

— Rguy

26

Another option using XPath.

library(RCurl)
library(XML)

theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)

# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))

# Clean up the results
content[,1] <- gsub("Â ", "", content[,1])
tablehead <- gsub("Â ", "", tablehead)
names(content) <- tablehead

This produces the following result:

> head(content)
   Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina     94  36    24   34       148           150 38.3%
2  Paraguay     72  44    17   11       160            61 61.1%
3   Uruguay     72  33    19   20       127            93 45.8%
4     Chile     64  45    12    7       147            53 70.3%
5      Peru     39  27     9    3        83            27 69.2%
6    Mexico     36  21     6    9        69            34 58.3%

— learnr

Nice call on using XPath. Minor point: you can slightly simplify the path argument by changing //*/ to //, e.g. "//table[@class='wikitable sortable']/tr/th"

— Richie Cotton
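
To see the simplified path in action without depending on the live Wikipedia page, here is a small sketch run against an inline HTML string. The table contents are a made-up two-row excerpt, and the column count is taken from the number of header cells rather than hard-coded.

```r
library(XML)

# Hypothetical excerpt standing in for the downloaded page
html <- "<html><body>
<table class='wikitable sortable'>
<tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
<tr><td>Paraguay</td><td>72</td></tr>
</table></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Same selectors as in the answer, with the leading //*/ shortened to //
heads <- xpathSApply(doc, "//table[@class='wikitable sortable']/tr/th", xmlValue)
cells <- xpathSApply(doc, "//table[@class='wikitable sortable']/tr/td", xmlValue)

# Reshape the flat character vector into one row per table row
content <- as.data.frame(matrix(cells, ncol = length(heads), byrow = TRUE),
                         stringsAsFactors = FALSE)
names(content) <- heads
print(content)
```

Note that the class attribute must match exactly; a table whose class is, say, "wikitable sortable collapsible" would not be selected by this expression.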

"Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice." [2] Is there a way to implement this with this method?

— pssguy

2

options(RCurlOptions = list(useragent = "zzzz")). See also the "Runtime" section of omegahat.org/RCurl/FAQ.html for other alternatives and discussion.

— learnr
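
As an alternative to the global options() setting, the user agent can also be supplied per request. This is a rough sketch only: the user-agent string and the fetch_page() helper are hypothetical names introduced for illustration, and the actual download is left commented out so the snippet runs without network access.

```r
library(RCurl)

# Hypothetical identifying string; replace the contact details with your own
ua <- "my-r-scraper/0.1 (contact: me@example.com)"

# Bundle the curl options once so they can be reused across requests
opts <- curlOptions(useragent = ua)

# Hypothetical helper: download a page using the options above
fetch_page <- function(url) getURL(url, .opts = opts)

# webpage <- fetch_page("http://en.wikipedia.org/wiki/Brazil_national_football_team")
```

Passing the options per call keeps the setting local to the script instead of changing the behaviour of every later RCurl request in the session.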

25

rvest, along with xml2, is another popular package for parsing HTML web pages.

library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file<-read_html(theurl)
tables<-html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)

The syntax is easier to use than the xml package, and for most web pages the package provides all of the options one needs.

— Dave2e

read_html returns the error "'file:///Users/grieb/Auswertungen/tetyana-snp-2016/data/snp-nexus/15/SNP%20Annotation%20Tool.html' does not exist in current working directory ('/Users/grieb/Auswertungen/tetyana-snp-2016/code')."

— scs
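
The rvest approach from the answer above can be tried offline on a literal HTML string, which makes it easier to check what html_table() returns before pointing it at a real page. This is a sketch under that assumption; the table contents are invented for illustration.

```r
library(rvest)

# Hypothetical one-row table standing in for a downloaded page
html <- "<html><body><table>
<tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
</table></body></html>"

# read_html() accepts literal markup as well as a URL or file path
page <- read_html(html)

# html_table() on a node set returns a list with one table per node
tables <- html_table(html_nodes(page, "table"))
tbl <- tables[[1]]
print(tbl)
```

On a real page the index into the list depends on how many tables precede the one of interest, so inspecting length(tables) first is usually worthwhile.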