Statistical methods to more efficiently plot data when millions of points are present?

31

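R can take a long time to generate plots when millions of points are present. That is unsurprising, given that the points are plotted individually. Furthermore, such plots are often too cluttered and dense to be useful. Many of the points overlap and form a black mass, and a lot of time is spent drawing more points into that mass.

Are there any statistical alternatives to representing large-$n$ data in a standard scatterplot? I have considered a density plot, but what other alternatives are there?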

r data-visualization

— Alex Stoddard

1

For some solutions with line plots, see stats.stackexchange.com/questions/35220/….

— whuber

13

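This is a hard task with no ready-made solutions (mostly because a density plot is such a tempting fallback that no one really cares). So, what can you do?

If the points really overlap (i.e. they have exactly the same X and Y coordinates) and you are not using alpha, the best idea is simply to reduce the overlap with unique (with alpha, the values could be summed over such groups).

If they do not overlap exactly, you can manually round the coordinates to the nearest pixel and then use the previous method (though this is a dirty solution).

Finally, you can make a density plot and use it only to subsample the points in the densest areas. This, on the other hand, will not produce exactly the same plot and may introduce artifacts if it is not tuned carefully.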
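A minimal sketch of the rounding idea, under stated assumptions: the grid size (`nx`, `ny`), the helper `cell_index()`, and the simulated data are all hypothetical and chosen only for illustration. Instead of discarding the overlap, the sketch keeps a count per grid cell and maps it to a grey level, so the thinned plot still shows where the points pile up.

```r
# Sketch: collapse points onto a pixel-sized grid, keep one point per
# occupied cell, and carry the overlap count along as a grey level.
set.seed(42)
n <- 1e6
x <- rnorm(n)
y <- x + rnorm(n)

# Assumed resolution of the plotting region in "pixels" (hypothetical values).
nx <- 400
ny <- 400

# Hypothetical helper: map each value to an integer cell index in 1..ncell.
cell_index <- function(v, ncell) {
  r <- range(v)
  pmin(ncell, 1 + floor((v - r[1]) / (r[2] - r[1]) * ncell))
}
ix <- cell_index(x, nx)
iy <- cell_index(y, ny)

# Count how many points fall into each cell of the nx-by-ny grid.
cell_id  <- (iy - 1) * nx + ix
counts   <- tabulate(cell_id, nbins = nx * ny)
occupied <- which(counts > 0)

# Cell centres in data coordinates, for occupied cells only.
cx <- min(x) + (((occupied - 1) %%  nx) + 0.5) * diff(range(x)) / nx
cy <- min(y) + (((occupied - 1) %/% nx) + 0.5) * diff(range(y)) / ny

# Darker grey = more points in the cell (log scale keeps sparse cells visible).
shade <- grey(0.85 * (1 - log1p(counts[occupied]) / log1p(max(counts))))
plot(cx, cy, pch = 15, cex = 0.3, col = shade, xlab = "x", ylab = "y")
```

At most `nx * ny` symbols are drawn, however large `n` is, which is where the time saving comes from.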

5

Reducing the overlap with unique or by rounding can result in biased (deceptive) plots. It is important to indicate the amount of overlap through some graphical means, such as lightness or sunflower plots.

— whuber

44

Look at the hexbin package, which implements a paper/method by Dan Carr. The PDF vignette has more details, which I quote below:

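1. Overview

Hexagon binning is a form of bivariate histogram useful for visualizing the structure in datasets with large n. The underlying concept of hexagon binning is extremely simple:

1. The xy plane over the set (range(x), range(y)) is tessellated by a regular grid of hexagons.
2. The number of points falling in each hexagon is counted and stored in a data structure.
3. The hexagons with count > 0 are plotted using a color ramp or by varying the radius of the hexagon in proportion to the counts.

The underlying algorithm is extremely fast and effective for displaying the structure of datasets with $n \ge 10^6$.

If the size of the grid and the cuts in the color ramp are chosen in a clever fashion, then the structure inherent in the data should emerge in the binned plots. The same caveats apply to hexagon binning as apply to histograms, and care should be exercised in choosing the binning parameters.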
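As a rough illustration of the three steps quoted from the vignette, here is a small sketch using the hexbin package. The simulated data and the choice of `xbins = 50` are assumptions made for the example, not recommendations from the vignette, and the code only runs the binning if hexbin is installed.

```r
# Sketch: hexagon binning of one million simulated points.
# Assumes the hexbin package is installed; xbins = 50 is an arbitrary choice.
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- x^2 / 4 + rnorm(n)

if (requireNamespace("hexbin", quietly = TRUE)) {
  library(hexbin)
  # Steps 1 and 2: tessellate the plane and count the points per hexagon.
  hb <- hexbin(x, y, xbins = 50)
  # Only hexagons with count > 0 are stored; together they hold all n points.
  print(sum(hb@count))
  # Step 3: draw the non-empty hexagons, shaded by count.
  plot(hb)
}
```

Varying `xbins` plays the same role as varying the bin width of a histogram, so it is worth trying a few values before trusting any structure that shows up.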

— Dirk Eddelbuettel

4

Nice one. Just what the doctor ordered.

— Roman Luštrik

13

(+1) Also of interest: smoothScatter {graphics} and densCols {grDevices}. I can confirm that they work pretty well with thousands to millions of points from genetic data.
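A short sketch of how densCols might be used, with simulated data and a grey color ramp that are assumptions of this example. Note that every point is still drawn, so this addresses readability of the dense regions rather than drawing time.

```r
# Sketch: colour each point by the local 2-D kernel density estimate.
# densCols() is in grDevices and relies on the KernSmooth package,
# which ships with standard R installations.
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)

# One colour per point; the ramp from light to dark grey is an arbitrary choice.
cols <- densCols(x, y, colramp = colorRampPalette(c("grey80", "black")))
plot(x, y, col = cols, pch = 20, cex = 0.3)
```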

— chl

2

What if you have 3D data? (Too many points for scatterplot3d.)

— skan

To save others some time: of the two functions suggested in the comment above, I found smoothScatter to have much better defaults and behavior.

— Charlie

16

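I must admit that I do not fully understand your last paragraph:

"I am not looking for a density plot (although those are often useful); I want the same output as a simple plot call, but much faster than millions of overplotted points, if possible."

It is also unclear what type of plot (function) you are looking for.

Given that you have metric variables, you might find hexagon-binned plots or sunflower plots useful. For further references, see

Graphics of Large Datasets by Unwin / Theus / Hofmann
Quick-R on "High Density Scatterplots"
ggplot2's stat_hexbin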
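For the sunflower plot, a minimal base-R sketch follows. The sample size, the rounding step of 0.2, and the simulated variables are assumptions made for illustration. Sunflower plots draw one petal per coincident point, so they suit moderate overlap; with millions of points the data would need to be binned coarsely first.

```r
# Sketch: sunflower plot of coincident points (base graphics).
set.seed(1)
n <- 2000
# Round to a 0.2 grid so that many points share exactly the same coordinates.
x <- round(rnorm(n) * 5) / 5
y <- round((0.5 * x + rnorm(n)) * 5) / 5

# xyTable() tabulates how many points sit at each distinct (x, y) location.
tab <- xyTable(x, y)

# Each location gets one petal per point, so the overlap stays visible.
sunflowerplot(tab$x, tab$y, number = tab$number, xlab = "x", ylab = "y")
```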

— Bernd Weiss

6

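Another answer to this question is the rgl package, which can plot millions of points using OpenGL. You can also specify a point size (e.g. 3) and zoom out to see the centers of mass as monolithic blocks, or zoom in and see the structure of what used to be monolithic: the point sizes stay constant, but the on-screen distances between points depend on the zoom level. Alpha levels can also be used.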
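A hedged sketch of what such a call might look like. The simulated data and the particular `size` and `alpha` values are assumptions of this example; the plotting call is guarded so that it only runs in an interactive session with rgl installed.

```r
# Sketch: an interactive 3-D point cloud with rgl (OpenGL).
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- rnorm(n)
z <- x * y + rnorm(n)

# Only open a device when a user is present and rgl is available.
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  # size is the point size in pixels; a low alpha lets dense regions show through.
  rgl::plot3d(x, y, z, size = 3, alpha = 0.05)
}
```

Because the scene is rendered by the graphics hardware, rotating and zooming remain usable at sizes where a static 3D scatterplot would not.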

— Robi5

5

Here's a file I call bigplotfix.R. If you source it, it will define a wrapper for plot.xy which "compresses" the plot data when it is very large. The wrapper does nothing if the input is small, but if the input is large then it breaks it into chunks and just plots the maximum and minimum x and y value for each chunk. Sourcing bigplotfix.R also rebinds graphics::plot.xy to point to the wrapper (sourcing multiple times is OK).

Note that plot.xy is the "workhorse" function for the standard plotting methods like plot(), lines(), and points(). Thus you can continue to use these functions in your code with no modification, and your large plots will be automatically compressed.

This is some example output. It's essentially plot(runif(1e5)), with points and lines, and with and without the "compression" implemented here. The "compressed points" plot misses the middle region due to the nature of the compression, but the "compressed lines" plot looks much closer to the uncompressed original. The times are for the png() device; for some reason points are much faster in the png device than in the X11 device, but the speed-ups in X11 are comparable (X11(type="cairo") was slower than X11(type="Xlib") in my experiments).

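I wrote this because I got tired of accidentally running plot() on large datasets (e.g. a WAV file). In such cases I had to choose between waiting several minutes for the plot to finish and killing my R session with a signal (thereby losing my recent command history and variables). Now, if I remember to load this file before each session, I actually get a useful plot in these cases. A short warning message indicates when the plot data has been "compressed".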

# bigplotfix.R
# 28 Nov 2016

# This file defines a wrapper for plot.xy which checks if the input
# data is longer than a certain maximum limit. If it is, it is
# downsampled before plotting. For 3 million input points, I got
# speed-ups of 10-100x. Note that if you want the output to look the
# same as the "uncompressed" version, you should be drawing lines,
# because the compression involves taking maximum and minimum values
# of blocks of points (try running test_bigplotfix() for a visual
# explanation). Also, no sorting is done on the input points, so
# things could get weird if they are out of order.
test_bigplotfix = function() {
  oldpar=par(no.readonly=TRUE); # save only settable parameters so par(oldpar) restores cleanly
  par(mfrow=c(2,2))
  n=1e5;
  r=runif(n)
  bigplotfix_verbose<<-T
  mytitle=function(t,m) { title(main=sprintf("%s; elapsed=%0.4f s",m,t["elapsed"])) }
  mytime=function(m,e) { t=system.time(e); mytitle(t,m); }

  oldbigplotfix_maxlen = bigplotfix_maxlen
  bigplotfix_maxlen <<- 1e3;

  mytime("Compressed, points",plot(r));
  mytime("Compressed, lines",plot(r,type="l"));
  bigplotfix_maxlen <<- n
  mytime("Uncompressed, points",plot(r));
  mytime("Uncompressed, lines",plot(r,type="l"));
  par(oldpar);
  bigplotfix_maxlen <<- oldbigplotfix_maxlen
  bigplotfix_verbose <<- F
}

bigplotfix_verbose=F

downsample_xy = function(xy, n, xlog=F) {
  msg=if(bigplotfix_verbose) { message } else { function(...) { NULL } }
  msg("Finding range");
  r=range(xy$x);
  msg("Finding breaks");
  if(xlog) {
    breaks=exp(seq(from=log(r[1]),to=log(r[2]),length.out=n))
  } else {
    breaks=seq(from=r[1],to=r[2],length.out=n)
  }
  msg("Calling findInterval");
  ## cuts=cut(xy$x,breaks);
  # findInterval is much faster than cuts!
  cuts = findInterval(xy$x,breaks);
  if(0) {
    msg("In aggregate 1");
    dmax = aggregate(list(x=xy$x, y=xy$y), by=list(cuts=cuts), max)
    dmax$cuts = NULL;
    msg("In aggregate 2");
    dmin = aggregate(list(x=xy$x, y=xy$y), by=list(cuts=cuts), min)
    dmin$cuts = NULL;
  } else { # use data.table for MUCH faster aggregates
    # (see http://stackoverflow.com/questions/7722493/how-does-one-aggregate-and-summarize-data-quickly)
    suppressMessages(library(data.table))
    msg("In data.table");
    dt = data.table(x=xy$x,y=xy$y,cuts=cuts)
    msg("In data.table aggregate 1");
    dmax = dt[,list(x=max(x),y=max(y)),keyby="cuts"]
    dmax$cuts=NULL;
    msg("In data.table aggregate 2");
    dmin = dt[,list(x=min(x),y=min(y)),keyby="cuts"]
    dmin$cuts=NULL;
    #  ans = data_t[,list(A = sum(count), B = mean(count)), by = 'PID,Time,Site']
  }
  msg("In rep, rbind");
  # interleave rows (copied from a SO answer)
  # use the actual number of non-empty bins, which can be smaller than n
  k <- nrow(dmin)
  s <- rep(seq_len(k), each = 2) + (0:1) * k
  xy = rbind(dmin,dmax)[s,];
  xy
}

library(graphics);
# make sure we don't create infinite recursion if someone sources
# this file twice
if(!exists("old_plot.xy")) {
  old_plot.xy = graphics::plot.xy
}

bigplotfix_maxlen = 1e4

# formals copied from graphics::plot.xy
my_plot.xy = function(xy, type, pch = par("pch"), lty = par("lty"),
  col = par("col"), bg = NA, cex = 1, lwd = par("lwd"),
  ...) {

  if(bigplotfix_verbose) {
    message("In bigplotfix's plot.xy\n");
  }

  mycall=match.call();
  len=length(xy$x)
  if(len>bigplotfix_maxlen) {
    warning("bigplotfix.R (plot.xy): too many points (",len,"), compressing to ",bigplotfix_maxlen,"\n");
    xy = downsample_xy(xy, bigplotfix_maxlen, xlog=par("xlog"));
    mycall$xy=xy
  }
  mycall[[1]]=as.symbol("old_plot.xy");

  eval(mycall,envir=parent.frame());
}

# new binding solution adapted from Henrik Bengtsson
# https://stat.ethz.ch/pipermail/r-help/2008-August/171217.html
rebindPackageVar = function(pkg, name, new) {
  # assignInNamespace() no longer works here, thanks nannies
  ns=asNamespace(pkg)
  unlockBinding(name,ns)
  assign(name,new,envir=asNamespace(pkg),inherits=F)
  assign(name,new,envir=globalenv())
  lockBinding(name,ns)
}
rebindPackageVar("graphics", "plot.xy", my_plot.xy);

— Metamorphic

0

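Maybe I will get shunned for my method. I have bad memories of one of my research professors screaming at people for throwing away good data by converting it into categories (and of course I agree with that nowadays). Anyway, if you are talking about a scatterplot, I have had the same problem. When I have numerical data, it does not make much sense to categorize it for analysis. Visualization is a different story, though. What I have found works best for me is to (1) break the independent variable into groups using the cut function, playing around with the number of groups, and then (2) simply plot the DV against the cut version of the IV. R will generate box plots instead of that disgusting scatterplot. I recommend removing the outliers from the plot (use the outline = FALSE option in the plot command). Again, I would never waste perfectly good numerical data by categorizing it and then analyzing it; there are too many issues with doing that, although I know it is a touchy subject of debate. But doing it specifically to make some visual sense of the data has not done much harm that I have seen. I have plotted data as large as 10M points and still managed to make sense of it with this method. Hope that helps! Best regards!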
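The two steps can be sketched as follows. The variable names `iv` and `dv`, the simulated relationship, and the choice of 20 groups are assumptions made for illustration.

```r
# Sketch: bin the independent variable with cut(), then draw one box per bin.
set.seed(1)
n  <- 1e6
iv <- runif(n, 0, 10)
dv <- sin(iv) + rnorm(n, sd = 0.5)

# Step 1: split the IV into 20 equal-width groups (the number is a free choice).
iv_bins <- cut(iv, breaks = 20)

# Step 2: plot the DV against the binned IV; outline = FALSE hides the outliers,
# which would otherwise be drawn as individual points again.
boxplot(dv ~ iv_bins, outline = FALSE, las = 2, xlab = "", ylab = "dv")
```

Only 20 boxes are drawn regardless of `n`, and the medians and quartiles trace the shape of the relationship.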

— Mgarvey

0

For large time series, I have grown to love smoothScatter (part of base R no less). I often have to include some additional data, and preserving the basic plot API is really helpful, for instance:

set.seed(1)
ra <- rnorm(n = 100000, sd = 1, mean = 0)
smoothScatter(ra)
abline(v=25000, col=2)
text(25000, 0, "Event 1", col=2)

Which gives you (if you pardon the design):

It's always available and works well with enormous datasets, so it's nice to at least take a look at what you have.

— Josh Rumbut