How can I perform SVD and PCA on big data?


29

I have a large dataset (about 8 GB). I would like to analyze it using machine learning, so I think I need to use SVD and PCA to reduce the data's dimensionality for efficiency. However, MATLAB and Octave cannot load such a large dataset.

What tools can I use to perform SVD on this amount of data?


Hi, and welcome to DS! Perhaps you could elaborate a bit on your dataset. How many rows and columns do you have? This may influence the possible solutions.
S. Kolassa - Reinstate Monica

23711341 rows and 8 columns. I could try to remove 1-2 columns; they do not seem to be related to my problem.
David S.

You should sample rows before columns here. Is there a reason you can't randomly sample rows to reduce the data size? I'm assuming the rows here correspond to users or something like that.
cwharland

Sorry if I did not make myself clear. My goal is to do PCA. I think SVD on sampled data cannot help me do PCA, right?
David S.

PCA is usually implemented by computing SVD on the covariance matrix. Computing the covariance matrix is an embarrassingly parallel task, so it should scale easily with the number of records.
Anony-Mousse

Answers:


41

First of all, dimensionality reduction is used when you have many correlated dimensions and want to reduce the problem size by rotating the data points onto a new orthogonal basis and keeping only the axes with the largest variance. With 8 variables (columns) your space is already low-dimensional, so reducing the number of variables further is unlikely to solve your technical issues with memory size, but may hurt dataset quality a lot. In your concrete case it's more promising to take a look at online learning methods. Roughly speaking, instead of working with the whole dataset, these methods take a small part of it (often referred to as "mini-batches") at a time and build a model incrementally. (I personally like to interpret the word "online" as a reference to some infinitely long source of data from the Internet, like a Twitter feed, where you just can't load the whole dataset at once.)
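As a toy illustration of the mini-batch idea, here is a sketch using scikit-learn's partial_fit interface; the batch generator and labels below are made up, and any streaming source of chunks would do:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def mini_batches(n_batches=10, batch_size=10_000, n_features=8):
    """Hypothetical stand-in for streaming chunks from disk or the network."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels
        yield X, y

clf = SGDClassifier()
classes = np.array([0, 1])                        # all labels must be declared up front
for X_batch, y_batch in mini_batches():
    # Each call updates the model with one mini-batch; the whole dataset
    # never has to fit into memory at once.
    clf.partial_fit(X_batch, y_batch, classes=classes)
```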

But what if you really want to apply a dimensionality reduction technique like PCA to a dataset that doesn't fit into memory? Normally a dataset is represented as a data matrix X of size n x m, where n is the number of observations (rows) and m is the number of variables (columns). Typically, memory problems come from only one of these two numbers.

Too many observations (n >> m)

When you have too many observations but the number of variables is small to moderate, you can build the covariance matrix incrementally. Indeed, typical PCA consists of constructing a covariance matrix of size m x m and applying singular value decomposition to it. With m = 1000 variables of type float64, the covariance matrix has size 1000 * 1000 * 8 ≈ 8 MB, which easily fits into memory and may be used with SVD. So you only need to build the covariance matrix without loading the entire dataset into memory - a pretty tractable task.
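For illustration, here is a minimal NumPy sketch of this idea; the CSV path, column contents and chunk size are placeholders, and the one-pass covariance formula is used for brevity:

```python
import numpy as np
import pandas as pd

# Accumulate the sum of rows and the sum of outer products chunk by chunk,
# so the full n x m dataset never has to be in memory at once.
n, sum_x, sum_xxT = 0, None, None

# "big_data.csv" is a placeholder path; chunksize controls the mini-batch size.
for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
    X = chunk.to_numpy(dtype=np.float64)
    if sum_x is None:
        m = X.shape[1]
        sum_x, sum_xxT = np.zeros(m), np.zeros((m, m))
    n += X.shape[0]
    sum_x += X.sum(axis=0)
    sum_xxT += X.T @ X

mean = sum_x / n
# Covariance from the accumulated sums (one-pass formula; a two-pass version
# is numerically safer, see the map-reduce answer further down).
cov = (sum_xxT - n * np.outer(mean, mean)) / (n - 1)

# SVD of the small m x m covariance matrix gives the principal components.
U, S, Vt = np.linalg.svd(cov)
k = 2                       # number of components to keep
components = Vt[:k]         # project new observations with (x - mean) @ components.T
```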

Alternatively, you can select a small representative sample from your dataset and approximate the covariance matrix on it. This matrix will have all the same properties as usual, just a little less accurate.

Too many variables (n << m)

On the other hand, when you have too many variables, the covariance matrix itself will not fit into memory. For example, if you work with 640x480 images, every observation has 640 * 480 = 307200 variables, which results in a 703 GB covariance matrix! That's definitely not something you want to keep in the memory of your computer, or even in the memory of your cluster. So you need to reduce the dimensionality without building a covariance matrix at all.

My favorite method for doing this is Random Projection. In short, if you have a dataset X of size n x m, you can multiply it by some sparse random matrix R of size m x k (with k << m) and obtain a new matrix X' of a much smaller size n x k with approximately the same properties as the original one. Why does it work? Well, you should know that PCA aims to find a set of orthogonal axes (principal components) and project your data onto the first k of them. It turns out that sparse random vectors are nearly orthogonal and thus may also be used as a new basis.

And, of course, you don't have to multiply the whole dataset X by R - you can translate every observation x into the new basis separately or in mini-batches.
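A rough sketch of this with scikit-learn's SparseRandomProjection, assuming some mini-batch source of observations (the generator and the sizes below are made up for illustration):

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

m, k = 10_000, 300                 # original and projected dimensionality (toy sizes)

# Fitting only fixes the sparse random matrix R of shape (m, k);
# it never needs to see the whole dataset, one dummy row is enough.
rp = SparseRandomProjection(n_components=k, random_state=0)
rp.fit(np.zeros((1, m)))

def batches(n_batches=3, batch_size=256):
    """Placeholder generator yielding mini-batches of observations, shape (b, m)."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        yield rng.random((batch_size, m))

# Project each mini-batch separately; the full n x m matrix is never materialized.
X_small = np.vstack([rp.transform(X_batch) for X_batch in batches()])
print(X_small.shape)               # (total rows, k)
```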

There's also a somewhat similar algorithm called Random SVD. I don't have any real experience with it, but you can find example code with explanations here.
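For reference, scikit-learn ships a randomized SVD routine; a minimal call looks roughly like this (the random matrix below is just a stand-in for real data):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

X = np.random.rand(1000, 5000)              # stand-in for a wide data matrix
U, S, Vt = randomized_svd(X, n_components=10, random_state=0)
print(U.shape, S.shape, Vt.shape)           # (1000, 10) (10,) (10, 5000)
```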


As a bottom line, here's a short checklist for dimensionality reduction of big datasets:

  1. If you don't have that many dimensions (variables), simply use online learning algorithms.
  2. If there are many observations, but a moderate number of variables (the covariance matrix fits into memory), construct the matrix incrementally and use normal SVD.
  3. If the number of variables is too high, use incremental algorithms.

3
Overall, I like your answer, but the opening sentence is not quite right. PCA isn't suited for many dimensions with low variance; rather, it is suited for many dimensions with correlated variance. For a given data set, the variance could be high in all dimensions, but as long as there is high covariance, PCA can still yield significant dimensionality reduction.
bogatron

1
@bogatron: good catch, thanks. In fact, I was referring to high/low variance in some dimensions, possibly not the original ones. E.g. in this picture those dimensions are defined by the 2 arrows, not the original x/y axes. PCA seeks to find these new axes and sorts them by the variance along each axis. Anyway, as you pointed out, it was bad wording, so I tried to reformulate my idea. Hopefully it's clearer now.
ffriend

That makes sense to me. +1.
bogatron

7

Don't bother.

First rule of programming - which also applies to data science: get everything working on a small test problem.

So take a random sample of your data of, say, 100,000 rows. Try different algorithms etc. Once you have everything working to your satisfaction, you can try larger (and larger) datasets - and see how the test error decreases as you add more data.

Furthermore, you do not want to apply SVD to only 8 columns: you apply it when you have a lot of columns.


1
+1 for "you do not want to apply SVD to only 8 columns: you apply it when you have a lot of columns".
S. Kolassa - Reinstate Monica

6

PCA is usually implemented by computing SVD on the covariance matrix.

Computing the covariance matrix is an embarrassingly parallel task, so it scales linearly with the number of records and is trivial to distribute across multiple machines!

Just do one pass over your data to compute the means, then a second pass to compute the covariance matrix. This can be done with map-reduce easily - essentially it's the same as computing the means again. The sum terms in the covariance are trivial to parallelize! You only need to pay attention to numerics when summing a lot of values of similar magnitude.
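A rough sketch of that two-pass scheme in plain NumPy, with a list of chunks standing in for the partitions a map-reduce job would process:

```python
import numpy as np

# Pretend these are the partitions that would live on different machines.
chunks = [np.random.rand(50_000, 8) for _ in range(4)]

# Pass 1 (map: per-chunk sums and counts; reduce: add them up) -> global mean.
n = sum(c.shape[0] for c in chunks)
mean = sum(c.sum(axis=0) for c in chunks) / n

# Pass 2 (map: per-chunk scatter matrices; reduce: add them up) -> covariance.
scatter = sum((c - mean).T @ (c - mean) for c in chunks)
cov = scatter / (n - 1)

# PCA = eigendecomposition (or SVD) of the small m x m covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by explained variance
components = eigvecs[:, order]
```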

Things get different when you have a huge number of variables. But on an 8 GB system, you should be able to run PCA on up to 20,000 dimensions in memory with the BLAS libraries. But then you may run into the problem that PCA isn't all that reliable anymore, because it has too many degrees of freedom. In other words: it overfits easily. I've seen the recommendation of having at least 10*d*d records (or was it d^3?). So for 10,000 dimensions, you would need at least a billion records (of 10,000 dimensions each... that is a lot!) for the result to be statistically reliable.


1

Although you can probably find some tools that will let you do it on a single machine, you're getting into the range where it makes sense to consider "big data" tools like Spark, especially if you think your dataset might grow. Spark has a component called MLlib which supports PCA and SVD. The documentation has examples.
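For orientation, a minimal PySpark sketch of MLlib's PCA might look like this (the toy DataFrame below is made up just to show the API; in practice you would load your own data into a DataFrame of feature vectors):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca-example").getOrCreate()

# Toy data: each row is one observation stored as an MLlib dense vector.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0, 7.0]),),
     (Vectors.dense([2.0, 1.0, 5.0]),),
     (Vectors.dense([4.0, 3.0, 2.0]),)],
    ["features"],
)

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)                  # the computation is distributed across the cluster
model.transform(df).select("pca_features").show(truncate=False)
```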


1

We implemented SVD on a larger dataset using PySpark. We also compared consistency across different packages. Here is the link.


0

I would recommend Python: if you lazily evaluate the file, you will have a minuscule memory footprint, and NumPy/SciPy give you access to all of the tools Octave/MATLAB would.
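For instance, a memory-mapped array lets NumPy touch only the slices you actually read; the path, dtype and shape below are placeholders for however the data is actually stored (for a CSV, chunked reading plays the same role):

```python
import numpy as np

# Memory-mapped view onto a raw binary file of float64 values: nothing is
# loaded until a slice is actually accessed. Path and shape are placeholders.
X = np.memmap("big_data.bin", dtype=np.float64, mode="r", shape=(23_711_341, 8))

# Process the rows in slices; each slice pulls only that window into memory.
chunk = 1_000_000
col_sums = np.zeros(X.shape[1])
for start in range(0, X.shape[0], chunk):
    col_sums += np.asarray(X[start:start + chunk]).sum(axis=0)
means = col_sums / X.shape[0]
```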

Licensed under cc by-sa 3.0 with attribution required.