LASSO and ridge from the Bayesian perspective: what about the tuning parameter?

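It is said that penalized regression estimators such as LASSO and ridge correspond to Bayesian estimators with certain priors. I guess (as I do not know enough about Bayesian statistics) that for a fixed tuning parameter there exists a concrete corresponding prior.

Now a frequentist would optimize the tuning parameter by cross-validation. Is there a Bayesian equivalent of doing so, and is it used at all? Or does the Bayesian approach effectively fix the tuning parameter before seeing the data? (I guess the latter would be detrimental to predictive performance.)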

bayesian lasso ridge-regression

— Richard Hardy
I would think that a fully Bayesian approach would start with a given prior and not modify it. But there is also an empirical Bayes approach that optimizes over the hyperparameter values (e.g. see stats.stackexchange.com/questions/24799).

— amoeba says Reinstate Monica

Follow-up question (could be part of the main Q): do there exist priors on the regularization parameter that somehow replace the cross-validation process?

— kjetil b halvorsen

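Bayesians can put a prior on the tuning parameter, since it usually corresponds to a variance parameter. This is usually what is done to avoid CV in order to stay fully Bayesian. Alternatively, you can use REML to optimize the regularization parameter.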

— guy

PS: to those aiming at the bounty, note my comment: I want to see an explicit answer that shows a prior which induces a MAP estimate equivalent to frequentist cross-validation.

— statslearner2

@statslearner2 I think it does address Richard's question well. Your bounty seems to be focused on a narrower aspect (about a hyperprior) than Richard's Q.

— amoeba says Reinstate Monica

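It is said that penalized regression estimators such as LASSO and ridge correspond to Bayesian estimators with certain priors.

Yes, that is correct. Whenever we have an optimisation problem involving the maximisation of the log-likelihood function plus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the penalty function is taken to be the logarithm of a prior kernel.$^\dagger$ To see this, suppose we have a penalty function $w$ using a tuning parameter $\lambda$. The objective function in these cases can then be written as: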

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta|\lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta|\mathbf{x}, \lambda) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

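where we use the prior $\pi(\theta|\lambda) \propto \exp ( -w(\theta|\lambda))$. Observe that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you are undertaking classical optimisation with a fixed tuning parameter, this is equivalent to undertaking a Bayesian optimisation with a fixed hyperparameter. For LASSO and ridge regression, the prior equivalents of the penalty functions are: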

$\begin{equation} \begin{aligned} \text{LASSO Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Laplace} \Big( 0, \frac{1}{\lambda} \Big) = \prod_{k=1}^m \frac{\lambda}{2} \cdot \exp ( -\lambda |\theta_k| ), \\[6pt] \text{Ridge Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Normal} \Big( 0, \frac{1}{2\lambda} \Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \cdot \exp ( -\lambda \theta_k^2 ). \\[6pt] \end{aligned} \end{equation}$

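The former method penalises the regression coefficients according to their absolute magnitude, which is equivalent to imposing a Laplace prior located at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is equivalent to imposing a normal prior located at zero.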
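The equivalence can be checked numerically. The following is a minimal sketch, not part of the original answer: it assumes a single regression coefficient, a Gaussian likelihood with known noise variance `sigma2`, and simulated data. All function names are hypothetical. It compares the closed-form penalised estimates with a brute-force search for the posterior mode under the priors given above.

```python
import math
import random

def sufficient_stats(xs, ys):
    """Return (Sxx, Sxy, Syy) for the one-coefficient model y = theta * x + noise."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    return sxx, sxy, syy

def log_lik(theta, stats, sigma2):
    """Gaussian log-likelihood in theta, up to an additive constant."""
    sxx, sxy, syy = stats
    return -(syy - 2.0 * theta * sxy + theta * theta * sxx) / (2.0 * sigma2)

def log_prior_normal(theta, lam):
    """Log density of Normal(0, 1/(2*lam)), the ridge prior."""
    return 0.5 * math.log(lam / math.pi) - lam * theta * theta

def log_prior_laplace(theta, lam):
    """Log density of Laplace(0, 1/lam), the LASSO prior."""
    return math.log(lam / 2.0) - lam * abs(theta)

def ridge_estimate(stats, sigma2, lam):
    """Maximiser of log_lik(theta) - lam * theta**2."""
    sxx, sxy, _ = stats
    return sxy / (sxx + 2.0 * sigma2 * lam)

def lasso_estimate(stats, sigma2, lam):
    """Maximiser of log_lik(theta) - lam * |theta| (soft thresholding)."""
    sxx, sxy, _ = stats
    shrunk = max(abs(sxy) - sigma2 * lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx

def grid_map(log_prior, stats, sigma2, lam, lo=-3.0, hi=3.0, steps=60000):
    """Brute-force posterior mode on an evenly spaced grid over [lo, hi]."""
    width = (hi - lo) / steps
    grid = (lo + i * width for i in range(steps + 1))
    return max(grid, key=lambda t: log_lik(t, stats, sigma2) + log_prior(t, lam))

# Simulated data (an assumption of this sketch): true coefficient 1.5, unit noise.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(30)]
ys = [1.5 * x + rng.gauss(0.0, 1.0) for x in xs]
stats, sigma2, lam = sufficient_stats(xs, ys), 1.0, 4.0

ridge_penalised = ridge_estimate(stats, sigma2, lam)
ridge_map = grid_map(log_prior_normal, stats, sigma2, lam)
lasso_penalised = lasso_estimate(stats, sigma2, lam)
lasso_map = grid_map(log_prior_laplace, stats, sigma2, lam)

print("ridge:", ridge_penalised, "MAP under normal prior:", ridge_map)
print("lasso:", lasso_penalised, "MAP under Laplace prior:", lasso_map)
```

The two numbers on each line should agree up to the grid resolution, because for fixed $\lambda$ the log prior differs from the negative penalty only by a term that does not depend on $\theta$.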

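Now a frequentist would optimize the tuning parameter by cross-validation. Is there a Bayesian equivalent of doing so, and is it used at all?

So long as the frequentist method can be posed as an optimisation problem (rather than, say, including a hypothesis test or something similar), there will be a Bayesian analogy using an equivalent prior. Just as frequentists may treat the tuning parameter $\lambda$ as unknown and estimate it from the data, the Bayesian may similarly treat the hyperparameter $\lambda$ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function: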

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - h(\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \cdot \exp ( -h(\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda)}{\iint L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda) \, d\theta \, d\lambda} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta, \lambda|\mathbf{x}) + \text{const}. \\[6pt] \end{aligned} \end{equation}$

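This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating the hyperparameter as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$.)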
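As an illustration of treating $\lambda$ as unknown, here is a minimal sketch under assumptions that are not in the original answer: a single coefficient, known noise variance `sigma2`, the normal (ridge) prior from above, and a Gamma hyperprior on $\lambda$ with hypothetical shape `a` and rate `b`, so that $h(\lambda) = -\ln \pi(\lambda)$. The $\lambda$-dependent normalising term of the normal prior is kept explicitly, since it is no longer a constant once $\lambda$ is free. The joint posterior mode is found by coordinate ascent.

```python
import math
import random

def joint_log_posterior(theta, lam, sxx, sxy, syy, sigma2, a, b):
    """ln pi(theta, lam | x) up to an additive constant."""
    log_lik = -(syy - 2.0 * theta * sxy + theta * theta * sxx) / (2.0 * sigma2)
    # Normal(0, 1/(2*lam)) prior on theta, including its lam-dependent normaliser.
    log_prior = 0.5 * math.log(lam) - lam * theta * theta
    # Gamma(shape=a, rate=b) hyperprior on lam (an assumption of this sketch).
    log_hyper = (a - 1.0) * math.log(lam) - b * lam
    return log_lik + log_prior + log_hyper

def joint_map(sxx, sxy, sigma2, a, b, iters=200):
    """Coordinate ascent; each step is the exact maximiser in one coordinate."""
    theta, lam = 0.0, 1.0
    for _ in range(iters):
        # For fixed lam this is the ridge estimate.
        theta = sxy / (sxx + 2.0 * sigma2 * lam)
        # For fixed theta, setting the lam-derivative to zero (requires a > 1/2).
        lam = (a - 0.5) / (b + theta * theta)
    return theta, lam

# Simulated data (an assumption of this sketch): true coefficient 1.5, unit noise.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(30)]
ys = [1.5 * x + rng.gauss(0.0, 1.0) for x in xs]
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
syy = sum(y * y for y in ys)
sigma2, a, b = 1.0, 2.0, 1.0

theta_hat, lam_hat = joint_map(sxx, sxy, sigma2, a, b)
print("joint MAP: theta =", theta_hat, "lambda =", lam_hat)
```

No cross-validation is involved here: the value of $\lambda$ is determined jointly with $\theta$ by the likelihood, the prior, and the hyperprior.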

(Comment from statslearner2 below) I'm looking for numerical equivalent MAP estimates. For instance, for a fixed penalty Ridge there is a gaussian prior that will give me the MAP estimate exactly equal the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me the MAP estimate which is similar to the CV-ridge estimate?

Before proceeding to look at $K$-fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $\mathbf{x}$. If you are willing to allow improper priors then the scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.

In the above form of model, involving a penalty function with a tuning parameter, $K$-fold cross-validation is commonly used to estimate the tuning parameter $\lambda$. For this method you partition the data vector $\mathbf{x}$ into $K$ sub-vectors $\mathbf{x}_1,...,\mathbf{x}_K$. For each sub-vector $k=1,...,K$ you fit the model with the "training" data $\mathbf{x}_{-k}$ and then measure the fit of the model with the "testing" data $\mathbf{x}_k$. In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared to the actual testing data to give a measure of "loss":

$\begin{matrix} \text{Estimator} & & \hat{\theta}(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Predictions} & & \hat{\mathbf{x}}_k(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Testing loss} & & \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda). \\[6pt] \end{matrix}$

The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:

$\mathscr{L}(\mathbf{x}, \lambda) = \sum_k \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda)$

One then estimates the tuning parameter by minimising the overall loss measure:

$\hat{\lambda} \equiv \hat{\lambda}(\mathbf{x}) \equiv \underset{\lambda}{\text{arg min }} \mathscr{L}(\mathbf{x}, \lambda).$
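The procedure above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes a single-coefficient ridge model, squared-error testing loss, an interleaved assignment of observations to folds, and a small hypothetical grid of candidate values for $\lambda$.

```python
import random

def ridge_fit(xs, ys, sigma2, lam):
    """Estimator theta_hat(x_{-k}, lam): maximiser of the penalised log-likelihood."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + 2.0 * sigma2 * lam)

def cv_loss(xs, ys, sigma2, lam, k):
    """Overall loss: squared prediction error summed over the k folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = range(fold, n, k)  # testing indices fold, fold + k, ...
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        theta = ridge_fit(train_x, train_y, sigma2, lam)
        # Testing loss for this fold: predictions theta * x against held-out y.
        total += sum((ys[i] - theta * xs[i]) ** 2 for i in held_out)
    return total

def cv_select(xs, ys, sigma2, lam_grid, k):
    """lambda_hat: the grid value that minimises the overall loss."""
    return min(lam_grid, key=lambda lam: cv_loss(xs, ys, sigma2, lam, k))

# Simulated data (an assumption of this sketch): weak signal, unit noise.
rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(24)]
ys = [0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
sigma2, k = 1.0, 4
lam_grid = [0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]

lam_hat = cv_select(xs, ys, sigma2, lam_grid, k)
print("cross-validated lambda:", lam_hat)
print("loss at lambda_hat:", cv_loss(xs, ys, sigma2, lam_hat, k))
```

Note that `cv_loss` depends only on the data and $\lambda$, not on any particular value of $\theta$, which is what allows the two optimisations to be combined later.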

We can see that this is an optimisation problem, and so we now have two separate optimisation problems (i.e., the one described in the sections above for $\theta$, and the one described here for $\lambda$). Since the latter optimisation does not involve $\theta$, we can combine these optimisations into a single problem, with some technicalities that I discuss below. To do this, consider the optimisation problem with objective function:

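$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda), \end{aligned} \end{equation}$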

where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$-fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ screws up the optimisation problem, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.

From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$-fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda)}{\iint L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda) \, d\theta \, d\lambda} \Bigg) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where $L_\mathbf{x}^*(\theta, \lambda) \propto \exp( \ell_\mathbf{x}(\theta) - \delta \mathscr{L}(\mathbf{x}, \lambda))$ and $\pi (\theta, \lambda) \propto \exp( -w(\theta|\lambda))$ , with a fixed (and very large) hyper-parameter $\delta$ .
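The effect of the weight $\delta$ can be seen in a small experiment. The following is a minimal sketch under the same assumptions as before (single-coefficient ridge model, known noise variance, squared-error loss, a hypothetical grid for $\lambda$, simulated data). Because the loss term does not involve $\theta$, the inner maximisation over $\theta$ is done in closed form for each candidate $\lambda$.

```python
import random

def ridge_fit(xs, ys, sigma2, lam):
    """Maximiser over theta of l(theta) - lam * theta**2 for fixed lam."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + 2.0 * sigma2 * lam)

def penalised_loglik(theta, xs, ys, sigma2, lam):
    """l(theta) - w(theta | lam) with a Gaussian likelihood and ridge penalty."""
    rss = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return -rss / (2.0 * sigma2) - lam * theta * theta

def cv_loss(xs, ys, sigma2, lam, k):
    """Overall k-fold squared-error loss; depends on the data and lam only."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = range(fold, n, k)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        theta = ridge_fit(train_x, train_y, sigma2, lam)
        total += sum((ys[i] - theta * xs[i]) ** 2 for i in held_out)
    return total

def combined_argmax(xs, ys, sigma2, lam_grid, k, delta):
    """Maximise l(theta) - w(theta | lam) - delta * L(x, lam) over theta and the grid."""
    best = None
    for lam in lam_grid:
        theta = ridge_fit(xs, ys, sigma2, lam)  # inner maximisation over theta
        value = penalised_loglik(theta, xs, ys, sigma2, lam)
        value -= delta * cv_loss(xs, ys, sigma2, lam, k)
        if best is None or value > best[0]:
            best = (value, theta, lam)
    return best[1], best[2]

# Simulated data (an assumption of this sketch): weak signal, unit noise.
rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(24)]
ys = [0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
sigma2, k = 1.0, 4
lam_grid = [0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]

for delta in [0.0, 1.0, 1e3, 1e9]:
    theta_d, lam_d = combined_argmax(xs, ys, sigma2, lam_grid, k, delta)
    print("delta =", delta, "-> lambda =", lam_d, ", theta =", theta_d)
```

With $\delta = 0$ the penalty is simply switched off (the smallest $\lambda$ wins), while for very large $\delta$ the selected $\lambda$ is, up to ties in the loss, the one that cross-validation would pick, with $\theta$ equal to the corresponding ridge estimate.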

$^\dagger$ This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.

— Reinstate Monica
Ok +1 already, but for the bounty I'm looking for these more precise answers.

— statslearner2

1. I do not get how (since frequentists generally use classical hypothesis tests, etc., which have no Bayesian equivalent) connects to the rest of what I or you are saying; parameter tuning has nothing to do with hypothesis tests, or does it? 2. Do I understand you correctly that there is no Bayesian equivalent to frequentist regularized estimation when the tuning parameter is selected by cross validation? What about empirical Bayes that amoeba mentions in the comments to the OP?

— Richard Hardy

3. Since regularization with cross validation seems to be quite effective for, say, prediction, doesn't point 2. suggest that the Bayesian approach is somehow inferior?

— Richard Hardy

@Ben, thanks for your explicit answer and the subsequent clarifications. You have once again done a wonderful job! Regarding 3., yes, it was quite a jump; it certainly is not a strict logical conclusion. But looking at your points w.r.t. 2. (that a Bayesian method can approximate the frequentist penalized optimization with cross validation), I no longer think that Bayesian must be "inferior". The last quibble on my side is, could you perhaps explain how the last, complicated formula could arise in practice in the Bayesian paradigm? Is it something people would normally use or not?

— Richard Hardy

@Ben (ctd) My problem is that I know little about Bayes. Once it gets technical, I may easily lose the perspective. So I wonder whether this complicated analogy (the last formula) is something that is just a technical possibility or rather something that people routinely use. In other words, I am interested in whether the idea behind cross validation (here in the context of penalized estimation) is resounding in the Bayesian world, whether its advantages are utilized there. Perhaps this could be a separate question, but a short description will suffice for this particular case.

— Richard Hardy

Indeed most penalized regression methods correspond to placing a particular type of prior on the regression coefficients. For example, you get the LASSO using a Laplace prior, and the ridge using a normal prior. The tuning parameters are the "hyperparameters" under the Bayesian formulation, on which you can place an additional prior in order to estimate them; for example, in the case of the ridge it is often assumed that the inverse variance of the normal distribution has a $\chi^2$ prior. However, as one would expect, the resulting inferences can be sensitive to the choice of the prior distributions for these hyperparameters. For example, for the horseshoe prior there are some theoretical results stating that the prior placed on the hyperparameters should reflect the number of non-zero coefficients you expect to have.

A nice overview of the links between penalized regression and Bayesian priors is given, for example, by Mallick and Yi.

— Dimitris Rizopoulos
Thank you for your answer! The linked paper is quite readable, which is nice.

— Richard Hardy

This does not answer the question, can you elaborate to explain how does the hyper-prior relate to k-fold CV?

— statslearner2