The answer to both 1 and 2 is "no," but some care is needed in interpreting the existence theorem.
Behavior of the risk of the ridge estimator
Let $\hat\beta^*$ be the ridge estimator under penalty $k$, and let $\beta$ be the true parameter for the model $Y = X\beta + \epsilon$. Let $\lambda_1, \dots, \lambda_p$ be the eigenvalues of $X^T X$.

From Hoerl & Kennard equations 4.2-4.5, the risk (in terms of expected $L^2$ error) is

$$E\left([\hat\beta^* - \beta]^T[\hat\beta^* - \beta]\right) = \sigma^2 \sum_{j=1}^p \lambda_j/(\lambda_j + k)^2 + k^2 \beta^T (X^T X + k I_p)^{-2} \beta = \gamma_1(k) + \gamma_2(k) = R(k),$$

where I write $(X^T X + k I_p)^{-2} = (X^T X + k I_p)^{-1} (X^T X + k I_p)^{-1}$. The term $\gamma_1$ has an interpretation as the variance of the inner product of $\hat\beta^* - \beta$, while $\gamma_2$ is the inner product of the bias.
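As a quick sanity check of this decomposition (my own illustrative design, $\beta$, $\sigma$, and $k$, not values from the paper), the closed-form $\gamma_1(k) + \gamma_2(k)$ can be compared against the risk estimated by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of R(k) = gamma1(k) + gamma2(k).
# The design, beta, sigma, and k are illustrative choices, not H&K's.
rng = np.random.default_rng(0)
n, p, sigma, k = 50, 3, 1.0, 0.5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])

XtX = X.T @ X
lam = np.linalg.eigvalsh(XtX)               # eigenvalues of X^T X
A_inv = np.linalg.inv(XtX + k * np.eye(p))  # (X^T X + k I_p)^{-1}

# Closed-form risk from the decomposition above
gamma1 = sigma**2 * np.sum(lam / (lam + k) ** 2)
gamma2 = k**2 * beta @ A_inv @ A_inv @ beta
risk_formula = gamma1 + gamma2

# Empirical risk over repeated draws of the noise
errs = []
for _ in range(10000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = A_inv @ X.T @ y   # ridge estimator with penalty k
    errs.append(np.sum((beta_hat - beta) ** 2))
risk_mc = np.mean(errs)

print(risk_formula, risk_mc)  # the two should agree up to Monte Carlo error
```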
Supposing that $X^T X = I_p$, this reduces to

$$R(k) = \frac{p\sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$

Differentiating,

$$R'(k) = \frac{2k(1+k)\beta^T\beta - 2(p\sigma^2 + k^2 \beta^T\beta)}{(1+k)^3} = \frac{2(k\,\beta^T\beta - p\sigma^2)}{(1+k)^3}.$$
Since $\lim_{k \to 0^+} R'(k) = -2p\sigma^2 < 0$, we conclude that there is some $k^* > 0$ such that $R(k^*) < R(0)$.
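In the orthonormal case the minimizer can be written down: setting $R'(k) = 0$ gives $k^* = p\sigma^2 / \beta^T\beta$. A short numeric check (with arbitrary illustrative values of $p$, $\sigma^2$, and $\beta^T\beta$) confirms both the location of the minimum and that $R(k^*) < R(0)$:

```python
import numpy as np

# Orthonormal-case risk: R(k) = (p*sigma^2 + k^2 * btb) / (1+k)^2.
# R'(k) = 2*(k*btb - p*sigma^2)/(1+k)^3 = 0  =>  k* = p*sigma^2 / btb.
# p, sigma^2, and beta^T beta below are arbitrary illustrations.
p, sigma2, btb = 5, 1.0, 10.0

def R(k):
    return (p * sigma2 + k**2 * btb) / (1 + k) ** 2

k_star = p * sigma2 / btb          # analytic minimizer
ks = np.linspace(0, 2, 2001)
k_numeric = ks[np.argmin(R(ks))]   # grid-search minimizer

print(R(0.0), R(k_star), k_star, k_numeric)
```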
The authors remark that orthogonality is the best that you can hope for in terms of the risk at $k = 0$, and that as the condition number of $X^T X$ increases, $\lim_{k \to 0^+} R'(k)$ approaches $-\infty$.
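This last remark follows from the general decomposition: since the derivative of $\gamma_2$ vanishes at $k = 0$, we get $R'(0^+) = -2\sigma^2 \sum_j 1/\lambda_j^2$, which blows up as the smallest eigenvalue shrinks. A tiny demonstration (with made-up eigenvalue spectra):

```python
import numpy as np

# R'(0+) = -2*sigma^2 * sum(1/lambda_j^2): as the design becomes
# ill-conditioned (smallest eigenvalue -> 0), the initial slope of the
# risk diverges to -infinity. Eigenvalue spectra here are illustrative.
sigma2 = 1.0
for lam_min in [1.0, 0.1, 0.01]:
    lam = np.array([1.0, 1.0, lam_min])  # eigenvalues of X^T X
    dR0 = -2 * sigma2 * np.sum(1.0 / lam**2)
    print(lam_min, dR0)
```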
Comment
There appears to be a paradox here, in that if $p = 1$ and $X$ is constant, then we are just estimating the mean of a sequence of $\text{Normal}(\beta, \sigma^2)$ variables, and we know that the vanilla unbiased estimate is admissible in this case. This is resolved by noticing that the above reasoning merely provides that a minimizing value of $k$ exists for fixed $\beta^T\beta$. But for any fixed $k$, we can make the risk explode by making $\beta^T\beta$ large, so this argument alone does not show admissibility for the ridge estimate.
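The resolution is easy to see numerically in the orthonormal case (my illustration, with arbitrary $p$, $\sigma^2$, and $k$): holding $k$ fixed, the ridge risk eventually exceeds the OLS risk $p\sigma^2$ as $\beta^T\beta$ grows.

```python
# Orthonormal case: for fixed k, ridge risk exceeds the OLS risk p*sigma^2
# once beta^T beta is large enough -- the good choice of k depends on beta.
p, sigma2, k = 5, 1.0, 0.5

def R(k, btb):
    return (p * sigma2 + k**2 * btb) / (1 + k) ** 2

R_ols = R(0, 0.0)  # = p*sigma2, independent of beta
for btb in [1.0, 10.0, 100.0]:
    print(btb, R(k, btb), R(k, btb) < R_ols)
```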
Why is ridge regression usually recommended only in the case of correlated predictors?
H&K's risk derivation shows that if we think that $\beta^T\beta$ is small, and if the design $X^T X$ is nearly singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default, and its invariance and unbiasedness properties are attractive. When it fails, it fails honestly--your covariance matrix explodes. There is also perhaps a philosophical/inferential point: if your design is nearly singular, and you have observational data, then the interpretation of $\beta$ as giving changes in $E[Y]$ for unit changes in $X$ is suspect--the large covariance matrix is a symptom of that.
But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.
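A small simulation makes the prediction point concrete (my own setup, not from H&K: an equicorrelated design with correlation 0.95 and an arbitrary fixed $\beta$ and penalty $k$). With highly correlated predictors, ridge typically beats OLS on held-out prediction error:

```python
import numpy as np

# Illustrative simulation: highly correlated predictors, modest beta.
# rho, k, and beta are arbitrary choices for demonstration.
rng = np.random.default_rng(1)
n, p, rho, sigma, k = 40, 10, 0.95, 1.0, 5.0
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)  # equicorrelated design
beta = np.full(p, 0.5)

ols_err, ridge_err = [], []
for _ in range(300):
    # Training data
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + rng.normal(scale=sigma, size=n)
    # Held-out test data
    X_test = rng.multivariate_normal(np.zeros(p), cov, size=200)
    y_test = X_test @ beta + rng.normal(scale=sigma, size=200)
    # OLS vs ridge fits
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
    ols_err.append(np.mean((y_test - X_test @ b_ols) ** 2))
    ridge_err.append(np.mean((y_test - X_test @ b_ridge) ** 2))

print(np.mean(ols_err), np.mean(ridge_err))
```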