What theories should every statistician know?

30

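I'm thinking of this from a very basic, minimal-requirements perspective. What are the key theories that an industry (not academic) statistician should know, understand, and use on a regular basis?

The big one that comes to mind is the law of large numbers. What is most essential for applying statistical theory to data analysis?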

theory careers law-of-large-numbers

— bnjmn

41

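Frankly, I don't think the law of large numbers plays a big role in industry. It helps to understand the asymptotic justification of common procedures, such as maximum likelihood estimation and testing (including the all-important GLMs and logistic regression) and the bootstrap, but these are distributional issues rather than issues about the probability of hitting a bad sample.

Beyond the topics already mentioned (GLMs, inference, the bootstrap), the most common statistical model is linear regression, so a thorough understanding of the linear model is a must. You may never run an ANOVA in your industry life, but if you don't understand it, you should not be called a statistician.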

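There are different kinds of industry. In pharma, you cannot make a living without randomized trials and logistic regression. In survey statistics, you cannot make a living without the Horvitz-Thompson estimator and non-response adjustments. In computer-science-related statistics, you cannot make a living without statistical learning and data mining. In public policy think tanks (and, increasingly, education statistics), you cannot make a living without causality and treatment effect estimators (which involve randomized trials). In marketing research, you need a mix of economics background and psychometric measurement theory (and you can learn neither of them from typical statistics department offerings). Industrial statistics operates with its own peculiar six sigma paradigms, which are only remotely connected to mainstream statistics; a stronger bond can be found in design of experiments material. Wall Street material would be financial econometrics, all the way up to stochastic calculus. These are very disparate skills, and the term "industry" is even more poorly defined than "academia". I don't think anybody can claim to know more than two or three of the above at the same time.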

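The top skills that are universally required in "industry", however, are time management, project management, and communication with clients who are less statistically savvy. So if you want to prepare yourself for an industry placement, take business school classes on these topics.

Update: the original post was written in February 2012. These days (March 2014), to find a hot job in industry you should probably call yourself a "data scientist" rather than a "statistician", and you had better learn some Hadoop to back up that self-proclamation.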

— revised

1

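Great answer. Thanks for highlighting some of the big differences among statisticians in industry. This motivates my question, because I believe many people have different ideas of what a statistician is. I guess I was trying to find where all of these intersect at the level of basic understanding. I also really appreciate your last paragraph on business topics and how essential they are. Great points, but I would still like to see whether anyone can add to the conversation before accepting.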

— bnjmn

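I'm puzzled by these "peculiar six sigma paradigms", "only remotely connected to mainstream statistics", under which you say industrial statistics operates. It seems entirely orthodox to me, apart from the terminological differences found between all these sub-fields.

— Scortchi - Reinstate Monica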

4

@Scortchi, frankly, I could not get past these terminological differences. I also know that the normal approximation is of little use in the tails, so the six sigma probabilities of $10^{-9}$ may be off by a factor of 100 or 1000.
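To make the tail-sensitivity point concrete, here is a minimal sketch comparing the probability of exceeding six standard deviations under a normal distribution and under a heavier-tailed alternative with the same mean and variance. The logistic distribution is purely an illustrative assumption, not something named in the comment; the size of the discrepancy depends entirely on which alternative one picks.

```python
import math


def normal_upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def logistic_upper_tail(z: float) -> float:
    """P(X > z) for a logistic variable rescaled to mean 0 and variance 1."""
    # A logistic distribution with scale s has variance (pi * s)^2 / 3,
    # so unit variance requires s = sqrt(3) / pi.
    scale = math.sqrt(3.0) / math.pi
    return 1.0 / (1.0 + math.exp(z / scale))


# One-sided tail probabilities at six standard deviations.
p_normal = normal_upper_tail(6.0)
p_logistic = logistic_upper_tail(6.0)  # illustrative heavier-tailed alternative
ratio = p_logistic / p_normal

print(f"normal   P(Z > 6) = {p_normal:.3e}")
print(f"logistic P(X > 6) = {p_logistic:.3e}")
print(f"ratio             = {ratio:.0f}")
```

Both distributions share the same first two moments, so the gap comes only from the shape of the tail. That is exactly the part a normal approximation cannot see.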

— StasK

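Fair enough: measurement system analysis (inter-rater agreement, gauge repeatability and reproducibility studies), statistical process control, reliability analysis (a.k.a. survival analysis), and experimental design ((fractional) factorial designs, response surface methodology) are characteristic of industrial statistics.

— Scortchi - Reinstate Monica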

12

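I would say a good understanding of the issues involved in the bias-variance trade-off. Most statisticians will, at some point, end up analysing a dataset small enough that the variance of an estimator, or of a model's parameters, is high enough for bias to become a secondary consideration.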
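As a rough illustration of that trade-off, the sketch below compares the mean squared error of the ordinary sample mean with that of a deliberately biased estimator that shrinks the mean towards zero. The true mean, standard deviation, sample size, and shrinkage factor are all hypothetical values chosen for the example.

```python
import random

# Hypothetical settings for the illustration (not from the original post).
MU, SIGMA, N, SHRINK = 0.5, 2.0, 5, 0.5


def simulate_mse(mu=MU, sigma=SIGMA, n=N, shrink=SHRINK, reps=20000, seed=0):
    """Monte Carlo MSE of the sample mean and of the shrunken mean, shrink * mean."""
    rng = random.Random(seed)
    sq_err_plain = 0.0
    sq_err_shrunk = 0.0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        sq_err_plain += (xbar - mu) ** 2
        sq_err_shrunk += (shrink * xbar - mu) ** 2
    return sq_err_plain / reps, sq_err_shrunk / reps


mse_plain, mse_shrunk = simulate_mse()

# Closed-form values for comparison, using MSE = bias^2 + variance.
mse_plain_theory = SIGMA ** 2 / N  # unbiased, so MSE equals the variance
mse_shrunk_theory = (SHRINK * MU - MU) ** 2 + SHRINK ** 2 * SIGMA ** 2 / N

print(f"sample mean : simulated {mse_plain:.3f}, theory {mse_plain_theory:.3f}")
print(f"shrunk mean : simulated {mse_shrunk:.3f}, theory {mse_shrunk_theory:.3f}")
```

With only five noisy observations the variance term dominates, so accepting some bias lowers the overall error. With a much larger sample the ordering reverses, because the shrunken estimator's bias does not vanish as the sample grows.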

— Dikran Marsupial

11

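Just to point out the very obvious:

Central Limit Theorem

since it lets practitioners approximate $p$-values in many situations where exact $p$-values are intractable. Along the same lines, a successful practitioner would generally do well to be familiar with

Bootstrapping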
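For readers who have not used it, a percentile bootstrap can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data are simulated from an exponential distribution, the statistic is the median, and `bootstrap_percentile_ci` is a hypothetical helper name rather than a library function.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic, and sort the results.
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = round((alpha / 2) * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    return boot_stats[lo_idx], boot_stats[hi_idx]


# Simulated, skewed sample (an assumption made for the example).
data_rng = random.Random(42)
sample = [data_rng.expovariate(1.0) for _ in range(50)]

sample_median = statistics.median(sample)
ci_low, ci_high = bootstrap_percentile_ci(sample, statistics.median)

print(f"sample median = {sample_median:.3f}")
print(f"95% percentile bootstrap CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The percentile interval is the simplest variant. Other bootstrap intervals exist and can behave better in small samples, so treat this as a starting point only.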

— Macro

8

I wouldn't say this is very similar to something like the law of large numbers or the central limit theorem, but because making inferences about causality is often central, understanding Judea Pearl's work on using structured graphs to model causality is something people should be familiar with. It provides a way to understand why experimental and observational studies differ with respect to the causal inferences they afford, and offers ways to deal with observational data. For a good overview, his book is here.
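The difference between experimental and observational data can be seen in a small simulation. The sketch below is not an implementation of Pearl's graphical methods; it only assumes a simple structure in which a confounder U affects both treatment and outcome, with a hypothetical true effect of 1.0, and compares a naive difference in means under self-selection and under randomization.

```python
import random
import statistics


def diff_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    y1 = [y for y, t in zip(outcomes, treated) if t]
    y0 = [y for y, t in zip(outcomes, treated) if not t]
    return statistics.fmean(y1) - statistics.fmean(y0)


def simulate(randomized, n=20000, effect=1.0, seed=0):
    """Naive treatment-effect estimate under a U -> T, U -> Y, T -> Y structure."""
    rng = random.Random(seed)
    outcomes, treated = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # confounder U
        if randomized:
            t = rng.random() < 0.5  # coin flip, independent of U
        else:
            t = (u + rng.gauss(0.0, 1.0)) > 0.0  # U pushes units into treatment
        y = effect * t + 2.0 * u + rng.gauss(0.0, 1.0)  # U also drives the outcome
        outcomes.append(y)
        treated.append(t)
    return diff_in_means(outcomes, treated)


observational_estimate = simulate(randomized=False)
randomized_estimate = simulate(randomized=True)

print(f"true effect            = 1.0")
print(f"observational estimate = {observational_estimate:.2f}")
print(f"randomized estimate    = {randomized_estimate:.2f}")
```

Randomization cuts the arrow from U to treatment, so the naive comparison recovers the effect. In the observational arm the same comparison also picks up the difference in U between the two groups.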

— gung - Reinstate Monica

2

There's also Rubin's counterfactuals framework; there are also structural equation modeling and econometric instrumental variable techniques... some of that is described in Mostly Harmless Econometrics, which is one of the best statistics books written by non-statisticians.

— StasK

7

A solid understanding of the substantive problem to be addressed is as important as any particular statistical approach. A good scientist in the industry is more likely than a statistician without such knowledge to come to a reasonable solution to their problem. A statistician with substantive knowledge can help.

— Brett

6

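The delta method: how to compute the variance of bizarre statistics and find their asymptotic relative efficiency, in order to recommend changes of variable and explain efficiency gains from "estimating the right thing". Along with that, Jensen's inequality helps in understanding GLMs and the strange kinds of bias that arise in transformations like those above. And now that bias and variance have been mentioned, the concept of the bias-variance trade-off and MSE as an objective measure of predictive accuracy.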
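A small numerical check may help fix both ideas. The sketch below assumes exponential data with a hypothetical mean of 3.0 and looks at the log of the sample mean. The delta method approximates its variance by $g'(\mu)^2 \operatorname{Var}(\bar{X})$ with $g = \log$, and Jensen's inequality says that its expectation falls below $\log \mu$.

```python
import math
import random
import statistics


def simulate_log_means(mu=3.0, n=50, reps=20000, seed=1):
    """Log of the sample mean for `reps` samples of size n from Exponential(mean=mu)."""
    rng = random.Random(seed)
    return [
        math.log(sum(rng.expovariate(1.0 / mu) for _ in range(n)) / n)
        for _ in range(reps)
    ]


mu, n = 3.0, 50  # hypothetical values chosen for the illustration
log_means = simulate_log_means(mu=mu, n=n)

# Delta method: Var[g(Xbar)] is approximately g'(mu)^2 * Var[Xbar].
# Here g = log, so g'(mu) = 1 / mu, and Var[Xbar] = mu^2 / n for exponential data.
delta_var = (1.0 / mu) ** 2 * (mu ** 2 / n)
empirical_var = statistics.variance(log_means)

# Jensen: log is concave, so E[log(Xbar)] <= log(E[Xbar]) = log(mu).
mean_log = statistics.fmean(log_means)

print(f"delta-method variance = {delta_var:.5f}")
print(f"simulated variance    = {empirical_var:.5f}")
print(f"E[log(Xbar)] = {mean_log:.4f}  vs  log(mu) = {math.log(mu):.4f}")
```

The downward shift in the mean of the transformed estimator is the kind of transformation bias the answer refers to. It shrinks as the sample size grows, in line with the first-order delta-method approximation.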

— AdamO

6

In my view, statistical inference is the most important topic for a practitioner. Inference has two parts: (1) estimation and (2) hypothesis testing. Hypothesis testing is the important one, since estimation mostly follows a single procedure, maximum likelihood estimation, which is available in most statistical packages (so there is no confusion).

Practitioners' frequent questions are about significance testing of differences or about causation analysis. Important hypothesis tests can be found at this link.

Knowing about linear models, GLMs, or statistical modelling in general is required for interpreting causation. I assume the future of data analysis includes Bayesian inference.

— vinux

0

Causal inference is a must, along with how to address its fundamental problem: you can't go back in time and not give someone a treatment. Read articles about Rubin, Fisher (the founder of modern statistics), and Student. What to learn to address this problem: proper randomisation, and how the law of large numbers says things are properly randomised; hypothesis testing; potential outcomes (which holds up against the heteroscedasticity assumption and is great with missingness); matching (great for missingness, but potential outcomes is better because it's more generalised; why learn a ton of complicated things when you can learn just one complicated thing?); the bootstrap; Bayesian statistics, of course (Bayesian regression, naïve Bayesian regression, Bayes factors); and nonparametric alternatives.

Normally, in practice, just follow these general steps:

Regarding a previous comment, you should generally start with an ANOVA (random effects or fixed effects, transforming continuous variables into bins), then use a regression (which, if you transform and alter it, can sometimes be as good as an ANOVA but never beat it) to see which specific treatments are significant, as opposed to doing multiple t-tests with a correction such as the Holm method.

In cases where you have to predict things, use Bayesian regression.

With missingness above 5%, use potential outcomes.

Another branch of data analytics that must be mentioned is supervised machine learning.

— Kheagan Eckley