I haven't seen an answer from a trusted source, but I'll try to answer this myself, with a simple example (based on what I currently know).
In general, note that training an MLP using back-propagation is usually implemented with matrices.
Time complexity of matrix multiplication
The time complexity of the matrix multiplication $M_{ij} * M_{jk}$ is simply $O(i*j*k)$.
Notice that we are assuming the simplest multiplication algorithm here: there exist other algorithms with somewhat better time complexity.
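To make the $O(i*j*k)$ bound concrete, here is a minimal sketch of the naive triple-loop multiplication (my own illustration, using NumPy arrays; the function name is made up):

```python
import numpy as np

def naive_matmul(A, B):
    """Multiply an (i x j) matrix A by a (j x k) matrix B with three nested loops."""
    i, j = A.shape
    j2, k = B.shape
    assert j == j2, "inner dimensions must match"
    C = np.zeros((i, k))
    for a in range(i):          # i iterations
        for b in range(k):      # k iterations
            for c in range(j):  # j iterations -> i*j*k multiply-adds in total
                C[a, b] += A[a, c] * B[c, b]
    return C
```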
Feedforward pass algorithm
The feedforward propagation algorithm proceeds as follows.
First, to go from layer $i$ to layer $j$, you compute
$$S_j = W_{ji} * Z_i$$
Then you apply the activation function
$$Z_j = f(S_j)$$
If we have $N$ layers (including the input and the output layer), this will run $N-1$ times.
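As a minimal sketch of this loop (my own illustration, assuming NumPy and a sigmoid activation for $f$):

```python
import numpy as np

def sigmoid(S):
    return 1.0 / (1.0 + np.exp(-S))

def forward(Z_input, weights, f=sigmoid):
    """weights is a list of N-1 matrices; weights[n] maps layer n to layer n+1."""
    Z = Z_input
    for W in weights:      # runs N-1 times
        S = W @ Z          # S_j = W_ji * Z_i
        Z = f(S)           # Z_j = f(S_j)
    return Z
```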
Example
As an example, let's compute the time complexity for the forward pass algorithm for an MLP with 4 layers, where $i$ denotes the number of nodes of the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer and $l$ the number of nodes in the output layer.
Since there are 4 layers, you need 3 matrices to represent the weights between them. Let's denote them by $W_{ji}$, $W_{kj}$ and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).
Assume you have $t$ training examples. For propagating from layer $i$ to $j$, we first have
$$S_{jt} = W_{ji} * Z_{it}$$
and this operation (i.e. matrix multiplication) has $O(j*i*t)$ time complexity. Then we apply the activation function
$$Z_{jt} = f(S_{jt})$$
and this has $O(j*t)$ time complexity, because it is an element-wise operation.
So, in total, we have
$$O(j*i*t + j*t) = O(j*t*(i+1)) = O(j*i*t)$$
Using the same logic, for going $j \to k$, we have $O(k*j*t)$, and, for $k \to l$, we have $O(l*k*t)$.
In total, the time complexity for feedforward propagation will be
$$O(j*i*t + k*j*t + l*k*t) = O(t*(ij + jk + kl))$$
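Here is a small sketch of this 4-layer forward pass with a batch of $t$ examples (my own illustration; the layer sizes and the choice of $f = \tanh$ are made up). The dominant cost is the three matrix products, i.e. roughly $t*(ij + jk + kl)$ multiply-adds:

```python
import numpy as np

i, j, k, l, t = 784, 128, 64, 10, 32   # made-up layer sizes and batch size

W_ji = np.random.randn(j, i)
W_kj = np.random.randn(k, j)
W_lk = np.random.randn(l, k)
Z_it = np.random.randn(i, t)           # one column per training example

f = np.tanh                            # any element-wise activation

Z_jt = f(W_ji @ Z_it)   # O(j*i*t) for the product, O(j*t) for f
Z_kt = f(W_kj @ Z_jt)   # O(k*j*t)
Z_lt = f(W_lk @ Z_kt)   # O(l*k*t)

mult_adds = t * (i*j + j*k + k*l)
print("multiply-adds in the three products:", mult_adds)
```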
I'm not sure whether this can be simplified further. $O(t*i*j*k*l)$ would also be a valid upper bound, but it is much looser, so I'll keep the sum form.
Back-propagation algorithm
The back-propagation algorithm proceeds as follows. Starting from the output layer, $l \to k$, we compute the error signal $E_{lt}$, a matrix containing the error signals for the nodes at layer $l$:
$$E_{lt} = f'(S_{lt}) \odot (Z_{lt} - O_{lt})$$
where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns: each column is simply the error signal for one training example.
We then compute the "delta weights", $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$):
$$D_{lk} = E_{lt} * Z_{tk}$$
where $Z_{tk}$ is the transpose of $Z_{kt}$.
We then adjust the weights:
$$W_{lk} = W_{lk} - D_{lk}$$
So, for $l \to k$, we have the time complexity $O(lt + lt + ltk + lk) = O(l*t*k)$.
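A self-contained sketch of this $l \to k$ step (my own illustration, assuming $f = \tanh$ so that $f'(S) = 1 - Z^2$, and a made-up target matrix $O_{lt}$):

```python
import numpy as np

l, k, t = 10, 64, 32                       # made-up sizes (output layer, third layer, examples)
W_lk = np.random.randn(l, k)
Z_kt = np.random.randn(k, t)               # activations of layer k from the forward pass
Z_lt = np.tanh(W_lk @ Z_kt)                # output activations, f = tanh
O_lt = np.random.randn(l, t)               # made-up targets, one column per example

# error signal: E_lt = f'(S_lt) ⊙ (Z_lt - O_lt); for tanh, f'(S) = 1 - Z^2  -> O(l*t)
E_lt = (1.0 - Z_lt**2) * (Z_lt - O_lt)

# delta weights: D_lk = E_lt * Z_tk, with Z_tk the transpose of Z_kt  -> O(l*t*k), dominant term
D_lk = E_lt @ Z_kt.T

# weight update  -> O(l*k)
W_lk = W_lk - D_lk
```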
Now, going back from $k \to j$, we first have
$$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$$
where $W_{kl}$ is the transpose of $W_{lk}$.
Then
$$D_{kj} = E_{kt} * Z_{tj}$$
And then
$$W_{kj} = W_{kj} - D_{kj}$$
For $k \to j$, we have the time complexity $O(kt + klt + ktj + kj) = O(k*t*(l+j))$.
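A matching sketch of the $k \to j$ step (again my own illustration with made-up sizes); the dominant products are $W_{kl} * E_{lt}$, costing $O(k*l*t)$, and $E_{kt} * Z_{tj}$, costing $O(k*t*j)$:

```python
import numpy as np

j, k, l, t = 128, 64, 10, 32               # made-up sizes
W_lk = np.random.randn(l, k)
W_kj = np.random.randn(k, j)
Z_jt = np.random.randn(j, t)               # activations of layer j from the forward pass
Z_kt = np.tanh(W_kj @ Z_jt)                # activations of layer k, f = tanh
E_lt = np.random.randn(l, t)               # error signal already computed at the output layer

# E_kt = f'(S_kt) ⊙ (W_kl * E_lt), with W_kl the transpose of W_lk
E_kt = (1.0 - Z_kt**2) * (W_lk.T @ E_lt)   # O(k*l*t) product + O(k*t) element-wise work

# D_kj = E_kt * Z_tj  -> O(k*t*j)
D_kj = E_kt @ Z_jt.T

# weight update  -> O(k*j)
W_kj = W_kj - D_kj
```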
And finally, for $j \to i$, we have $O(j*t*(k+i))$. In total, we have
$$O(ltk + tk*(l+j) + tj*(k+i)) = O(t*(lk + kj + ji))$$
which is the same as for the feedforward pass algorithm. Since they are the same, the total time complexity for one epoch will be $O(t*(ij + jk + kl))$.
This time complexity is then multiplied by the number of iterations (epochs). So, we have $O(n*t*(ij + jk + kl))$, where $n$ is the number of iterations.
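For a feel of the magnitude, here is a back-of-the-envelope calculation with made-up numbers (not from the question):

```python
# made-up sizes: layers i, j, k, l; t training examples; n epochs
i, j, k, l = 784, 128, 64, 10
t, n = 60_000, 100

ops_per_epoch = t * (i*j + j*k + k*l)   # per-epoch cost, t*(ij + jk + kl)
total_ops = n * ops_per_epoch           # O(n*t*(ij + jk + kl))
print(f"roughly {float(total_ops):.2e} multiply-adds")  # ≈ 6.6e+11 for these numbers
```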
Notes
Note that these matrix operations can be greatly parallelized by GPUs.
Conclusion
We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$, $j$, $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $O(n*t*(ij + jk + kl))$.
We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: note that batch gradient descent is the general form; with little modification, it becomes stochastic or mini-batch gradient descent.)
Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise operations, and hence they will not affect the time complexity of the algorithm.
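For example, the classical momentum update only adds element-wise work per weight matrix (a sketch in my own notation, with made-up learning rate and momentum values):

```python
import numpy as np

l, k = 10, 64
W_lk = np.random.randn(l, k)
D_lk = np.random.randn(l, k)      # gradient ("delta weights") from back-propagation
V_lk = np.zeros((l, k))           # velocity, same shape as the weights
eta, mu = 0.01, 0.9               # made-up learning rate and momentum

# both updates are element-wise, so they are O(l*k) and do not change the overall complexity
V_lk = mu * V_lk - eta * D_lk
W_lk = W_lk + V_lk
```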
I'm not sure what the results would be using other optimizers such as RMSprop.
Sources
The following article, http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5, describes an implementation using matrices. Although this implementation uses a "row major" layout, the time complexity is not affected by this.
If you're not familiar with back-propagation, check this article:
http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4