RNN and Language modeling¶

Basic structure¶

An unrolled recurrent neural network¶

Inputs and outputs of RNNs (rolled version)¶

$h_{t-1}$: old hidden-state vector
$x_{t}$: input vector at some time step
$h_{t}$: new hidden-state vector
$f_{W}$: RNN function with parameters W
$y_{t}$: output vector at time step t (can be obtained through $h_{t}$)

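❗ Note that the same function and the same parameters are used at every time step.

Looking at this function in more detail, $h_{t} = f_{W}(h_{t-1}, x_{t})$ is computed as follows.

Assuming the input has dimension 3 and the hidden layer has dimension 2, the shapes of the weight matrices are determined: $W_{xh}$ is $2 \times 3$ and $W_{hh}$ is $2 \times 2$.

Passing the result $W_{hh}h_{t-1}+W_{xh}x_{t}$ through the tanh function gives $h_{t}$. A linear transform of $h_{t}$ then gives $y_{t}$.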
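The update described above can be sketched in a few lines of plain Python. Only the dimensions (input 3, hidden 2) come from the text; the weight values, the output dimension of 2, and the helper names `matvec` and `rnn_step` are assumptions made for illustration.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_hh, W_xh, W_hy, h_prev, x):
    """One RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x))]
    h = [math.tanh(p) for p in pre]
    y = matvec(W_hy, h)
    return h, y

# Assumed, illustrative weights: hidden dim 2, input dim 3.
W_hh = [[0.1, -0.2], [0.3, 0.4]]            # 2 x 2
W_xh = [[0.5, 0.0, -0.5], [0.2, 0.1, 0.0]]  # 2 x 3
W_hy = [[1.0, -1.0], [0.5, 0.5]]            # 2 x 2 (output dim is an assumption)

h = [0.0, 0.0]  # initial hidden state
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    # The same function and the same weights are reused at every time step.
    h, y = rnn_step(W_hh, W_xh, W_hy, h, x)
```

Because of the tanh, every component of `h` stays strictly between -1 and 1, whatever the input values are.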

Types of RNNs¶

1️⃣ One-to-one¶

Standard neural networks

2️⃣ One-to-many¶

Image Captioning

3️⃣ Many-to-one¶

Sentiment classification

4️⃣ Sequence-to-sequence¶

Machine translation

Video classification at the frame level
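The difference between these types is mostly a matter of where inputs are read and where outputs are emitted. A rough sketch with a toy scalar RNN follows; the weights, the function names, and the simplified one-to-many loop (which feeds zeros after the first input) are all assumptions for illustration.

```python
import math

def step(h_prev, x, w_hh=0.5, w_xh=1.0):
    """Toy scalar RNN step; the weights are assumed, illustrative values."""
    return math.tanh(w_hh * h_prev + w_xh * x)

def many_to_one(xs):
    """Read the whole sequence, emit a single output from the last state."""
    h = 0.0
    for x in xs:
        h = step(h, x)
    return h

def many_to_many(xs):
    """Emit one output per input time step (e.g. frame-level labels)."""
    h, outputs = 0.0, []
    for x in xs:
        h = step(h, x)
        outputs.append(h)
    return outputs

def one_to_many(x, n_steps):
    """Read a single input, then keep unrolling to emit a sequence."""
    h, outputs = step(0.0, x), []
    for _ in range(n_steps):
        outputs.append(h)
        h = step(h, 0.0)  # no new input after the first step (a simplification)
    return outputs
```

For example, `many_to_one([0.1, -0.2, 0.3])` returns the same value as the last element of `many_to_many([0.1, -0.2, 0.3])`, since both run the identical recurrence.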

Character-level Language Model¶

📌 Take the training data "hello" as an example.

1. Build the vocabulary¶

Vocabulary: [h, e, l, o]

h: [1, 0, 0, 0]
e: [0, 1, 0, 0]
l: [0, 0, 1, 0]
o: [0, 0, 0, 1]
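The vocabulary and one-hot vectors above can be built with a short sketch like the following. The helper names `build_vocab` and `one_hot` are hypothetical, chosen for this example.

```python
def build_vocab(text):
    """Return the unique characters in order of first appearance."""
    vocab = []
    for ch in text:
        if ch not in vocab:
            vocab.append(ch)
    return vocab

def one_hot(ch, vocab):
    """Encode a character as a one-hot list over the vocabulary."""
    return [1 if ch == v else 0 for v in vocab]

vocab = build_vocab("hello")  # ['h', 'e', 'l', 'o']
encoded = {ch: one_hot(ch, vocab) for ch in vocab}
```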

2. Feed in the input¶
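Assuming the usual next-character setup (as in the Karpathy post listed in the references), each input character is paired with the character that follows it as the target. A minimal sketch:

```python
text = "hello"
inputs = list(text[:-1])   # characters fed to the RNN, one per time step
targets = list(text[1:])   # the next character the model should predict
pairs = list(zip(inputs, targets))
```

So the model sees "h", "e", "l", "l" and is trained to predict "e", "l", "l", "o" respectively.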

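3. Compute the hidden layer¶

$$h_{t} = \tanh(W_{hh}h_{t-1}+W_{xh}x_{t}+b)$$

Assume the hidden layer has dimension 3, so that $W_{xh}$ is $3 \times 4$ and $W_{hh}$ is $3 \times 3$ for the 4-character vocabulary.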
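A small sketch of this step for the first character "h", with hidden dimension 3 and vocabulary size 4. All parameter values here are assumed for illustration. It also shows a useful detail: with a one-hot input, the product $W_{xh}x_{t}$ simply selects one column of $W_{xh}$.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Assumed, illustrative parameters: hidden dim 3, vocabulary size 4.
W_xh = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8],
        [0.9, 1.0, 1.1, 1.2]]   # 3 x 4
W_hh = [[0.0, 0.1, 0.0],
        [0.1, 0.0, 0.1],
        [0.0, 0.1, 0.0]]        # 3 x 3
b = [0.0, 0.0, 0.0]             # bias, length 3

x_h = [1, 0, 0, 0]              # one-hot vector for 'h'
h_prev = [0.0, 0.0, 0.0]        # initial hidden state

# With a one-hot input, W_xh x_t is the first column of W_xh.
selected = matvec(W_xh, x_h)
pre = [s + r + bi for s, r, bi in zip(selected, matvec(W_hh, h_prev), b)]
h_t = [math.tanh(p) for p in pre]
```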

4. Predict $y_{t}$ from $h_{t}$¶

$$\text{Logit} = W_{hy}h_{t}+b$$

Applying the $softmax$ function to the Logit values above gives $y_{t}$.
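The softmax step can be sketched as below. The logit values are assumed for illustration (they are not computed from trained weights), and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something the text prescribes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ['h', 'e', 'l', 'o']
logits = [1.0, 2.2, -3.0, 4.1]   # assumed values for one time step
probs = softmax(logits)          # a probability distribution over the vocabulary
predicted = vocab[probs.index(max(probs))]
```

With these particular logits the largest entry belongs to "o", so that is the predicted character; during training, the loss would instead compare `probs` against the one-hot target for that step.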

References

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Backpropagation through time (BPTT)¶

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf

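The forward pass runs through the entire sequence to compute the loss, and the backward pass then runs back through the entire sequence to compute the gradient.

Because computing over the entire sequence requires a lot of resources, the sequence is instead cut into chunks, and the forward and backward passes are computed chunk by chunk (truncated BPTT).

With this approach, however, the forward pass still carries the hidden state through all time steps, while the backward pass covers only a small number of time steps.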
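The chunking idea can be sketched as follows. This shows only the forward side and the chunk boundaries; the backward pass is indicated in a comment rather than implemented. The toy scalar step, its weights, and the chunk length of 4 are assumptions for illustration.

```python
import math

def step(h_prev, x, w_hh=0.5, w_xh=1.0):
    """Toy scalar RNN step; the weights are assumed, illustrative values."""
    return math.tanh(w_hh * h_prev + w_xh * x)

def truncated_chunks(xs, chunk_len):
    """Split a long sequence into consecutive chunks of at most chunk_len."""
    return [xs[i:i + chunk_len] for i in range(0, len(xs), chunk_len)]

xs = [0.1 * t for t in range(10)]   # a sequence of 10 time steps
h = 0.0
chunk_sizes = []
for chunk in truncated_chunks(xs, chunk_len=4):
    # Forward: h is carried over from the previous chunk, so the forward
    # computation still depends on the whole history.
    for x in chunk:
        h = step(h, x)
    # Backward (not implemented here): gradients would be propagated only
    # through the steps of this chunk, treating the incoming h as a constant.
    chunk_sizes.append(len(chunk))
```

The final `h` is identical to what a single pass over all 10 steps would produce; only the span of the gradient computation is shortened.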

Vanishing/Exploding¶

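Because backpropagation multiplies by the same matrix at every time step, the gradient can vanish or explode.

✅ For example,

In this example, computing the gradient multiplies by 3 over and over. The factor is raised to the power of the number of time steps, so the gradient grows exponentially.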
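The effect is easy to reproduce numerically. The sketch below assumes a simplified scalar linear recurrence $h_{t} = w\,h_{t-1}$ with no tanh, so that the gradient of $h_{T}$ with respect to $h_{0}$ is exactly $w^{T}$. The factor 3 comes from the example above; the factor 0.5 for the vanishing case is an added assumption.

```python
def gradient_through_time(w, n_steps):
    """d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= w   # the same factor is multiplied in at every time step
    return grad

exploding = gradient_through_time(3.0, 10)   # 3 ** 10 = 59049.0
vanishing = gradient_through_time(0.5, 10)   # 0.5 ** 10 = 0.0009765625
```

After only 10 steps the two gradients already differ by more than seven orders of magnitude, which is why long sequences are hard to train with a plain RNN.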


개발자CuCu