Advanced Self-Supervised Pre-training Models

✅ GPT-2

Essentially just a larger transformer LM (language model) than GPT-1
Trained on 40GB of text
- Uses higher-quality data
The language model can perform downstream tasks in a zero-shot setting.

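🎓 Method

The Natural Language Decathlon: multitask learning as question answering
Training data was scraped from all outbound links on Reddit, a social media platform
Scraped web pages that were curated/filtered by humans
Only links that received at least 3 karma were used
About 8M documents, with Wikipedia documents removed
Dragnet and Newspaper were used to extract content from the links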

💻 Preprocessing

Byte pair encoding (BPE)
Minimal fragmentation of words across multiple vocab tokens
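
To make the BPE idea concrete, here is a minimal sketch of how merge rules can be learned from word frequencies. It is only an illustration: it works on characters rather than bytes (GPT-2 itself uses a byte-level variant), and the toy corpus and the function names `get_pair_counts`, `merge_pair`, and `learn_bpe` are assumptions made up for this example.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy corpus (hypothetical): word -> frequency
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
```

In this toy corpus the frequent suffix "est" ends up as a single symbol, so "newest" and "widest" are split into fewer vocabulary tokens, which is the "minimal fragmentation" goal mentioned above.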

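🔨 Modifications

Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network.
An additional layer normalization was added after the final self-attention block.
At initialization, the weights of the residual layers are scaled by ${1 \over \sqrt{n}}$, where n is the number of residual layers.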
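
The three modifications can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not GPT-2's actual code: the sub-block is a hypothetical single linear map (`make_linear`) instead of attention or an MLP, layer normalization has no learned gain or bias, and the base standard deviation of 0.02 is an assumed value.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x, sublayer):
    """Pre-LN residual: normalize the sub-block input, then add the skip path."""
    return [xi + yi for xi, yi in zip(x, sublayer(layer_norm(x)))]


def make_linear(dim, n_residual_layers, rng):
    """Hypothetical linear sub-block; initial weights are scaled by 1/sqrt(n)."""
    scale = 1.0 / math.sqrt(n_residual_layers)
    w = [[rng.gauss(0.0, 0.02) * scale for _ in range(dim)] for _ in range(dim)]
    return lambda h: [sum(wij * hj for wij, hj in zip(row, h)) for row in w]


rng = random.Random(0)
dim, n_layers = 4, 12
x = [1.0, 2.0, 3.0, 4.0]
for _ in range(n_layers):
    x = pre_ln_block(x, make_linear(dim, n_layers, rng))
x = layer_norm(x)  # the additional layer normalization after the final block
print(x)
```

Because normalization happens on the sub-block input, the skip path carries the unnormalized signal straight through, and the 1/sqrt(n) scaling keeps the sum of n residual contributions from growing with depth at initialization.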

✅ GPT-3

A model even larger than GPT-2
Language models are few-shot learners.
An autoregressive language model with 175 billion parameters
96 attention layers and a batch size of 3.2M tokens

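Prompt: the prefix given to the model
Zero-shot: predict the answer from a natural-language description of the task alone
One-shot: predict with a single example of the task in addition to the task description
Few-shot: predict after seeing a few examples of the task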
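
The difference between the three settings is only in what goes into the prompt; no weights are updated. A small sketch, assuming a hypothetical helper `build_prompt` and an illustrative English-to-French translation task with a `=>` separator:

```python
def build_prompt(task_description, examples, query):
    """Join a task description, zero or more demonstrations, and the query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)


task = "Translate English to French:"
demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe peluche"),
]

zero_shot = build_prompt(task, [], "cheese")        # description only
one_shot = build_prompt(task, demos[:1], "cheese")  # description + 1 example
few_shot = build_prompt(task, demos, "cheese")      # description + a few examples

print(few_shot)
```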

Zero-shot performance improves steadily with model size.
Few-shot performance improves even faster.

✅ ALBERT

Larger models tend to perform better, but scaling up brings problems.

Problems

Memory limitations
Training speed

Solutions

Factorized Embedding Parameterization
Cross-layer Parameter Sharing
(For Performance) Sentence Order Prediction

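1️⃣ Factorized Embedding Parameterization

Adding one more layer yields the ALBERT structure in the figure above: the V x H embedding matrix is split into a V x E embedding followed by an E x H projection. This reduces the number of parameters.

✅ As a worked example, let V = 500, H = 100, and E = 15 (vocabulary size, hidden size, and embedding size):

BERT: 500 x 100 = 50,000
ALBERT: 500 x 15 + 15 x 100 = 9,000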
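
The same arithmetic as a tiny script, in case you want to try other sizes. The function names are made up for this example, and bias terms are ignored.

```python
def bert_embedding_params(vocab_size, hidden_size):
    """One V x H embedding matrix."""
    return vocab_size * hidden_size


def albert_embedding_params(vocab_size, embedding_size, hidden_size):
    """A V x E embedding followed by an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 500, 100, 15
print(bert_embedding_params(V, H))       # 50000
print(albert_embedding_params(V, E, H))  # 9000
```

The saving grows with the vocabulary size, since the V x H term is replaced by the much smaller V x E term whenever E is well below H.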

2️⃣ Cross-layer Parameter Sharing

Shared-FFN: Only sharing feed-forward network parameters across layers
Shared-attention: Only sharing attention parameters across layers
All-shared: Both of them
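
To get a feel for how much each sharing strategy saves, here is a rough parameter count. It is a simplified sketch: it assumes each layer has four H x H attention projections and a two-layer feed-forward network, ignores biases and layer normalization, and uses made-up sizes (12 layers, H = 100, feed-forward width 400).

```python
def encoder_params(num_layers, hidden, ffn, share_attention=False, share_ffn=False):
    """Approximate encoder weight count under a given sharing strategy."""
    attention = 4 * hidden * hidden  # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn  # hidden -> ffn -> hidden
    attention_copies = 1 if share_attention else num_layers
    ffn_copies = 1 if share_ffn else num_layers
    return attention_copies * attention + ffn_copies * feed_forward


L, H, F = 12, 100, 400
print(encoder_params(L, H, F))                                        # 1440000 (not shared)
print(encoder_params(L, H, F, share_ffn=True))                        # 560000  (shared-FFN)
print(encoder_params(L, H, F, share_attention=True))                  # 1000000 (shared-attention)
print(encoder_params(L, H, F, share_attention=True, share_ffn=True))  # 120000  (all-shared)
```

With all-shared, the count no longer depends on the number of layers: the same weights are simply applied at every layer.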

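3️⃣ Sentence Order Prediction

The original next sentence prediction (NSP) task is not very effective.
Negative samples are created by swapping the order of two consecutive sentences.
As a result, the model learns to understand context rather than relying on word overlap.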
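
A minimal sketch of how such training pairs could be built. The function name `make_sop_examples` and the sample sentences are assumptions for illustration, and for simplicity this version emits both the positive and the swapped negative for every consecutive pair.

```python
def make_sop_examples(sentences):
    """Build (first, second, label) triples from consecutive sentences.

    label 1: original order (positive)
    label 0: same two sentences with the order swapped (negative)
    """
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        examples.append((first, second, 1))
        examples.append((second, first, 0))
    return examples


document = [
    "I went to the store.",
    "I bought some milk.",
    "Then I walked home.",
]
for example in make_sop_examples(document):
    print(example)
```

Both members of a pair contain exactly the same words, so word overlap cannot separate the positive from the negative; only the order carries the signal.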

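✅ ELECTRA

Efficiently Learning an Encoder that Classifies Token Replacements Accurately
- Learns to distinguish real input tokens from plausible but synthetically generated replacements.
- The text encoder is pre-trained as a discriminator rather than a generator.
- The discriminator is the main network for pre-training.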
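
The training signal for the discriminator can be sketched as below. This is only a toy: `toy_generator` is a hypothetical stand-in that samples from a fixed word list, whereas the real generator is a small masked language model, and the example sentence is illustrative.

```python
import random


def corrupt(tokens, mask_positions, generator, rng):
    """Replace the chosen positions with generator samples and label every token.

    Label 1 means "replaced", 0 means "original". If the generator happens to
    produce the original token, that position still counts as original.
    """
    corrupted = list(tokens)
    for i in mask_positions:
        corrupted[i] = generator(tokens, i, rng)
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


def toy_generator(tokens, position, rng):
    """Hypothetical generator: picks a plausible-looking word at random."""
    return rng.choice(["cooked", "ate", "the", "a"])


rng = random.Random(0)
tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = corrupt(tokens, [0, 2], toy_generator, rng)
print(corrupted)
print(labels)
```

The discriminator receives a label for every position, not just the masked ones, which is one reason this objective uses the training data more efficiently than predicting only the masked tokens.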

Replaced token detection pre-training 🆚 masked language model pre-training

Given the same model size, data, and other conditions, it outperforms MLM-based methods such as BERT.

✅ Light-weight Models

DistilBERT (NeurIPS 2019 Workshop)
TinyBERT (Findings of EMNLP 2020)

✅ Fusing Knowledge Graph into Language Model

ERNIE: Enhanced Language Representation with Informative Entities (ACL 2019)
KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning (EMNLP 2019)

References

GPT-1
- https://blog.openai.com/language-unsupervised/
BERT : Pre-training of deep bidirectional transformers for language understanding, NAACL’19
- https://arxiv.org/abs/1810.04805
SQuAD: Stanford Question Answering Dataset
- https://rajpurkar.github.io/SQuAD-explorer/
SWAG: A Large-scale Adversarial Dataset for Grounded Commonsense Inference
- https://leaderboard.allenai.org/swag/submissions/public
How to Build OpenAI’s GPT-2: “The AI That Was Too Dangerous to Release”
- https://blog.floydhub.com/gpt2/
GPT-2
- https://openai.com/blog/better-language-models/
- https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Language Models are Few-shot Learners, NeurIPS’20
- https://arxiv.org/abs/2005.14165
Illustrated Transformer
- http://jalammar.github.io/illustrated-transformer/
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR’20
- https://arxiv.org/abs/1909.11942
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, ICLR’20
- https://arxiv.org/abs/2003.10555
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- https://arxiv.org/abs/1910.01108
TinyBERT: Distilling BERT for Natural Language Understanding, Findings of EMNLP’20
- https://arxiv.org/abs/1909.10351
ERNIE: Enhanced Language Representation with Informative Entities
- https://arxiv.org/abs/1905.07129
KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
- https://arxiv.org/abs/1909.02151

개발자CuCu