BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model for Natural Language Processing (NLP) tasks. It was introduced by the Google AI team in October 2018 in the paper "Pre-training of Deep Bidirectional Transformers for Language Understanding". It is trained on English Wikipedia and the BooksCorpus, which gives it a strong grasp of language and context. It comes in two versions: Base (12 encoder layers) and Large (24 encoder layers). It is a large, transformer-based masked language model. BERT builds on several clever ideas from the NLP community, such as ELMo, the OpenAI Transformer, and the original Transformer.
BERT is quite a large model that is already trained; we fine-tune it for a specific problem by adding some additional fully connected layers. Training it from scratch is very expensive, so we use the pre-trained model, either Base (12 encoders) or Large (24 encoders), for our problem. The pre-trained model understands the language and the context, and fine-tuning trains BERT on a specific task. The overall big picture of BERT pre-training and fine-tuning is shown in the figure below:
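To make the fine-tuning idea concrete, here is a minimal sketch in plain NumPy, not the real BERT API: a frozen stand-in "encoder" plays the role of the pre-trained model, and only a small fully connected classification head on top of it is trained. All names, sizes, and data here are illustrative assumptions.

```python
import numpy as np

# Toy sketch of fine-tuning: the "pre-trained" encoder is frozen, and we
# train only a fully connected head on top of its output. This mimics the
# pattern of adding a classification layer on top of BERT; it is NOT BERT.

rng = np.random.default_rng(0)

HIDDEN = 8        # stand-in for BERT's hidden size (768 in BERT-Base)
NUM_CLASSES = 2

def frozen_encoder(token_ids):
    """Pretend pre-trained encoder: a fixed random embedding table,
    mean-pooled over the sequence. Its weights are never updated."""
    table = np.random.default_rng(42).normal(size=(100, HIDDEN))
    return table[token_ids].mean(axis=0)

# The only trainable parameters: one fully connected classification layer.
W = rng.normal(scale=0.1, size=(HIDDEN, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(token_ids):
    h = frozen_encoder(token_ids)
    return softmax(h @ W + b)

# One gradient step on a single labelled example (cross-entropy loss).
x, y = [3, 17, 42], 1
h = frozen_encoder(x)
p = predict(x)
grad_logits = p.copy()
grad_logits[y] -= 1.0              # d(loss)/d(logits) for cross-entropy
W -= 0.5 * np.outer(h, grad_logits)
b -= 0.5 * grad_logits

print(predict(x)[y] > p[y])        # the true-class probability increased
```

Only `W` and `b` change during training, which is why fine-tuning is so much cheaper than pre-training: the expensive encoder weights are reused as-is (in practice, BERT's encoder weights are usually also updated during fine-tuning, but at a tiny fraction of the pre-training cost).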
As you can see in the figure above, BERT is pre-trained once and then fine-tuned for a specific task, for example SQuAD (Stanford Question Answering Dataset) for question answering, NER (Named Entity Recognition), or MNLI (Multi-Genre Natural Language Inference). The two tasks on which BERT is pre-trained are as follows.
- Masked Language Model
- Next Sentence Prediction
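As an illustration of the first task, here is a small self-contained sketch of the masked-language-model input corruption described in the BERT paper: roughly 15% of tokens are selected, and of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The sentence, vocabulary, and helper function are made up for this example.

```python
import random

# Illustrative sketch of BERT-style masked-language-model corruption.
# The model is then trained to predict the original token at each
# selected position. Tokens and vocabulary here are toy examples.

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=1):
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            targets.append(tok)          # model must predict the original
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")  # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)       # 10%: kept, but still predicted
        else:
            masked.append(tok)
            targets.append(None)         # not scored at this position
    return masked, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))
masked, targets = mask_tokens(sentence, vocab)
print(masked)
```

Keeping 10% of selected tokens unchanged forces the model to build a useful representation of every input token, since it cannot tell which positions will be scored.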