BERT

Use a huge amount of data to train a language representation model (or an embedding model for sentences).
- What is the input of this model? A sentence? A document? Does it matter?
MLM: Masked Language Model
- Also called Cloze
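A minimal sketch of the cloze-style masking idea, to make the MLM objective concrete. The function name `mask_tokens` and its parameters are made up for illustration. It is also simplified: every selected token is replaced with `[MASK]`, whereas BERT replaces only most of the selected tokens with `[MASK]` and swaps the rest for a random token or leaves them unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens; the model must predict the hidden originals.

    Returns (masked_tokens, labels). labels[i] is the original token at a
    masked position and None elsewhere (no loss is computed there).
    Simplification: always substitutes [MASK] for a selected token.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # remember what should be predicted
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the prediction at a masked position can use tokens on both sides, this objective trains a bidirectional representation, unlike a left-to-right language model.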
What is the difference between sentence-level and token-level tasks?
How to understand “left-to-right language modeling objectives”?
NSP?
- Next Sentence Prediction
BERT
- Pre-training: during pre-training, the model is trained on unlabeled data over different pre-training tasks
- Fine-tuning: the model is initialized with the pre-trained parameters and fine-tuned for downstream tasks with labeled data
- Model architecture: multi-layer bi-directional transformer encoder
How do position embeddings and positional encoding interact?
CRF?
- Conditional Random Fields

Sentence-BERT

Problem with prior work: both sentences must be fed into the network together, which causes massive computational overhead and makes inference for tasks such as sentence similarity comparison expensive.
Fine-tune BERT
- Siamese network?
  - Also called twin network
  - Just use the same network (shared weights) twice and compare the two outputs for similarity
- Triplet network?
  - Triplet loss: Loss = Max(dist(anchor, pos) - dist(anchor, neg) + margin, 0)
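The triplet loss formula above can be checked with a small worked example. This is a sketch under stated assumptions: `dist` is taken to be Euclidean distance and the margin is 1.0, and the vectors are toy 2-D points rather than real sentence embeddings.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, pos, neg, margin=1.0):
    """max(dist(anchor, pos) - dist(anchor, neg) + margin, 0)."""
    return max(euclidean(anchor, pos) - euclidean(anchor, neg) + margin, 0.0)

anchor = [0.0, 0.0]
pos = [0.0, 1.0]   # distance 1 from the anchor
neg = [3.0, 4.0]   # distance 5 from the anchor

# Positive is already closer than the negative by more than the margin:
# 1 - 5 + 1 = -3, clipped to 0, so there is nothing left to learn.
loss_easy = triplet_loss(anchor, pos, neg)

# Roles swapped: 5 - 1 + 1 = 5, so the loss pushes the embeddings apart.
loss_hard = triplet_loss(anchor, neg, pos)

print(loss_easy, loss_hard)
```

The loss is zero only when the negative is at least `margin` farther from the anchor than the positive, which is what pulls similar sentences together and pushes dissimilar ones apart.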
Pooling operation?
- Why is it used in an NLP setting?
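One way to see why pooling shows up here: BERT outputs one vector per token, so sentences of different lengths give different numbers of vectors, and pooling collapses them into a single fixed-size sentence vector. Below is a rough sketch of mean pooling with a padding mask; `mean_pool` and the toy vectors are illustrative, not taken from any library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

# Three token positions; the last one is padding and must not affect the result.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Other choices are possible, such as taking the vector of the first ([CLS]) token or an element-wise max over tokens.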
Why do we use siamese and triplet networks in Sentence-BERT?
The “3 Model” section seems the most important, but at first I failed to see how SBERT is faster than BERT at inference
- It derives an embedding using the output from BERT
- The embedding network is fine-tuned using siamese and triplet networks
- Oh, I see … instead of running BERT on every (i, j) input pair, the new approach extracts a usable embedding from BERT for each sample (a linear number of BERT passes rather than a quadratic number of pairs), and then computes the pairwise scores with a much cheaper operation such as cosine similarity.
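The cost argument can be sketched in a few lines. Everything here is a stand-in: `toy_encode` is a hypothetical letter-count encoder that only plays the role of the expensive SBERT forward pass, and `pair_count` just counts how many pairs a cross-encoder would have to process.

```python
import math
from itertools import combinations

def toy_encode(sentence):
    """Hypothetical stand-in for the expensive encoder: a letter-count vector."""
    vec = [0.0] * 26
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    """Cheap pairwise comparison applied to precomputed embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

sentences = ["a cat sat", "a cat sits", "stock prices fell", "the dog barked"]

# Expensive step: one encoder pass per sentence (linear in n).
embeddings = [toy_encode(s) for s in sentences]

# Cheap step: cosine similarity for every pair of stored embeddings.
scores = {
    (i, j): cosine(embeddings[i], embeddings[j])
    for i, j in combinations(range(len(sentences)), 2)
}

print(len(embeddings), "encoder passes vs", pair_count(len(sentences)), "pairs")
print(pair_count(10000))  # pairs a cross-encoder would need for 10,000 sentences
```

The number of comparisons is still quadratic, but each one is a dot product on cached vectors rather than a full transformer pass over the pair.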

HAN-doc

The hierarchical network mirrors the hierarchical structure of documents
- Basically uses two layers of attention at different levels of granularity (word level and sentence level)
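A rough sketch of the two-level idea: attention over word vectors gives one vector per sentence, then attention over sentence vectors gives the document vector. This is heavily simplified and the names (`attend`, `encode_document`, the context vectors) are illustrative. It scores each vector by a plain dot product with a context vector, while the actual HAN model also runs recurrent encoders and a small learned projection before scoring.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(vectors, context):
    """Attention pooling: weight each vector by its similarity to `context`."""
    scores = [sum(a * b for a, b in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights

def encode_document(doc, word_context, sent_context):
    """doc is a list of sentences; each sentence is a list of word vectors."""
    # Level 1: attention over the words of each sentence.
    sentence_vectors = [attend(words, word_context)[0] for words in doc]
    # Level 2: attention over the resulting sentence vectors.
    return attend(sentence_vectors, sent_context)

doc = [
    [[1.0, 0.0], [0.0, 1.0]],               # sentence with 2 words
    [[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]],   # sentence with 3 words
]
doc_vector, sentence_weights = encode_document(
    doc, word_context=[1.0, 0.0], sent_context=[0.0, 1.0]
)
print(doc_vector)
print(sentence_weights)
```

In a trained model the two context vectors are learned parameters, so the weights indicate which words and which sentences the classifier relies on.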
MLP
