- positive example
- negative example
- the positive example has the same identity as the anchor
- pull the anchor and positive example closer together, and
- push the anchor and negative example further apart
e.g. measure closeness with L2 distance or some other distance metric
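
A minimal sketch of the anchor/positive/negative objective described in these notes, using L2 distance. The `margin` parameter and the function names are assumptions for illustration; the notes do not specify them.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (margin is an assumed hyperparameter).

    The loss is zero once the negative is at least `margin` further
    from the anchor than the positive is; otherwise it is positive,
    so minimizing it pulls the positive in and pushes the negative out.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

well_separated = triplet_loss(anchor, positive=near, negative=far)
violating = triplet_loss(anchor, positive=far, negative=near)
```

Here `well_separated` is zero because the negative is already much further away than the positive, while `violating` is positive because the roles are swapped.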
- Why not just sample 1 positive and 1 negative?
Masked language modeling (MLM) is used for BERT/RoBERTa: https://arxiv.org/pdf/1907.11692.pdf
- Instead of predicting the token, why not predict the embedding vector representing the missing token?
MLM Embedding Loss #idea
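
One way the idea could look, as a rough sketch only: the model outputs a vector for each masked position, and the loss compares it with the embedding of the true token. Cosine distance, the function names, and the toy `embedding_table` are all assumptions here; the note does not say which distance or embedding source to use.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mlm_embedding_loss(predicted_vectors, target_token_ids, embedding_table):
    """Mean cosine distance between predictions and target token embeddings.

    predicted_vectors: one vector per masked position (hypothetical model output)
    target_token_ids:  the true token id at each masked position
    embedding_table:   list of embedding vectors indexed by token id
    """
    distances = [
        1.0 - cosine_similarity(pred, embedding_table[token_id])
        for pred, token_id in zip(predicted_vectors, target_token_ids)
    ]
    return sum(distances) / len(distances)


# Toy 2-token vocabulary with orthogonal embeddings (illustrative values).
embedding_table = [[1.0, 0.0], [0.0, 1.0]]

perfect = mlm_embedding_loss([[1.0, 0.0]], [0], embedding_table)
wrong = mlm_embedding_loss([[1.0, 0.0]], [1], embedding_table)
```

With this toy table, `perfect` is zero because the prediction points exactly at the target embedding, and `wrong` is one because the prediction is orthogonal to it.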