Interesting Loss Functions

Tags: None
State: None

Triplet Loss #todo


Used for BERT/RoBERTa:

Instead of predicting the token, why not predict the embedding vector that represents the missing token?

IDEA: MLM Embedding Loss. This might be mathematically equivalent to what is already being done. I'm not sure about this point though.
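
One way to probe the equivalence question is to write both losses for a single masked position. This is only a sketch under an assumption not stated in these notes: the MLM logits are plain dot products between the predicted vector h and the token embeddings e_v (no bias, no extra output transform), and t is the index of the true token.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{CE}} &= -\,h^\top e_t + \log \sum_{v} \exp\!\left(h^\top e_v\right) \\
\mathcal{L}_{2} &= \lVert h - e_t \rVert^2
                 = \lVert h \rVert^2 - 2\,h^\top e_t + \lVert e_t \rVert^2
\end{aligned}
```

Under that assumption both losses reward a large h·e_t, but they do not look identical: cross-entropy has a log-sum-exp term that pushes h away from every other token's embedding, while the L2 loss has no such term and instead penalizes the norms of h and e_t. If the embeddings are trained jointly, an L2-only objective could probably be driven down by moving all embeddings toward each other, which may be one reason to begin with cross-entropy.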

Basically, start with cross-entropy on the tokens, then gradually migrate to an L2 loss on the word embeddings.
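
A minimal sketch of that schedule for a single masked position, in plain Python so it runs without any ML library. Everything named here is hypothetical and not from these notes: the linear ramp in `blend_weight`, the function names, and the assumption that token logits are dot products between the predicted vector and the token embeddings.

```python
import math


def cross_entropy(logits, target_idx):
    """Softmax cross-entropy for one position (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_idx]


def l2_loss(pred_vec, target_vec):
    """Squared Euclidean distance between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred_vec, target_vec))


def blend_weight(step, total_steps):
    """Assumed linear ramp: 0.0 at step 0 (all cross-entropy), 1.0 from total_steps on (all L2)."""
    return min(1.0, max(0.0, step / total_steps))


def mlm_embedding_loss(pred_vec, embeddings, target_idx, step, total_steps):
    """Blend token cross-entropy with L2 on the target token's embedding."""
    # Assumption: logits are dot products with the token embeddings.
    logits = [sum(p * e for p, e in zip(pred_vec, emb)) for emb in embeddings]
    alpha = blend_weight(step, total_steps)
    ce = cross_entropy(logits, target_idx)
    l2 = l2_loss(pred_vec, embeddings[target_idx])
    return (1.0 - alpha) * ce + alpha * l2


# Toy example: a 3-token vocabulary with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pred_vec = [0.9, 0.1]
for step in (0, 500, 1000):
    print(step, mlm_embedding_loss(pred_vec, embeddings, 0, step, 1000))
```

The ramp shape is a free design choice; a cosine or stepwise schedule would slot into `blend_weight` without touching the rest. In a real model the same blend would be applied to batched tensors, and the choice of whether gradients flow into the target embedding would need to be made explicitly.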