Hey, first off, thanks for sharing this work!
I noticed that you're actively using Relative Positional Embeddings (RPE) for all attention operations here and here.
As the RPE are intended to emulate the convolutional layer shift-invariance property. I wonder what was the impact of this choice? Looking at your arxiv paper, I haven't seen any explicit mention of this choice or an ablation study.
If you could share why RPE was important and if it is needed, I'd appretiate, thanks!
Hey, first off, thanks for sharing this work!
I noticed that you're actively using Relative Positional Embeddings (RPE) for all attention operations here and here.
As the RPE are intended to emulate the convolutional layer shift-invariance property. I wonder what was the impact of this choice? Looking at your arxiv paper, I haven't seen any explicit mention of this choice or an ablation study.
If you could share why RPE was important and if it is needed, I'd appretiate, thanks!