#HPLT v3.0 Dataset is OUT! 🚀 A massive leap in multilingual data quality & scale:
📈 73% unique segments (up from 52%)
🌐 Better text extraction and langID
🔄 Global deduplication, cleaner corpus
Ideal for training better multilingual models
🔗 hplt-project.org/datasets/v3.0
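For readers wondering what segment-level global deduplication means in practice, here is a minimal sketch of exact dedup across a whole corpus (a toy illustration only, not the HPLT pipeline itself; the actual pipeline is documented on hplt-project.org and in the v2 paper):

```python
import hashlib

def dedup_segments(segments):
    """Keep the first occurrence of each text segment across the whole corpus."""
    seen = set()
    unique = []
    for seg in segments:
        # Hash normalized text so duplicates differing only in surrounding
        # whitespace collapse to a single entry.
        key = hashlib.sha256(seg.strip().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(seg)
    return unique

corpus = ["Hello world.", "Hello world.", "Bonjour le monde."]
print(dedup_segments(corpus))  # ['Hello world.', 'Bonjour le monde.']
```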
HPLT v2 datasets are now enriched with register labels from @UniTurku. As Amanda Myntti and Veronika Laippala show: "Appropriate metadata increases the value of a dataset."
- Blog post: hplt-project.org/register-labels
- Datasets + register labels (to be merged): hplt-project.org/datasets/v2.0
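Since the labels ship separately from the documents, here is a minimal sketch of what that eventual merge could look like, joining records by document id (the field names "id" and "register" are assumptions for illustration, not the published schema; see hplt-project.org/datasets/v2.0 for the real formats):

```python
import json

# Hypothetical join of HPLT v2 documents with their register labels.
# Field names ("id", "register") are assumed, not the actual schema.
def merge_registers(docs_path, labels_path, out_path):
    # Load the id -> register-label mapping from a JSON-lines file.
    labels = {}
    with open(labels_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            labels[rec["id"]] = rec["register"]

    # Attach each document's label and write the enriched records out.
    with open(docs_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)
            doc["register"] = labels.get(doc["id"])
            fout.write(json.dumps(doc, ensure_ascii=False) + "\n")
```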
New paper on the making of the HPLT v2 dataset:
- pipeline documentation and code
- extensive analysis of its quality and characteristics
- evaluation of language models and machine translation systems trained on it
🤓 Happy reading! arxiv.org/pdf/2503.10267
We are happy to announce the second release of the HPLT bilingual datasets:
- 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩
- 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮
Available in the HPLT dataset catalogue and on OPUS.
👏👏 Check out the second-best-performing system in our Generative Error Correction Challenge for LLM-based emotion recognition (yet unreasonably rejected by SLT 2024). They achieved 75.1% accuracy on ASR transcriptions of IEMOCAP using GPT-4o.
"Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models," Pavel Stepachev, Pinzhen Chen, Barry Haddow, ift.tt/tCG7YmT
This work has been accepted at WMT (the Conference on Machine Translation), co-located with EMNLP 2024! 🥳🎉 We also have another paper accepted, on the cross-cultural transcreation of restaurant menus, led by Zhonghe Zhang: arxiv.org/abs/2408.13534 See you all in Miami! Eager to discuss…
We know LLMs are poor at MT in low-resource languages (LRLs): curious how to adapt them to perform better? 🚀 Our new paper explores how the scale of MT data and the diversity of tasks/languages in instruction tuning interact to determine LLM MT performance for LRLs 💡…
I can't understand why challenge papers were reviewed under the same process and acceptance rate as regular papers in #SLT2024. Some of the best-performing submissions in our challenge were rejected, even though the meta-reviewer recommended acceptance!🤡