Pavel Stepachev

@PStepachev

Do ml and brrrr

Pavel Stepachev reposted

#HPLT v3.0 Dataset is OUT! 🚀

A massive leap in multilingual data quality & scale with:

📈 73% unique segments (up from 52%)
🌐 Better text extraction and langID
🔄 Global deduplication, cleaner corpus

Ideal for training better multilingual models

🔗 hplt-project.org/datasets/v3.0

Pavel Stepachev reposted

HPLT v2 datasets now enriched with register labels from @UniTurku. As Amanda Myntti and Veronika Laippala show: "Appropriate metadata increases the value of a dataset".

- Blog post: hplt-project.org/register-labels
- Datasets + register labels (to be merged): hplt-project.org/datasets/v2.0

Pavel Stepachev reposted

New paper on the HPLT v2 dataset making-of:

- pipeline documentation and code
- extensive analysis of the quality and characteristics
- evaluation of the performance of language models and machine translation systems trained on it

🤓 Happy reading! arxiv.org/pdf/2503.10267

Pavel Stepachev reposted

We are happy to announce the second release of HPLT bilingual datasets:

- 50 English-centric language pairs = 380M parallel sentences (HPLT) 🤩
- 1,275 non-English-centric language pairs = 16.7B parallel sentences (MultiHPLT) 😮

Available at the HPLT dataset catalogue and OPUS.


Pavel Stepachev reposted

👏👏Check out the 2nd best-performing system in our Generative Error Correction Challenge for LLM-based emotion recognition (yet unreasonably rejected by SLT2024). They achieved 75.1% accuracy on ASR transcriptions of IEMOCAP using GPT-4o.

"Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models," Pavel Stepachev, Pinzhen Chen, Barry Haddow, ift.tt/tCG7YmT



Pavel Stepachev reposted

This work has been accepted at WMT (Conference on Machine Translation), held with EMNLP 2024! 🥳🎉 We also have another accepted work on cross-cultural transcreation of restaurant menus, led by Zhonghe Zhang: arxiv.org/abs/2408.13534 See you all in Miami! Eager to discuss…

We know LLMs are poor at MT in low-resource languages (LRLs): curious how to adapt them to perform better? 🚀 Our new paper explores the interplay between scale (of MT data) and diversity (of tasks/langs) in instruction tuning in determining LLM-MT performance for LRLs💡…


Pavel Stepachev reposted

I can't understand why challenge papers were reviewed under the same process and acceptance rate as regular papers in #SLT2024. Some of the best-performing submissions in our challenge were rejected, even though the meta-reviewer recommended acceptance!🤡

