
Guilherme Penedo

@gui_penedo

Pre-training data @huggingface 🤗. Lisboeta 🇵🇹

Pinned Tweet

New dataset release: 🌐FineWiki This is an updated and better extracted version of Wikipedia, covering 325+ languages. Unlike the old dataset from 2023, we kept all the math content, tables, properly rendered templates, and extracted key facts. Examples and highlights below.

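A minimal sketch of loading one language from the release with the 🤗 datasets library; the repo id HuggingFaceFW/finewiki, the "pt" config name, and the "text" field are assumptions modeled on earlier Fine* releases, so check the dataset card on the Hub for the exact identifiers.

```python
from datasets import load_dataset

# Assumed repo id and per-language config name; see the dataset card
# for the exact identifiers and available languages.
wiki = load_dataset("HuggingFaceFW/finewiki", "pt", split="train")

# Field name "text" is also an assumption about the record schema.
print(wiki[0]["text"][:300])
```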

We have just released 📄FinePDFs-Edu, a version of FinePDFs filtered with the FineWeb-Edu approach using ModernBERT and mmBERT. 350B+ tokens of top-tier mid-training data in multiple languages. You can also download the classifiers (all 69 of them!)


We're releasing a large update to 📄FinePDFs!
- 350B+ highly educational tokens in 69 languages, with incredible perf 🚀
- 69 edu classifiers, powered by ModernBERT and mmBERT
- 300k+ EDU annotations for each of 69 languages from Qwen3-235B

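For context, a rough sketch of how a FineWeb-Edu-style classifier is applied during filtering: the model id below is a placeholder (the actual 69 classifiers are published on the Hub with this release), and the 0-5 score convention follows the original FineWeb-Edu classifier.

```python
# Sketch: scoring a document with a FineWeb-Edu-style quality classifier.
# The model id is hypothetical; find the real ones in the release collection.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/finepdfs-edu-classifier-eng"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Photosynthesis converts light energy into chemical energy stored in glucose."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    # FineWeb-Edu-style classifiers emit a single regression logit read as an
    # educational score (roughly 0-5); docs below a threshold get dropped.
    score = model(**inputs).logits.squeeze(-1).item()
print(f"edu score: {score:.2f} -> {'keep' if score >= 3 else 'drop'}")
```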


Guilherme Penedo reposted

After ~4 years building SOTA models & datasets, we're sharing everything we learned in ⚡The Smol Training Playbook We cover the full LLM cycle: designing ablations, choosing an architecture, curating data, post-training, and building solid infrastructure. We'll help you…


There's a lot of hype around OCR models right now, but how do you actually use them to make a useful dataset? Today, we're releasing our codebase and additional data for 📄FinePDFs!

We’re releasing the full FinePDFs source code — plus new datasets and models! 🚀
📚 Datasets:
• OCR-Annotations — 1.6k PDFs labeled for OCR need
• Gemma-LID-Annotation — 20k samples per language (annotated with Gemma3-27B)
🤖 Models:
• XGB-OCR — OCR classifier for PDFs

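As an illustration of what an XGB-OCR-style router might look like: an XGBoost classifier that predicts whether a PDF needs OCR or can go through a text-based extractor. The feature names here are guesses; the real feature set and trained model ship with the released code.

```python
# Sketch of an OCR-need router. Features are illustrative stand-ins
# (e.g. fraction of pages with an embedded text layer), not the real ones.
import numpy as np
from xgboost import XGBClassifier

# Toy rows standing in for the 1.6k human OCR-need annotations:
# [text_layer_page_fraction, avg_chars_per_page, image_area_fraction]
X = np.array([
    [0.95, 2400, 0.10],  # digital-born PDF -> text extraction is enough
    [0.02,   15, 0.90],  # scanned PDF      -> needs OCR
    [0.80, 1800, 0.25],
    [0.05,   40, 0.85],
])
y = np.array([0, 1, 0, 1])  # 1 = route to an OCR model

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)

needs_ocr = clf.predict(np.array([[0.10, 60, 0.80]]))[0]
print("route to OCR" if needs_ocr else "route to text extraction")
```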


Guilherme Penedo reposted

Did I hear correctly that @gui_penedo announced that we will have free HF merch at our COLM poster session 🤔


We're in 🇨🇦 for @COLM_conf

Come talk to me and @HKydlicek on Wednesday at the FineWeb2 poster session (Session 3, Poster #58)

@LoubnaBenAllal1 will be on Session 5, Poster #23 (SmolLM2)


Incredible to see FineWeb2 used in this amazing model that will power many many many use cases

XLM-R is an incredible model but we’ve learned so much since it came out - architecture, data, domains, and broader language coverage. This is our take and we can’t wait to see what folks build with it!



Guilherme Penedo reposted

The FinePDFs pipeline really shows the massive scale needed for LLM data processing:

Raw data: 1.35B PDFs (1.2 PB) → $50k/mo storage

Extraction:
> rolmOCR: 368M docs → 250k GPUh, $750k
> docling: 918M docs → 2.4M vCPUh, $35k

Total liberation cost (incl. ablations): ~$1M

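A quick sanity check on those totals; the implied unit prices fall straight out of the numbers quoted above.

```python
# Back-of-the-envelope check using only the figures from the tweet.
gpu_hours, gpu_cost = 250_000, 750_000     # rolmOCR pass over 368M docs
cpu_hours, cpu_cost = 2_400_000, 35_000    # docling pass over 918M docs

print(f"implied GPU price:  ${gpu_cost / gpu_hours:.2f}/GPUh")    # ~$3.00
print(f"implied vCPU price: ${cpu_cost / cpu_hours:.4f}/vCPUh")   # ~$0.0146

# The OCR pass cost ~20x the docling pass in total despite covering fewer
# documents, which is why routing docs away from OCR (the XGB-OCR
# classifier above) matters so much.
```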

Guilherme Penedo reposted

We are releasing 📄 FinePDFs: the largest PDF dataset, spanning over half a billion documents!
- Long context: documents are 2x longer than web text
- 3T tokens from high-demand domains like legal and science
- Heavily improves over SoTA when mixed with the FW-EDU & DCLM web corpora

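Given the size, streaming is the practical way to poke at the data rather than downloading a ~3T-token corpus locally. The repo id and the FineWeb2-style config name below are assumptions; check the dataset card for the exact names.

```python
from datasets import load_dataset

# Assumed repo id and config name ("eng_Latn" follows FineWeb2 conventions).
pdfs = load_dataset("HuggingFaceFW/finepdfs", "eng_Latn",
                    split="train", streaming=True)

# Inspect a couple of records without materializing the whole split.
for doc in pdfs.take(2):
    print(sorted(doc.keys()))
```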
