
Guilherme Penedo

@gui_penedo

Pre-training data @huggingface 🤗. Lisboeta 🇵🇹

Pinned Tweet

New dataset release: 🌐FineWiki This is an updated and better extracted version of Wikipedia, covering 325+ languages. Unlike the old dataset from 2023, we kept all the math content, tables, properly rendered templates, and extracted key facts. Examples and highlights below.

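A minimal sketch of loading one language from the release with the 🤗 datasets library; the repo id HuggingFaceFW/finewiki, the "pt" config name, and the "text" field are assumptions modeled on earlier Fine* releases, so check the dataset card on the Hub for the exact identifiers.

```python
from datasets import load_dataset

# Assumed repo id and per-language config name; see the dataset card
# for the exact identifiers and available languages.
wiki = load_dataset("HuggingFaceFW/finewiki", "pt", split="train")

# Field name "text" is also an assumption about the record schema.
print(wiki[0]["text"][:300])
```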

We have just released 📄FinePDFs-Edu, a version of FinePDFs filtered with the FineWeb-Edu approach using ModernBERT and mmBERT. 350B+ tokens of top-tier mid-training data in multiple languages. You can also download the classifiers (all 69 of them!)


We're releasing a large update to 📄FinePDFs!
- 350B+ highly educational tokens in 69 languages, with incredible perf 🚀
- 69 edu classifiers, powered by ModernBERT and mmBERT
- 300k+ EDU annotations for each of 69 languages from Qwen3-235B

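For context, a rough sketch of how a FineWeb-Edu-style classifier is applied during filtering: the model id below is a placeholder (the actual 69 classifiers are published on the Hub with this release), and the 0-5 score convention follows the original FineWeb-Edu classifier.

```python
# Sketch: scoring a document with a FineWeb-Edu-style quality classifier.
# The model id is hypothetical; find the real ones in the release collection.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/finepdfs-edu-classifier-eng"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Photosynthesis converts light energy into chemical energy stored in glucose."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    # FineWeb-Edu-style classifiers emit a single regression logit read as an
    # educational score (roughly 0-5); docs below a threshold get dropped.
    score = model(**inputs).logits.squeeze(-1).item()
print(f"edu score: {score:.2f} -> {'keep' if score >= 3 else 'drop'}")
```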


Guilherme Penedo reposted

After ~4 years building SOTA models & datasets, we're sharing everything we learned in ⚡The Smol Training Playbook We cover the full LLM cycle: designing ablations, choosing an architecture, curating data, post-training, and building solid infrastructure. We'll help you…


There's a lot of hype around OCR models right now, but how do you actually use them to make a useful dataset? Today, we're releasing our codebase and additional data for 📄FinePDFs!

We’re releasing the full FinePDFs source code — plus new datasets and models! 🚀
📚 Datasets:
• OCR-Annotations — 1.6k PDFs labeled for OCR need
• Gemma-LID-Annotation — 20k samples per language (annotated with Gemma3-27B)
🤖 Models:
• XGB-OCR — OCR classifier for PDFs

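As an illustration of what an XGB-OCR-style router might look like: an XGBoost classifier that predicts whether a PDF needs OCR or can go through a text-based extractor. The feature names here are guesses; the real feature set and trained model ship with the released code.

```python
# Sketch of an OCR-need router. Features are illustrative stand-ins
# (e.g. fraction of pages with an embedded text layer), not the real ones.
import numpy as np
from xgboost import XGBClassifier

# Toy rows standing in for the 1.6k human OCR-need annotations:
# [text_layer_page_fraction, avg_chars_per_page, image_area_fraction]
X = np.array([
    [0.95, 2400, 0.10],  # digital-born PDF -> text extraction is enough
    [0.02,   15, 0.90],  # scanned PDF      -> needs OCR
    [0.80, 1800, 0.25],
    [0.05,   40, 0.85],
])
y = np.array([0, 1, 0, 1])  # 1 = route to an OCR model

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)

needs_ocr = clf.predict(np.array([[0.10, 60, 0.80]]))[0]
print("route to OCR" if needs_ocr else "route to text extraction")
```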


Guilherme Penedo reposted

Did I hear correctly that @gui_penedo announced that we will have free HF merch at our COLM poster session 🤔


We're in 🇨🇦 for @COLM_conf

Come talk to me and @HKydlicek on Wednesday at the FineWeb2 poster session (Session 3, Poster #58)

@LoubnaBenAllal1 will be on Session 5, Poster #23 (SmolLM2)


Incredible to see FineWeb2 used in this amazing model that will power many many many use cases

XLM-R is an incredible model but we’ve learned so much since it came out - architecture, data, domains, and broader language coverage. This is our take and we can’t wait to see what folks build with it!



Guilherme Penedo reposted

The FinePDFs pipeline really shows the massive scale needed for LLM data processing:

Raw data: 1.35B PDFs (1.2 PB) → $50k/mo storage

Extraction:
> rolmOCR: 368M docs → 250k GPUh, $750k
> docling: 918M docs → 2.4M vCPUh, $35k

Total liberation cost (incl. ablations): ~$1M

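A quick sanity check on those totals; the implied unit prices fall straight out of the numbers quoted above.

```python
# Back-of-the-envelope check using only the figures from the tweet.
gpu_hours, gpu_cost = 250_000, 750_000     # rolmOCR pass over 368M docs
cpu_hours, cpu_cost = 2_400_000, 35_000    # docling pass over 918M docs

print(f"implied GPU price:  ${gpu_cost / gpu_hours:.2f}/GPUh")    # ~$3.00
print(f"implied vCPU price: ${cpu_cost / cpu_hours:.4f}/vCPUh")   # ~$0.0146

# The OCR pass cost ~20x the docling pass in total despite covering fewer
# documents, which is why routing docs away from OCR (the XGB-OCR
# classifier above) matters so much.
```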

Guilherme Penedo reposted

We are releasing 📄 FinePDFs: the largest PDF dataset, spanning over half a billion documents!
- Long context: documents are 2x longer than web text
- 3T tokens from high-demand domains like legal and science
- Heavily improves over SoTA when mixed with the FW-EDU & DCLM web corpora

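Given the size, streaming is the practical way to poke at the data rather than downloading a ~3T-token corpus locally. The repo id and the FineWeb2-style config name below are assumptions; check the dataset card for the exact names.

```python
from datasets import load_dataset

# Assumed repo id and config name ("eng_Latn" follows FineWeb2 conventions).
pdfs = load_dataset("HuggingFaceFW/finepdfs", "eng_Latn",
                    split="train", streaming=True)

# Inspect a couple of records without materializing the whole split.
for doc in pdfs.take(2):
    print(sorted(doc.keys()))
```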
