Recommended tweets
New dataset release: 🌐FineWiki This is an updated and better extracted version of Wikipedia, covering 325+ languages. Unlike the old dataset from 2023, we kept all the math content, tables, properly rendered templates, and extracted key facts. Examples and highlights below.
We have just released 📄FinePDFs-Edu, a version of FinePDFs filtered with the FineWeb-Edu approach using ModernBERT and mmBERT. 350B+ tokens of top-tier mid-training data in multiple languages. You can also download the classifiers (all 69 of them!)
We're releasing a large update to 📄FinePDFs! - 350B+ highly educational tokens in 69 languages, with incredible perf 🚀 - 69 edu classifiers, powered by ModernBERT and mmBERT - 300k+ EDU annotations for each of 69 languages from Qwen3-235B
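The FineWeb-Edu-style filtering described in the two tweets above boils down to scoring each document with an educational-quality classifier and keeping only documents above a threshold. A minimal sketch of that step, with the classifier call stubbed out (the scoring function, the 0–5 score scale, and the threshold of 3 are assumptions based on the FineWeb-Edu approach, not details stated in the tweets):

```python
# Sketch of FineWeb-Edu-style filtering: an educational-quality classifier
# scores each document, and only documents at or above a threshold are kept
# for the mid-training mix. The scoring call below is a hypothetical stub;
# in practice it would be a fine-tuned ModernBERT/mmBERT regression head.

def edu_score(text: str) -> float:
    """Placeholder for a per-language edu classifier (hypothetical stub)."""
    # A real implementation would run the fine-tuned encoder and return its
    # predicted educational-quality score (assumed here to lie in [0, 5]).
    return 4.5 if "theorem" in text else 0.5

def filter_edu(docs: list[str], threshold: float = 3.0) -> list[str]:
    """Keep only documents whose edu score meets the threshold."""
    return [d for d in docs if edu_score(d) >= threshold]

docs = ["A theorem and its full proof, step by step.", "Buy cheap watches now!!!"]
kept = filter_edu(docs)
```

With 69 languages, the release implies one such classifier per language, each applied to documents in its own language.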
After ~4 years building SOTA models & datasets, we're sharing everything we learned in ⚡The Smol Training Playbook We cover the full LLM cycle: designing ablations, choosing an architecture, curating data, post-training, and building solid infrastructure. We'll help you…
There's a lot of hype around OCR models going around, but how do you actually use them to make a useful dataset? Today, we're releasing our codebase and additional data for 📄FinePDFs!
We’re releasing the full FinePDFs source code — plus new datasets and models! 🚀 📚 Datasets: • OCR-Annotations — 1.6k PDFs labeled for OCR need • Gemma-LID-Annotation — 20k samples per language (annotated with Gemma3-27B) 🤖 Models: • XGB-OCR — OCR classifier for PDFs
Did I hear correctly that @gui_penedo announced that we will have free HF merch at our COLM poster session 🤔
We're in 🇨🇦 for @COLM_conf Come talk to me and @HKydlicek on Wednesday at the FineWeb2 poster session (Session 3, Poster #58) @LoubnaBenAllal1 will be on Session 5, Poster #23 (SmolLM2)
Incredible to see FineWeb2 used in this amazing model that will power many many many use cases
XLM-R is an incredible model but we’ve learned so much since it came out - architecture, data, domains, and broader language coverage. This is our take and we can’t wait to see what folks build with it!
The FinePDFs pipeline really shows the massive scale needed for LLM data processing:
Raw data: 1.35B PDFs (1.2 PB) → $50k/mo storage
Extraction:
> rolmOCR: 368M docs → 250k GPUh, $750k
> docling: 918M docs → 2.4M vCPUh, $35k
Total liberation cost (incl. ablations): ~$1M
We are releasing 📄 FinePDFs: the largest PDF dataset spanning over half a billion documents! - Long context: Documents are 2x longer than web text - 3T tokens from high-demand domains like legal and science. - Significantly improves over SoTA when mixed with FW-EDU & DCLM web corpora.