Instruction Workshop, NeurIPS 2023

@itif_workshop

The official account of the 1st Workshop on Instruction Tuning and Instruction Following (ITIF), colocated with NeurIPS, in December 2023.

New Orleans

an-instructive-workshop.github.io

Joined August 2023

192 Posts · 252 Followers · 29 Following

Instruction Workshop, NeurIPS 2023 reposted

Sneha Kudugunta

@snehaark

Oct 28

A new, tractable approach to study scaling laws for larger data mixtures compared to prior art. We achieve significantly better fit ($R^2=0.98$) on multilingual data mixtures with ~50 languages.

Instruction Workshop, NeurIPS 2023 reposted

I-Hung Hsu

@IHung_Hsu

Have you ever wondered how to build target-language LMs most efficiently? 🤨 Will finetuning from other multilingual LMs help? No curse of multilinguality?? See the snapshot below for the answer. What about transferring data to help? What languages help? See our new paper! 👇

Shayne Longpre

@ShayneRedford

Oct 28

📢Thrilled to introduce ATLAS 🗺️: scaling laws beyond English, for pretraining, finetuning, and the curse of multilinguality. The largest public, multilingual scaling study to-date—we ran 774 exps (10M-8B params, 400+ languages) to answer: 🌍Are scaling laws different by…

Instruction Workshop, NeurIPS 2023 reposted

Xeophon

@xeophon_

Oct 28

What an insane banger and amazing work

Instruction Workshop, NeurIPS 2023 reposted

Marzieh Fadaee

@mziizm

Oct 28

Many gems in this paper for those who want to systematically scale multilingual learning with fewer arbitrary decisions. Great job Shayne et al.

Instruction Workshop, NeurIPS 2023 reposted

Niklas Muennighoff

@Muennighoff

Oct 28

To scale data-constrained LLMs, repeating & denoising objectives can help. Another solution: Add multilingual data. But what languages help & how much? Below a snapshot for this at 2B scale, e.g., Chinese can hurt English while Indonesian may help.

Instruction Workshop, NeurIPS 2023 reposted

Cody Blakeney

@code_star

Oct 28

Exciting to see someone do a study so elaborately and at such scale. I really like this grid looking at transfer synergy. Also this result is cool. Intuitive that model capacity helps, but great to see an empirical result.

Shayne Longpre

@ShayneRedford

Oct 28

Q2: Which languages actually help each other during training? And how much? 🌟Answer: We measure this empirically. We built a 38×38 transfer matrix, or 1,444 language pairs—the largest such resource to date. We highlight the top 5 most beneficial source languages for each…

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Jul 14

Copyrighted 🚧, private 🛑, and sensitive ☢️ data remain major challenges for AI. FlexOlmo introduces an architectural mechanism to flexibly opt-in/opt-out segments of data in the training weights, **at inference time**. (Prior common solutions were to filter your data once…

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Jun 23

Thrilled to collaborate on the launch of 📚 CommonPile v0.1 📚 ! Introducing the largest openly-licensed LLM pretraining corpus (8 TB), led by @kandpal_nikhil @blester125 @colinraffel. 📜: arxiv.org/pdf/2506.05209 📚🤖 Data & models: huggingface.co/common-pile 1/

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Apr 25

Come say hello at ICLR! 👋 Here's where you can find me: Friday: Data-centric AI Social! lu.ma/rmyoy2vw Saturday: Multimodal Data Provenance poster (3 pm, Hall 2B #494) Sunday: MLDPR Workshop (3 pm) [mldpr2025.com]—I'll talk about challenges to AI data…

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Apr 14

Thrilled our global data ecosystem audit was accepted to #ICLR2025! Empirically, we find: 1⃣ Soaring synthetic text data: ~10M tokens (pre-2018) to 100B+ (2024). 2⃣ YouTube is now 70%+ of speech/video data but could block third-party collection. 3⃣ <0.2% of data from…

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Mar 13

What are 3 concrete steps that can improve AI safety in 2025? 🤖⚠️ Our new paper, “In House Evaluation is Not Enough” has 3 calls-to-action to empower independent evaluators: 1️⃣ Standardized AI flaw reports 2️⃣ AI flaw disclosure programs + safe harbors. 3️⃣ A coordination…

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Feb 19

I compiled a list of resources for understanding AI copyright challenges (US-centric). 📚 ➡️ why is copyright an issue? ➡️ what is fair use? ➡️ why are memorization and generation important? ➡️ how does it impact the AI data supply / web crawling? 🧵

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Feb 12

I wrote a spicy piece on "AI crawler wars"🐞 in @MIT @techreview (my first op-ed)! While we’re busy watching copyright lawsuits & the EU AI Act, there’s a quieter battle over data access that affects websites, everyday users, and the open web. 🔗 technologyreview.com/2025/02/11/111… 1/

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Feb 10

1/ Last week, we published the International AI Safety Report—supported by 30 nations plus the OECD, UN, and EU. Over 100 independent experts contributed. I’m thankful to play a small writing role, focusing on “Risks of Copyright.” 🔗 bit.ly/40Vm7Mu

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Feb 3

Our updated Responsible Foundation Model Development Cheatsheet (250+ tools & resources) is now officially accepted to @TmlrOrg 2025! It covers: - data sourcing, - documentation, - environmental impact, - risk eval - model release & licensing

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Feb 1

🪶 Some thoughts on DeepSeek, OpenAI, and the copyright battles: This isn’t the first time OpenAI has accused a Chinese company of breaking its Terms and training on ChatGPT outputs. Dec 2023: They suspended ByteDance’s accounts. 1/

Instruction Workshop, NeurIPS 2023 reposted

Weijia Shi

@WeijiaShi2

Jan 31

Check out our recipe for adapting existing LMs for multimodal generation: it fully preserves language performances while enhancing models with visual understanding and generation🖼️

Weijia Shi

@WeijiaShi2

Dec 20

Introducing 𝐋𝐥𝐚𝐦𝐚𝐅𝐮𝐬𝐢𝐨𝐧: empowering Llama 🦙 with diffusion 🎨 to understand and generate text and images in arbitrary sequences. ✨ Building upon Transfusion, our recipe fully preserves Llama’s language performance while unlocking its multimodal understanding and…

Instruction Workshop, NeurIPS 2023 reposted

Shayne Longpre

@ShayneRedford

Jan 21

New Report, to appear at @RealAAAI 2025: The @defcon 2024 @aivillage_dc Generative Red Team 2 (GRT2) Case Study, led by @seanmcgregor The event spanned: ⚔️495 hackers, against AI2’s Olmo + WildGuard 🐞200 model flaw reports 💰$7k+ paid bounties 🔗 arxiv.org/pdf/2410.12104
