#scalablemachinelearning search results

No results for "#scalablemachinelearning"

1/n) Scaling across massive datasets can be achieved via hierarchical or federated approaches. Localized models process partitioned subsets, updating global baselines incrementally.
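A minimal sketch of that pattern, assuming a toy linear model and an invented `local_update`/`federated_round` API: each partition trains locally and its result is folded incrementally into the global baseline (FedAvg-style). This is an illustration of the idea, not a production federated-learning stack.

```python
import numpy as np

def local_update(global_weights, X, y, lr=0.05, epochs=10):
    """Train a local linear model on one data partition (gradient steps on the MSE loss)."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient computed on this shard only
        w -= lr * grad
    return w

def federated_round(global_weights, partitions, mix=0.5):
    """Incrementally fold each partition's locally trained weights into the global baseline."""
    w = global_weights
    for X, y in partitions:                  # could also run in parallel and average (FedAvg-style)
        w_local = local_update(w, X, y)
        w = (1 - mix) * w + mix * w_local    # incremental update of the global baseline
    return w

# Toy usage: 4 partitions drawn from the same underlying linear problem.
rng = np.random.default_rng(0)
true_w = rng.normal(size=8)
partitions = []
for _ in range(4):
    X = rng.normal(size=(256, 8))
    partitions.append((X, X @ true_w + 0.1 * rng.normal(size=256)))

w = np.zeros(8)
for _ in range(10):
    w = federated_round(w, partitions)
print("error vs true weights:", np.linalg.norm(w - true_w))
```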


They find power-law scaling between pretraining data and downstream manipulation error. Meaning you can now predict robot performance from data scale. That’s normal for LLMs. It’s new for robots.

robotsdigest's tweet image.
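As a hedged illustration of "predict robot performance from data scale": fit a power law error ≈ a·D^(−b) in log-log space and extrapolate. The data points below are invented for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical (dataset size, manipulation error) pairs -- illustrative only.
D = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
err = np.array([0.52, 0.41, 0.30, 0.23, 0.17])

# A power law err = a * D**(-b) is a straight line in log-log space: log err = log a - b log D.
slope, intercept = np.polyfit(np.log(D), np.log(err), 1)
a, b = np.exp(intercept), -slope

def predicted_error(data_scale):
    """Extrapolate downstream error at a larger data scale from the fitted power law."""
    return a * data_scale ** (-b)

print(f"fitted exponent b ≈ {b:.2f}")
print(f"predicted error at 1e7 demos: {predicted_error(1e7):.3f}")
```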

They derive an interesting scaling law for the method, where the failure rate drops exponentially as you scale candidates N and/or comparisons per match K. This is under two assumptions: the LLM generates a correct solution with non-zero probability and ...

BhavinJawade's tweet image.
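A rough Monte Carlo sketch of that claim (not the paper's exact setup): N candidates, each correct with some probability, and a knockout tournament where each match aggregates K noisy pairwise comparisons. Failure — no correct candidate winning — falls quickly as N and K grow, provided a correct solution appears with non-zero probability.

```python
import random

def run_tournament(N, K, p_correct=0.2, p_judge=0.7, trials=2000):
    """Estimate failure rate of best-of-N selection via knockout matches of K noisy comparisons."""
    failures = 0
    for _ in range(trials):
        pool = [random.random() < p_correct for _ in range(N)]   # True = correct candidate
        while len(pool) > 1:
            nxt = []
            for i in range(0, len(pool) - 1, 2):
                a, b = pool[i], pool[i + 1]
                if a == b:                      # equally good candidates: keep either
                    nxt.append(a)
                    continue
                # K noisy comparisons; each prefers the correct candidate with probability p_judge
                votes_for_correct = sum(random.random() < p_judge for _ in range(K))
                nxt.append(votes_for_correct > K / 2)
            if len(pool) % 2 == 1:              # odd pool: last candidate gets a bye
                nxt.append(pool[-1])
            pool = nxt
        failures += not pool[0]
    return failures / trials

for N in (4, 16, 64):
    for K in (1, 5):
        print(f"N={N:3d} K={K}: failure rate ≈ {run_tournament(N, K):.3f}")
```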

Pre-training under infinite compute
• Data, not compute, is the new bottleneck
• Standard recipes overfit → fix with strong regularization (30× weight decay)
• Scaling laws: loss decreases monotonically, best measured by asymptote not fixed budget

arankomatsuzaki's tweet image.
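A small sketch of what "measure by the asymptote" can mean in practice, assuming a saturating fit L(D) ≈ E + A·D^(−α): compare recipes by the fitted irreducible loss E rather than by the loss at one fixed data budget. The loss values below are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(D, E, A, alpha):
    """Saturating scaling-law form: irreducible loss E plus a power-law term in data D."""
    return E + A * D ** (-alpha)

# Invented losses at several data scales for two hypothetical recipes.
D = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss_baseline = np.array([3.90, 3.55, 3.30, 3.12, 3.00])
loss_regularized = np.array([3.95, 3.52, 3.22, 3.00, 2.84])

for name, losses in [("baseline", loss_baseline), ("30x weight decay", loss_regularized)]:
    (E, A, alpha), _ = curve_fit(loss_curve, D, losses, p0=[2.5, 300.0, 0.3], maxfev=10000)
    print(f"{name:18s} loss@1e10 tokens = {losses[-1]:.2f}, fitted asymptote E = {E:.2f}")
```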

This is the best public resource on scaling hardware for AI, and it's free. "How to Scale Your Model" is the bible from Google DeepMind that covers the math, systems, and scaling laws for LLM training and inference workloads. Approachable yet thorough. An absolute must-read.

deedydas's tweet image.
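In the spirit of the back-of-the-envelope math that guide teaches, here is a tiny calculator using the standard ≈6·(parameters)·(tokens) estimate for training FLOPs; the chip count, peak throughput, and utilization below are placeholder assumptions, not numbers from the guide.

```python
def training_days(params, tokens, n_chips, peak_flops_per_chip, mfu=0.4):
    """Estimate wall-clock training time from the ~6*N*D FLOPs rule of thumb."""
    total_flops = 6 * params * tokens                      # forward + backward pass approximation
    sustained = n_chips * peak_flops_per_chip * mfu        # achieved throughput at assumed utilization
    return total_flops / sustained / 86_400                # seconds -> days

# Example: 70B-parameter model, 1.4T tokens, 2048 accelerators at ~1e15 FLOP/s peak (assumed numbers).
print(f"{training_days(70e9, 1.4e12, 2048, 1e15):.1f} days")
```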

Large Language Models are getting smarter AND smaller. This video breaks down how AI researchers have made this possible.
00:00 Overview
00:20 Scaling Law 1.0
01:48 Using smaller weights
03:44 Increasing model training
06:05 Google Titans
07:40 What's after Transformers?
08:45…
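One technique plausibly behind "using smaller weights" is weight quantization. A minimal symmetric per-tensor int8 sketch (illustrative only, not tied to the video or any specific model):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: store weights as int8 plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"memory: {w.nbytes/1e6:.1f} MB -> {q.nbytes/1e6:.1f} MB, mean abs error {err:.4f}")
```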


Scaling is incredibly hard and demanding and leaves very little room for error in every little part of the training stack. But once it works, it's beautiful to see it.

MillionInt's tweet image.

We’ve answered the three remaining open questions @kellerjordan0 posted at kellerjordan.github.io/posts/muon/:
- Yes, Muon scales to larger compute.
- Yes, Muon works for training on a large cluster.
- Yes, Muon works for SFT.
RL probably works too; we leave that as an exercise for the audience :)

🚀 Introducing our new tech report: Muon is Scalable for LLM Training
We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW…

Kimi_Moonshot's tweet image.
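A simplified sketch of a Muon-style step based on the public description (momentum, Newton-Schulz orthogonalization of the 2D update, decoupled weight decay). The iteration coefficients follow the open-source reference implementation, but the per-parameter scale heuristic here is only one published variant; this is not the authors' code.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D update (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic iteration coefficients from the open-source Muon code
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, momentum_buf, lr=0.02, beta=0.95, weight_decay=0.01):
    """One simplified Muon-style step for a single 2D weight matrix."""
    momentum_buf.mul_(beta).add_(param.grad)           # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.mul_(1 - lr * weight_decay)                  # decoupled weight decay, as the report suggests
    # One heuristic variant of the per-parameter update scale: grow with the matrix aspect ratio.
    scale = max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.add_(update, alpha=-lr * scale)

# Toy usage on a random 2D weight with a fake gradient.
w = torch.randn(256, 512, requires_grad=True)
w.grad = torch.randn_like(w)
buf = torch.zeros_like(w)
muon_step(w, buf)
```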


This may be the most important figure in LLM research since the OG Chinchilla scaling law in 2022. The key insight is 2 curves working in tandem. Not one. People have been predicting a stagnation in LLM capability by extrapolating the training scaling law, yet they didn't…

DrJimFan's tweet image.
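One way to read "two curves in tandem": capability improves along a pretraining-compute axis and, separately, along a test-time-compute axis, so flattening on one curve need not mean flattening overall. A purely illustrative toy model with invented constants, not values from the figure:

```python
def toy_error(train_compute, test_compute):
    """Invented illustration: two independent power-law curves each reduce the error."""
    return 0.5 * train_compute ** (-0.1) + 0.5 * test_compute ** (-0.2)

for train in (1e22, 1e24):
    for test in (1e2, 1e4, 1e6):
        print(f"train={train:.0e} test-time={test:.0e} -> error {toy_error(train, test):.3f}")
```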

📢 New research on mechanistic architecture design and scaling laws.
- We perform the largest scaling-laws analysis (500+ models, up to 7B) of beyond-Transformer architectures to date.
- For the first time, we show that architecture performance on a set of isolated token…

MichaelPoli6's tweet image.

Simple and Scalable Strategies to Continually Pre-train Large Language Models Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually

_akhaliq's tweet image.
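The recipe described publicly for this line of work combines learning-rate re-warming and re-decaying with replay of a small fraction of old data. A schematic sketch under those assumptions (the training call at the end is hypothetical):

```python
import math
import random

def rewarmed_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup=0.01):
    """Re-warm the LR from near zero, then cosine re-decay -- restarted for each new data stage."""
    warmup_steps = int(warmup * total_steps)
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def mixed_batch(new_data, old_data, batch_size=8, replay_fraction=0.05):
    """Mix a small replay fraction of old-distribution data into each continual-pretraining batch."""
    n_replay = max(1, int(replay_fraction * batch_size))
    return random.sample(old_data, n_replay) + random.sample(new_data, batch_size - n_replay)

# Schematic loop over the new data stage.
old_data = [f"old_doc_{i}" for i in range(1000)]
new_data = [f"new_doc_{i}" for i in range(1000)]
total_steps = 100
for step in range(total_steps):
    lr = rewarmed_cosine_lr(step, total_steps)
    batch = mixed_batch(new_data, old_data)
    # model.train_step(batch, lr)  # hypothetical training call
```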

MosaicML announces Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
paper page: huggingface.co/papers/2401.00…
Large language model (LLM) scaling laws are empirical formulas that estimate changes in model quality as a result of increasing parameter…

_akhaliq's tweet image.

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Modifies the scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand.
arxiv.org/abs/2401.00448

arankomatsuzaki's tweet image.
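A hedged sketch of the idea behind the two posts above: under the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β (constants below are the commonly cited Hoffmann et al. fits, treated as approximate), choose the parameter count and training tokens that reach a target loss while minimizing training plus lifetime inference FLOPs, approximated as 6·N·D + 2·N·D_inference. This is a simplification of the analysis, not the paper's exact procedure.

```python
import numpy as np

# Commonly cited Chinchilla fit (Hoffmann et al., 2022); treat as approximate.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_target_loss(N, target_loss):
    """Solve L(N, D) = target for D, if the target is achievable at this parameter count."""
    rem = target_loss - E - A / N ** alpha
    if rem <= 0:
        return None                      # this N cannot reach the target loss with any D
    return (B / rem) ** (1 / beta)

def total_flops(N, D_train, D_inference):
    return 6 * N * D_train + 2 * N * D_inference   # training + lifetime inference approximation

target_loss = 2.1
D_inference = 2e12                       # assumed lifetime inference tokens
best = None
for N in np.logspace(9, 11.5, 200):      # sweep parameter counts from 1B to ~300B
    D = tokens_for_target_loss(N, target_loss)
    if D is None:
        continue
    cost = total_flops(N, D, D_inference)
    if best is None or cost < best[0]:
        best = (cost, N, D)
print(f"optimal: N≈{best[1]:.2e} params, D≈{best[2]:.2e} tokens, total ≈ {best[0]:.2e} FLOPs")
```

With a large inference demand, the minimizer lands on a smaller model trained on more tokens than the training-only Chinchilla optimum, which is the paper's qualitative point.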

Can we predict the future capabilities of AI? @ilyasut, co-founder and chief scientist of OpenAI, on the scaling law and an interesting area of research: (1) The scaling law relates the input of the neural network to simple-to-evaluate performance measures like next-word…


A counterintuitive implication of scale: trying to solve a more general version of the problem is an easier way to solve the original problem than directly tackling it. Attempting a more general problem encourages you to come up with a more general and simpler approach. This…


We had 1) the RMT paper on scaling Transformers to 1M tokens, and 2) the convolutional Hyena LLM for 1M tokens. How about 3) LONGNET: Scaling Transformers to 1 Billion Tokens (arxiv.org/abs/2307.02486)!? It achieves linear (vs quadratic) scaling via dilated (vs self) attention.

rasbt's tweet image.
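A rough sketch of why the dilated pattern is linear rather than quadratic: at each dilation level, every segment attends only within itself and only to every r-th position, so the number of attended query-key pairs grows linearly with sequence length. The segment lengths and dilation rates below are illustrative, not LongNet's exact configuration.

```python
def dilated_attention_pairs(seq_len, segment_lengths=(2048, 4096, 8192), dilation_rates=(1, 2, 4)):
    """Count query-key pairs touched by segmented, dilated attention (LongNet-style sparsity)."""
    total = 0
    for w, r in zip(segment_lengths, dilation_rates):
        n_segments = max(1, seq_len // w)
        rows_per_segment = w // r          # only every r-th position participates at this level
        total += n_segments * rows_per_segment ** 2
    return total

for n in (8_192, 65_536, 524_288):
    sparse = dilated_attention_pairs(n)
    dense = n * n
    print(f"seq_len={n:>7}: dilated pairs {sparse:.2e} vs dense {dense:.2e}")
```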

Training large models demands a lot of GPU memory and a long training time. With several training parallelism strategies and a variety of memory saving designs, it is possible to train very large neural networks across many GPUs. lilianweng.github.io/lil-log/2021/0…
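A minimal example of one memory-saving design covered in that post — activation (gradient) checkpointing with PyTorch's checkpoint utility — which trades recomputation for lower peak activation memory. The parallelism strategies would layer on top of this; the toy model below is an assumption, not from the post.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLPStack(nn.Module):
    """Toy residual MLP stack that recomputes block activations during backward instead of storing them."""
    def __init__(self, dim=1024, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not kept; they are recomputed in the backward pass.
            x = x + checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLPStack()
x = torch.randn(4, 1024, requires_grad=True)
model(x).sum().backward()   # trains like normal, with lower peak activation memory
```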

