Yotam Perlitz 👾

@yotamperlitz

Research Scientist at @ibmresearch, Practicing #NLProc, #RL. Opinions are my own.

Pinned

✨ Developed a new benchmark or dataset for language models? ✨ Want the community to trust and adopt it? 🤔 So, demonstrate its validity by comparing it to established benchmarks! BenchBench makes it easy. Check it out: 👉 huggingface.co/spaces/ibm/ben…


What a GEM!

GEM is so back! Our workshop for Generation, Evaluation, and Metrics is coming to an ACL near you. Evaluation in the world of GenAI is more important than ever, so please consider submitting your amazing work. CfP can be found at gem-benchmark.com/workshop


Yotam Perlitz 👾 reposted

A big thank you to the Unitxt team, collaborators, and our community for an incredible 2024! Together, we pushed boundaries in AI evaluation and set new standards for the field. Read our End-of-Year Summary here: unitxt.ai/en/latest/blog… #unitxt #llmevaluation


Yotam Perlitz 👾 reposted

Another cool benchmarking paper published yesterday. In "JuStRank: Benchmarking LLM Judges for System Ranking", researchers from @IBMResearch introduced JuStRank, the first large-scale benchmark for evaluating LLM judges for ranking target systems: arxiv.org/abs/2412.09569


Yotam Perlitz 👾 reposted

Prompting LMs is not enough to quantify their linguistic competence! Meet the Holmes🔎 benchmark at #EMNLP2024 #TACL or 👉🧵 💠Meta-study of current literature 💠Coverage of LMs and phenomena 💠Analysis of LM size, architecture, and instruction tuning holmes-benchmark.github.io

Yotam Perlitz 👾 reposted

RAG Developer Attention! 🔔 Docling is a new library from @IBM that efficiently parses PDF, DOCX, and PPTX and exports them to Markdown and JSON. It supports advanced PDF understanding and seamless integration with @llama_index and @LangChainAI. TL;DR: 🗂️ Parses numerous…

Yotam Perlitz 👾 reposted

Can lowercase really make or break an AI’s answer? 🤔 And what happens when an LLM ‘sees’ old TV-style white noise? 📺 We put these quirks to the test using Unitxt, showcasing the powerful, flexible testing it enables. Read on: unitxt.ai/en/latest/blog… #AI #MachineLearning #LLM


Yotam Perlitz 👾 reposted

Scaling laws predict 🦣 large models by training 🦟 small ones, cool right? Fortunately, they are not that complicated or costly, or at least they don't have to be. We have collected 400+ models, fitted 1000+ scaling laws, and created 1 guide for cheap & more reliable scaling law fitting:

Yotam Perlitz 👾 reposted

It's been a great collaboration journey with my wonderful co-authors: @AndreasWaldis, @yotamperlitz, @LChoshen, and @IGurevych. Please check it out and let us know if you'd like to see any additional functions or analyses added to the benchmark.


Get your benchmark game on: huggingface.co/spaces/ibm/ben…

Yotam Perlitz 👾 reposted

1/ Into Image Captioning? Don’t miss this! Struggling to keep up with the influx of new metrics but still see the same 5 (BLEU, METEOR, ROUGE, CIDEr, SPICE) leading? Read our recent Captioning evaluation survey! arxiv.org/abs/2408.04909 w/ @GabiStanovsky @AbendOmri @leafrermann >

Me trying to choose the right LLM benchmark without BenchBench: x.com/yotamperlitz/s…

Shoutout to @streamlit, our framework of choice! Shoutout to @huggingface for hosting our space 🤗

@karpathy only trusts @lmsysorg chatbot arena! x.com/karpathy/statu… Hold on! What if you don't have an army of annotators at your disposal? 🤔


