
EvalOps

@EvalOpsDev

Building the coordination layer for the cognitive economy

Pinned

Got tired of customers asking 'how do I know your eval results are real?' Fair question. So we made them mathematically provable.


Every release is a high‑wire act. Instead of praying for calm winds, build a net. EvalOps ties your policies, metrics and audits into a mesh that lets you scale without falling.


We open-sourced Nimbus – Firecracker-based CI for AI workloads. Multi-tenant isolation, RBAC, audit logs.


EvalOps is where evaluations meet operations — and security is no exception. “keep” shows how device posture, SSO, and OPA policies can be continuously tested and traced like any other system. Run it, break it, measure it. github.com/evalops/keep
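The idea of treating an access policy like any other tested system can be sketched in a few lines. This is an illustrative stand-in, not the actual "keep" repo: it mimics an OPA-style allow rule in plain Python (OPA itself uses Rego), with the posture and SSO fields as assumed names.

```python
# Hypothetical sketch: an access policy as a testable function,
# mimicking an OPA-style allow rule. Field names are assumptions,
# not the actual github.com/evalops/keep schema.

def allow(request: dict) -> bool:
    """Allow only SSO-authenticated users on devices with healthy posture."""
    return (request.get("sso_authenticated") is True
            and request.get("device_posture") == "healthy")

def test_policy():
    # "Run it, break it, measure it": assert both pass and fail paths,
    # so a policy regression shows up in CI like any other bug.
    assert allow({"sso_authenticated": True, "device_posture": "healthy"})
    assert not allow({"sso_authenticated": True, "device_posture": "jailbroken"})
    assert not allow({"sso_authenticated": False, "device_posture": "healthy"})
```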


Agents are already writing your code. The question isn't "should we use them?" It's "how do we ship them without surprises?" Provenance gives you a ledger. Every line. Every agent. Every risk. Measurable. github.com/evalops/proven…
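"Every line. Every agent. Every risk. Measurable." could look roughly like this as data. A minimal sketch, assuming a per-line ledger keyed by author and content hash; the field names and `agent:`/`human:` prefixes are illustrative, not the actual Provenance schema.

```python
# Hypothetical sketch of a per-line provenance ledger; field names
# are assumptions, not the actual github.com/evalops schema.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class LedgerEntry:
    file: str
    line_no: int
    author: str      # e.g. "human:alice" or "agent:gpt-4o"
    risk: str        # coarse label, e.g. "low" / "high"
    line_hash: str   # content hash so later edits are detectable

def record_lines(file: str, lines: list[str], author: str, risk: str):
    """Append one ledger entry per line, hashing the content."""
    return [
        LedgerEntry(file, i + 1, author, risk,
                    hashlib.sha256(text.encode()).hexdigest()[:12])
        for i, text in enumerate(lines)
    ]

def agent_share(ledger) -> float:
    """Fraction of recorded lines attributed to agents — one measurable signal."""
    agent = sum(1 for e in ledger if e.author.startswith("agent:"))
    return agent / len(ledger) if ledger else 0.0
```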


We’re open-sourcing Smith — the Firecracker-based CI runner that powers EvalOps. Why rebuild Blacksmith? Because eval gating needs specialized infra — and we’re not forcing you onto our cloud. Run evals on EvalOps Cloud or your own. github.com/evalops/smith


I'm told we're doing awards now?


Everyone wants to move fast. @EvalOpsDev makes sure you don’t break trust along the way. Governed AI releases start here.

Shipped a new home for @EvalOpsDev. No fluff, just governed AI releases. Check it out -> evalops.dev



🔥 Just dropped an evaluation‑driven LoRA loop built on Tinker from @thinkymachines! It trains, benchmarks & iterates until your model meets the mark. It auto‑spots weaknesses, spawns targeted LoRA jobs & tracks improvements. Proof‑of‑concept repo: github.com/evalops/tinker…
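The train → benchmark → iterate shape of that loop can be sketched abstractly. This is a hedged outline, not the repo's code: `train_lora` and `run_benchmark` are placeholders for the Tinker-backed steps, whose real API isn't shown in the tweet.

```python
# Hypothetical sketch of an evaluation-driven LoRA loop: train,
# benchmark, target the weakest category, repeat until the model
# meets the mark. train_lora / run_benchmark are assumed callables.

def eval_driven_loop(train_lora, run_benchmark, target=0.9, max_rounds=5):
    """train_lora(focus) -> adapter; run_benchmark(adapter) -> {category: score}."""
    adapter, history = None, []
    focus = None  # category the next targeted LoRA job should attack
    for _ in range(max_rounds):
        adapter = train_lora(focus)
        scores = run_benchmark(adapter)
        history.append(scores)
        # Auto-spot the weakest category; spawn a targeted job next round.
        focus, low = min(scores.items(), key=lambda kv: kv[1])
        if low >= target:
            break
    return adapter, history
```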


Sick of yak-shaving to get a clean Transformers setup? We built a stack that just works: PyTorch + HF Transformers · Hydra configs · FastAPI serving + Prometheus · vLLM, LoRA, flash-attn, bitsandbytes. Reproducible. Dockerized. CI/CD baked in. github.com/evalops/stack


Developer resumes are frozen in time. GitHub tells the real story. 7k commits, +1.4M lines → now that’s a holographic trading card worth flexing. 🚀 cards.evalops.dev



EvalOps reposted

LLM vendor: “Just quantization.” Reality: reward-hacked code, broken workflows, lost week. Companies: “nbd.” Users: 🙃🔥 Making this a thing of the past.


🐙 Meet Mocktopus — multi‑armed mocks for your LLM apps! 🧪 Deterministic mocks for OpenAI‑style chat completions, tool calls & streaming. Make your evals deterministic, run CI offline, and record & replay 👉 github.com/evalops/mockto…
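The record-and-replay idea behind deterministic mocks is simple to sketch. The class and method names below are illustrative, not Mocktopus's actual API: requests are hashed so the same prompt always replays the same canned completion, which is what lets evals run offline in CI.

```python
# Hypothetical sketch of record & replay for an OpenAI-style chat
# endpoint; names here are assumptions, not the Mocktopus API.
import hashlib
import json

class MockChat:
    """Canned completions keyed by a hash of the request, so identical
    prompts always yield identical responses (deterministic evals)."""

    def __init__(self):
        self._recordings = {}  # request hash -> recorded response text

    @staticmethod
    def _key(messages, model):
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def record(self, messages, model, response_text):
        self._recordings[self._key(messages, model)] = response_text

    def create(self, messages, model):
        key = self._key(messages, model)
        if key not in self._recordings:
            raise KeyError("no recording for this request; record it first")
        # Mimic the shape of a chat.completions response.
        return {"choices": [{"message": {"role": "assistant",
                                         "content": self._recordings[key]}}]}
```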


New @EvalOpsDev research drop: Metaethical Breach DSPy ⚖️ Can models be coaxed into breaking their own guardrails if you wrap harmful prompts in moral philosophy? Repo: github.com/evalops/metaet…


Most eval tools tell you how your model did. @EvalOpsDev + DreamCoder tells you why. By synthesizing programs from data, DreamCoder finds hidden structure, invents evals, and surfaces failure grammars you didn’t think to test. Evals that discover the evals.

