
EvalOps

@EvalOpsDev

Building the coordination layer for the cognitive economy

Pinned

Got tired of customers asking 'how do I know your eval results are real?' Fair question. So we made them mathematically provable.


Every release is a high‑wire act. Instead of praying for calm winds, build a net. EvalOps ties your policies, metrics and audits into a mesh that lets you scale without falling.


We open-sourced Nimbus – Firecracker-based CI for AI workloads. Multi-tenant isolation, RBAC, audit logs.


EvalOps is where evaluations meet operations — and security is no exception. “keep” shows how device posture, SSO, and OPA policies can be continuously tested and traced like any other system. Run it, break it, measure it. github.com/evalops/keep
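The idea of treating an access policy like any other tested system can be sketched in a few lines. This is an illustrative stand-in, not the actual "keep" repo: it mimics an OPA-style allow rule in plain Python (OPA itself uses Rego), with the posture and SSO fields as assumed names.

```python
# Hypothetical sketch: an access policy as a testable function,
# mimicking an OPA-style allow rule. Field names are assumptions,
# not the actual github.com/evalops/keep schema.

def allow(request: dict) -> bool:
    """Allow only SSO-authenticated users on devices with healthy posture."""
    return (request.get("sso_authenticated") is True
            and request.get("device_posture") == "healthy")

def test_policy():
    # "Run it, break it, measure it": assert both pass and fail paths,
    # so a policy regression shows up in CI like any other bug.
    assert allow({"sso_authenticated": True, "device_posture": "healthy"})
    assert not allow({"sso_authenticated": True, "device_posture": "jailbroken"})
    assert not allow({"sso_authenticated": False, "device_posture": "healthy"})
```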


Agents are already writing your code. The question isn't "should we use them?" It's "how do we ship them without surprises?" Provenance gives you a ledger. Every line. Every agent. Every risk. Measurable. github.com/evalops/proven…
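"Every line. Every agent. Every risk. Measurable." could look roughly like this as data. A minimal sketch, assuming a per-line ledger keyed by author and content hash; the field names and `agent:`/`human:` prefixes are illustrative, not the actual Provenance schema.

```python
# Hypothetical sketch of a per-line provenance ledger; field names
# are assumptions, not the actual github.com/evalops schema.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class LedgerEntry:
    file: str
    line_no: int
    author: str      # e.g. "human:alice" or "agent:gpt-4o"
    risk: str        # coarse label, e.g. "low" / "high"
    line_hash: str   # content hash so later edits are detectable

def record_lines(file: str, lines: list[str], author: str, risk: str):
    """Append one ledger entry per line, hashing the content."""
    return [
        LedgerEntry(file, i + 1, author, risk,
                    hashlib.sha256(text.encode()).hexdigest()[:12])
        for i, text in enumerate(lines)
    ]

def agent_share(ledger) -> float:
    """Fraction of recorded lines attributed to agents — one measurable signal."""
    agent = sum(1 for e in ledger if e.author.startswith("agent:"))
    return agent / len(ledger) if ledger else 0.0
```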


We’re open-sourcing Smith — the Firecracker-based CI runner that powers EvalOps. Why rebuild Blacksmith? Because eval gating needs specialized infra — and we’re not forcing you onto our cloud. Run evals on EvalOps Cloud or your own. github.com/evalops/smith


I'm told we're doing awards now?


Everyone wants to move fast. @EvalOpsDev makes sure you don’t break trust along the way. Governed AI releases start here.

Shipped a new home for @EvalOpsDev. No fluff, just governed AI releases. Check it out -> evalops.dev



🔥 Just dropped an evaluation‑driven LoRA loop built on Tinker from @thinkymachines! It trains, benchmarks & iterates until your model meets the mark. It auto‑spots weaknesses, spawns targeted LoRA jobs & tracks improvements. Proof‑of‑concept repo: github.com/evalops/tinker…
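The train → benchmark → iterate shape of that loop can be sketched abstractly. This is a hedged outline, not the repo's code: `train_lora` and `run_benchmark` are placeholders for the Tinker-backed steps, whose real API isn't shown in the tweet.

```python
# Hypothetical sketch of an evaluation-driven LoRA loop: train,
# benchmark, target the weakest category, repeat until the model
# meets the mark. train_lora / run_benchmark are assumed callables.

def eval_driven_loop(train_lora, run_benchmark, target=0.9, max_rounds=5):
    """train_lora(focus) -> adapter; run_benchmark(adapter) -> {category: score}."""
    adapter, history = None, []
    focus = None  # category the next targeted LoRA job should attack
    for _ in range(max_rounds):
        adapter = train_lora(focus)
        scores = run_benchmark(adapter)
        history.append(scores)
        # Auto-spot the weakest category; spawn a targeted job next round.
        focus, low = min(scores.items(), key=lambda kv: kv[1])
        if low >= target:
            break
    return adapter, history
```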


Sick of yak-shaving to get a clean Transformers setup? We built a stack that just works: PyTorch + HF Transformers · Hydra configs · FastAPI serving + Prometheus · vLLM, LoRA, flash-attn, bitsandbytes. Reproducible. Dockerized. CI/CD baked in. github.com/evalops/stack


Developer resumes are frozen in time. GitHub tells the real story. 7k commits, +1.4M lines → now that’s a holographic trading card worth flexing. 🚀 cards.evalops.dev



EvalOps reposted

LLM vendor: “Just quantization.” Reality: reward-hacked code, broken workflows, lost week. Companies: “nbd.” Users: 🙃🔥 Making this a thing of the past.


🐙 Meet Mocktopus — multi‑armed mocks for your LLM apps! 🧪 Deterministic mocks for OpenAI‑style chat completions, tool calls & streaming. Make your evals deterministic, run CI offline, and record & replay 👉 github.com/evalops/mockto…
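The record-and-replay idea behind deterministic mocks is simple to sketch. The class and method names below are illustrative, not Mocktopus's actual API: requests are hashed so the same prompt always replays the same canned completion, which is what lets evals run offline in CI.

```python
# Hypothetical sketch of record & replay for an OpenAI-style chat
# endpoint; names here are assumptions, not the Mocktopus API.
import hashlib
import json

class MockChat:
    """Canned completions keyed by a hash of the request, so identical
    prompts always yield identical responses (deterministic evals)."""

    def __init__(self):
        self._recordings = {}  # request hash -> recorded response text

    @staticmethod
    def _key(messages, model):
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def record(self, messages, model, response_text):
        self._recordings[self._key(messages, model)] = response_text

    def create(self, messages, model):
        key = self._key(messages, model)
        if key not in self._recordings:
            raise KeyError("no recording for this request; record it first")
        # Mimic the shape of a chat.completions response.
        return {"choices": [{"message": {"role": "assistant",
                                         "content": self._recordings[key]}}]}
```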


New @EvalOpsDev research drop: Metaethical Breach DSPy ⚖️ Can models be coaxed into breaking their own guardrails if you wrap harmful prompts in moral philosophy? Repo: github.com/evalops/metaet…


Most eval tools tell you how your model did. @EvalOpsDev + DreamCoder tells you why. By synthesizing programs from data, DreamCoder finds hidden structure, invents evals, and surfaces failure grammars you didn’t think to test. Evals that discover the evals.

