Nimit Kalra

Hey there! I'm a researcher at Haize Labs in NYC, focusing on improving adversarial robustness and automated model-based evaluation for LLMs. I also co-host an AI Reading Group in NYC. I authored Verdict, a framework for specifying compound LLM judge systems. Before that, I spent three years at Citadel.

During my time at UT Austin, I focused mainly on computer vision and was advised by Philipp Krähenbühl on domain adaptation for real-world visuomotor navigation and robotics tasks. I graduated with a B.A. in Math, a B.S. in Computer Science, and a bunch of credits in Political Science and Economics.

I grew up near Dallas and spent most of my childhood up late hacking around on iOS jailbreaking/Cydia, Minecraft mods, USACO, and a bunch of other random things. Amidst this sprawling flatland, I discovered my love for the mountains.

Writing

Bootstrapping Supervision for LLM Self-Improvement — July 2025
High-quality reasoning supervision is expensive — not just in dollars, but in latency, throughput, and annotation consistency. Human-labeled traces are noisy, subjective, and particularly hard to scale for non-general domains, where instructions are idiosyncratic and objective ground truths are often elusive. As a result, many recent works attempt to bootstrap supervision directly from the pretrained model, leveraging it as a strong prior over reasoning behaviors.

Projects

[July 2025] spoken: Inference Wrapper for Speech-To-Speech Foundation Models
[May 2025] j1-micro & j1-nano: Tiny Variants of DeepSeek's Generative Reward Models
[Apr 2025] EvalsEvalsEvals: Automated Rubric Creation for LLM Evaluations
[Jan 2021] Point-Transformer for Map-View Priors in Autonomous Navigation
[Oct 2020] Domain Adaptation for Indoor PointGoal Navigation
[May 2020] Domain Adaptation via Multi-Task Distillation with Noisy Labels
[May 2020] A Bayesian Network Model for Sampling Dockless Scooter Traffic
[Oct 2020] Fast Random Kernelized Features: High-Dimensional SVM Classification
[Mar 2017] Composition of Real Flows

Publications/Preprints

Verdict: A Library for Compound LLM Judge Systems
Nimit Kalra, Leonard Tang
[arXiv]   [code]   [docs]

Open-source library for scaling test-time compute via graphs of chained prompted evaluators. We achieve SOTA/near-SOTA performance on a wide variety of challenging automated evaluation tasks without additional training or resorting to specification/prompt overfitting.

Constitutional Classifiers: Defending against Universal Jailbreaks
Anthropic Safeguards Research Team
[arXiv]   [blog]

Synthetic data recipe for training output classifiers with streaming prediction to flag harmful content according to an explicit constitution. Focus on adversarial data augmentation and red-teaming.

Domain Adaptation Through Task Distillation, ECCV 2020
Brady Zhou*, Nimit Kalra*, Philipp Krähenbühl
[arXiv]   [code]   [presentation]

We leverage dense vision labels (e.g., segmentation masks, which are freely available in simulators) to transfer navigation policies across visually diverse domains (maze navigation → autonomous driving). By training a policy that operates on labels, we can obtain action supervision in a new domain and distill an end-to-end visuomotor policy.

Adventures

I enjoy a good road trip.

Hiking

Emory Peak, Big Bend National Park

Cascade Mountain, Adirondack High Peaks Wilderness

Mt. Kosciuszko, Kosciuszko National Park

Chasm Lake via Longs Peak Trail, Rocky Mountain National Park

Rim-to-Rim, Grand Canyon National Park

Eiffel Lake / Parker Ridge, Banff National Park

Corkscrew Peak, Death Valley National Park

Mt. Charleston, Red Rock Canyon National Conservation Area

People

People who have had a major impact on me — whether a sparring buddy, mentor, or friend.

Jagath, Alex, Srujay, Leonard, Philipp, Brady, Yuwei, Saaketh, Dylan, Will, Prateek

Contact

I love meeting new people. Reach me at nimit@utexas.edu or schedule a chat.