Nimit Kalra

Hey there! I'm a research scientist at Haize Labs, where I tinker with making LLMs more reliable and robust. I also collaborate with Micah Goldblum's lab at Columbia and co-host an AI Reading Group in NYC. I authored Verdict, a framework for specifying compound LLM judge systems, and previously worked at Citadel.

My research centers on machine learning systems where distribution shift is not an artifact of poor dataset design, but an unavoidable consequence of deployment. Think evolving adversaries, target domains with limited supervision, or train-test gaps.

During my time at UT Austin, I focused mainly on computer vision and was advised by Philipp Krähenbühl on domain adaptation for real-world visuomotor navigation and robotics tasks. I grew up near Dallas and spent most of my childhood staying up too late hacking around with iOS jailbreaking/Cydia and Minecraft mods. Amidst the urban sprawl, I found escape in the mountains.

Writing

Bootstrapping Supervision for LLM Self-Improvement, July 2025
High-quality reasoning traces are noisy, subjective, and particularly hard to scale in specialized domains, where instructions are idiosyncratic and objective ground truths are often elusive. Instead of relying on costly human supervision, many recent approaches iteratively bootstrap from a base model, leveraging it as a strong prior over reasoning behaviors.
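
In pseudocode, the pattern looks roughly like the sketch below, a hypothetical STaR-style loop where `generate`, `verifier`, and `finetune` are illustrative names, not from the post: sample traces from the current model, filter with whatever weak verification signal is available, and fine-tune on the survivors.

```python
# Hypothetical sketch of the bootstrapping loop; all method names
# are illustrative placeholders, not from the post.

def bootstrap(model, prompts, verifier, rounds=3):
    for _ in range(rounds):
        # Sample reasoning traces from the current model (a strong prior).
        traces = [model.generate(p) for p in prompts]
        # Keep only traces that pass a cheap, possibly weak, verification signal.
        kept = [(p, t) for p, t in zip(prompts, traces) if verifier(p, t)]
        # Fine-tune on the survivors: the model supervises its own next round.
        model = model.finetune(kept)
    return model
```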

Projects

[July 2025] spoken: Inference Wrapper for Speech Foundation Models
[May 2025] j1-micro: Tiny DeepSeek Generative Reward Models
[Apr 2025] EvalsEvalsEvals: Automated Rubric Creation for LLM Evals
[Jan 2021] Map-View Point Transformers for Autonomous Navigation

Publications/Preprints

Closing the Train-Test Gap for Gradient-Based Planning
Arjun Parthasarathy*, Nimit Kalra*, Rohun Agrawal*,
Yann LeCun, Oumayma Bounou, Pavel Izmailov, Micah Goldblum

Although world models are trained on next-state prediction, at test time we use them to estimate actions via gradient descent. Our adversarial and online training methods close this train–test gap, enabling gradient-based planning to surpass costly search methods on object manipulation tasks.
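
Concretely, the test-time procedure can be sketched as follows (a minimal PyTorch illustration under assumed interfaces, not the paper's code): freeze the world model, treat the action sequence as the optimization variable, and descend on the distance between the predicted final state and the goal. The model only ever saw one-step prediction on real data during training, yet here gradients flow through a multi-step rollout of its own predictions, which is exactly the gap at issue.

```python
import torch

# Minimal sketch of gradient-based planning through a frozen world model;
# `world_model(state, action) -> next_state` is an assumed differentiable call.

def plan(world_model, s0, goal, action_dim, horizon=10, iters=100, lr=0.1):
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        s = s0
        for t in range(horizon):
            s = world_model(s, actions[t])  # unroll predicted dynamics
        loss = (s - goal).pow(2).sum()      # how far the rollout ends from the goal
        opt.zero_grad()
        loss.backward()                     # backprop through the entire rollout
        opt.step()
    return actions.detach()
```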

Verdict: A Library for Compound LLM Judge Systems
Nimit Kalra, Leonard Tang
[arXiv]   [code]   [docs]   [citations]

Open-source library for scaling test-time compute via graphs of chained prompted evaluators. We achieve SOTA/near-SOTA performance on a wide variety of challenging automated evaluation tasks without additional training or resorting to specification/prompt overfitting.
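
The core pattern is easy to sketch, though the snippet below is a generic illustration rather than Verdict's actual API: each evaluator is a prompted LLM call, and downstream evaluators condition on upstream outputs.

```python
# Generic sketch of the chained-evaluator pattern (not Verdict's API;
# `llm` stands in for any text-in/text-out model call).

def llm(prompt: str) -> str:
    """Placeholder for a prompted LLM call."""
    raise NotImplementedError

def unit(template):
    return lambda **kw: llm(template.format(**kw))

critique = unit("Critique this answer for factual errors:\n{answer}")
score = unit("Critique:\n{critique}\n\nGiven the critique, score 1-5:\n{answer}")

def judge(answer: str) -> str:
    c = critique(answer=answer)               # stage 1: free-form critique
    return score(answer=answer, critique=c)   # stage 2 conditions on stage 1
```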

Constitutional Classifiers: Defending Against Universal Jailbreaks…
— with the Anthropic Safeguards Research Team
[arXiv]   [blog]

Synthetic data recipe for training output classifiers with streaming prediction to flag harmful content according to an explicit constitution. Focus on adversarial data augmentation and red-teaming.
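
The streaming idea, roughly (a hedged sketch with placeholder names, not the deployed system): classify each growing prefix of the output so a harmful completion can be cut off mid-generation rather than only after the fact.

```python
# Hedged sketch of streaming output classification; `harm_prob` is an
# assumed classifier over partial outputs, not the actual system.

def guarded_stream(token_stream, harm_prob, threshold=0.9):
    prefix = ""
    for tok in token_stream:
        prefix += tok
        if harm_prob(prefix) > threshold:  # classifier sees the partial output
            return prefix, True            # flagged mid-generation
    return prefix, False                   # completed cleanly
```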

Domain Adaptation Through Task Distillation, ECCV 2020
Brady Zhou*, Nimit Kalra*, Philipp Krähenbühl
[arXiv]   [code]   [presentation]

We leverage dense vision labels (e.g., segmentation masks, which are freely available in simulators) to transfer navigation policies across visually diverse domains (maze navigation → autonomous driving). By training a policy that operates on labels, we can obtain action supervision in a new domain and distill an end-to-end visuomotor policy.
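
The distillation step can be sketched as follows (placeholder environment and policy interfaces, not the paper's code): in the target domain the simulator still emits labels, so the label-space policy provides action supervision for a pixel-space student.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the distillation step; `env.observe`/`env.step`
# and the teacher/student interfaces are assumptions, not the paper's code.

def distill(teacher: nn.Module, student: nn.Module, env, steps: int):
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for _ in range(steps):
        rgb, seg = env.observe()               # pixels + simulator-given labels
        with torch.no_grad():
            action = teacher(seg)              # teacher acts on labels only
        loss = nn.functional.mse_loss(student(rgb), action)
        opt.zero_grad()
        loss.backward()
        opt.step()
        env.step(action)                       # collect data on-policy
    return student
```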

Adventures

I enjoy a good road trip.

Hiking

Emory Peak, Big Bend National Park

Cascade Mountain, Adirondack High Peaks Wilderness

Mt. Kosciuszko, Kosciuszko National Park

Chasm Lake via the Longs Peak Trail, Rocky Mountain National Park

Rim-to-Rim, Grand Canyon National Park

Eiffel Lake / Parker Ridge, Banff National Park

Corkscrew Peak, Death Valley National Park

Mt. Charleston, Spring Mountains National Recreation Area

Contact

I love meeting new people. Reach me at nimit@utexas.edu or schedule a quick chat.