Background
Before graduate school, I spent ~3 years as a software engineer and an early member of the engineering and data science organizations at two high-growth startups: Roadie (acquired by UPS for $500M) and OneTrack.AI. I led efforts to scale data infrastructure to match growth and worked on a range of challenging problems, including embedded deep learning, fault-tolerant distributed systems, real-time adaptive pricing, and data pipelines for time-series and computer vision tasks.
Outside of research, I lift weights, read (here's my goodreads profile), watch mixed martial arts, and sometimes wonder whether randomness is real.
Research
Current Work
Prior Work
- Self-training / self-improvement
- Using uncertainty during reasoning and decision-making
- Vision-language representation learning
Some papers are highlighted.
One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
Zaid Khan,
Archiki Prasad,
Elias Stengel-Eskin,
Jaemin Cho,
Mohit Bansal
arXiv, 2025
project page
/
arXiv
How can an agent reverse-engineer the underlying laws of an unknown, hostile, and stochastic environment in "one life", without millions of environment steps or human-provided goals and rewards? We infer a symbolic world model, written in Python, for an unknown environment from a single episode!
OpenThoughts: Data Recipes for Reasoning Models
Etash Guha,
Ryan Marten,
Sedrick Keh,
Negin Raoof,
Georgios Smyrnis,
Hritik Bansal,
Marianna Nezhurina,
Jean Mercat,
Trung Vu,
Zayne Sprague,
Ashima Suvarna,
Benjamin Feuer,
Liangyu Chen,
Zaid Khan,
Eric Frankel,
Sachin Grover,
Caroline Choi,
Niklas Muennighoff,
Shiye Su,
Wanjia Zhao,
John Yang,
Shreyas Pimpalgaonkar,
Kartik Sharma,
Charlie Cheng-Jie Ji,
Yichuan Deng,
Sarah Pratt,
Vivek Ramanujan,
Jon Saad-Falcon,
Jeffrey Li,
Achal Dave,
Alon Albalak,
Kushal Arora,
Blake Wulfe,
Chinmay Hegde,
Greg Durrett,
Sewoong Oh,
Mohit Bansal,
Saadia Gabriel,
Aditya Grover,
Kai-Wei Chang,
Vaishaal Shankar,
Aaron Gokaslan,
Mike A. Merrill,
Tatsunori Hashimoto,
Yejin Choi,
Jenia Jitsev,
Reinhard Heckel,
Maheswaran Sathiamoorthy,
Alexandros G. Dimakis,
Ludwig Schmidt
arXiv, 2025
arXiv
/
openthoughts.ai
The fully-open OpenThoughts3 dataset consists of 1.2M reasoning traces and problems constructed by a pipeline designed through 1,000+ controlled experiments taking 40k H100/A100 hours.
The resulting OpenThinker3-7B model achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over the DeepSeek-R1-Distill-Qwen-7B model.
Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems
Zaid Khan,
Elias Stengel-Eskin,
Archiki Prasad,
Jaemin Cho,
Mohit Bansal
arXiv, 2025
project page
/
arXiv
What if we could transform advanced math problems into abstract programs that can generate endless, verifiable problem variants? EFAGen uses test-time search with execution feedback to infer executable functional abstractions (EFAs) in Python for diverse math problems, including Olympiad-level problems.
MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use
Zaid Khan,
Ali Farhadi,
Ranjay Krishna,
Luca Weihs,
Mohit Bansal,
Tanmay Gupta
arXiv, 2025
project page
/
arXiv
Neural tree search for repo-level code-use planning. MutaGReP explores plan space through LLM-guided mutations, while grounding each plan in functionality from the codebase using a symbol retriever.
Learning to Generate Unit Tests for Automated Debugging
Archiki Prasad*,
Elias Stengel-Eskin*,
Justin Chih-Yao Chen,
Zaid Khan,
Mohit Bansal
CoLM, 2025
code
/
arXiv
Testing is a critical part of software engineering: what if we could automatically discover inputs that break your code? We show how to train SLMs (Qwen2.5-7B + Llama3.1-8B) to generate unit tests that break code and are useful for debugging.
* Equal contribution
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
Zaid Khan,
Elias Stengel-Eskin,
Jaemin Cho,
Mohit Bansal
ICLR, 2025   Spotlight
project page
/
arXiv
A testbed of teacher environments for RL-style data generation agents that automate post-training: the process of improving a student model on diverse, open-ended tasks based on automatically discovered skills and weaknesses.
Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
Zaid Khan,
Vijay Kumar BG,
Samuel Schulter,
Yun Fu,
Manmohan Chandraker
CVPR, 2024
project page
/
arXiv
We show how to improve the program synthesis ability of an LLM from execution feedback and apply it to create a 7B model that writes programs that orchestrate other models to solve computer vision tasks.
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
Zaid Khan,
Yun Fu
CVPR, 2024
arXiv
We show how to identify unreliable responses from multimodal LLMs by examining the consistency of their responses over the neighborhood of a visual question, without requiring access to the model's internals.
Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan,
Vijay Kumar BG,
Samuel Schulter,
Manmohan Chandraker,
Yun Fu
NeurIPS, 2023
project page
/
arXiv
We show how to selectively decompose complex questions into simpler sub-questions to improve zero-shot performance on challenging multimodal reasoning tasks.
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan,
Vijay Kumar BG,
Samuel Schulter,
Xiang Yu,
Yun Fu,
Manmohan Chandraker
CVPR, 2023
code
/
arXiv
Getting labels for a multimodal dataset can be expensive. We show how you can use unlabeled images to improve performance on data-scarce multimodal tasks.
Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
Zaid Khan,
Yun Fu
ICLR, 2023
code
/
arXiv
We explore creating CLIP-like models by minimally updating already-trained vision and language models, finding that updating less than 7% of parameters can match full model training.
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan,
Vijay Kumar BG,
Xiang Yu,
Samuel Schulter,
Manmohan Chandraker,
Yun Fu
ECCV, 2022
project page
/
arXiv
We demonstrate a very data-efficient way to align vision and language by learning to reconstruct each modality from the other.
Exploiting BERT for Multimodal Target Sentiment Classification Through Input Space Translation
Zaid Khan,
Yun Fu
ACM MM, 2021   Oral
code
/
arXiv
Traditional sentiment analysis models struggle to understand the emotional content of social media posts.
We show that language models handle this well if the post is translated into a natural input space for them.
One Label, One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision
Zaid Khan,
Yun Fu
ACM FAccT, 2021
arXiv
Are notions of algorithmic fairness based on racial categories meaningful?
We study computer vision datasets that use racial categories, and empirically show that the racial categories encoded in each dataset are often highly inconsistent with each other and with human intuitions.