Human Baseliner for Open-Ended ML Research Tasks
$75 to $90/hr
## Overview
We are hiring experienced machine learning engineers and researchers to serve as **human baseliners** for evaluations of open-ended machine learning research tasks. These evaluations measure how well AI agents perform on realistic AI R&D problems. To interpret agent performance, we also need strong human reference points: skilled practitioners attempting the same tasks under the same time and compute constraints. As a baseliner, you will complete self-contained ML research tasks in a sandboxed environment, working independently with your preferred tools and workflow. Your performance will be used as a benchmark against which frontier-model agents are evaluated.
## What You’ll Do
- As a work trial, attempt open-ended machine learning research tasks under a fixed time and compute budget
- Work independently in a sandboxed Linux environment with internet access
- Use your preferred tooling, including IDEs and AI coding assistants such as Cursor, Claude Code, and ChatGPT
- Record your full working session via screen recording
- Complete a short pre-task and post-task questionnaire
- Submit your final work product, screen recording, and completed questionnaires. After this, you will be hired for a longer commitment
## Commitment
- Minimum of **20 hours per week** if selected
- More availability is strongly preferred
## Requirements
Candidates must meet **all** of the following:
- **3+ years of machine learning experience**
  - Time spent in a PhD program counts toward this requirement
  - Undergraduate and master’s experience does not count
- Attended a **top-100 university** or worked at **FAANG or a comparable company**
- Experience with at least one major ML framework such as **PyTorch, JAX, or TensorFlow**
- Deep, hands-on expertise in at least one of the focus areas below:
- Pretraining under tight data and compute budgets
- PPO, reward shaping, custom `gym` / `gymnasium` environments, and throughput tuning
- Full fine-tuning, LoRA, QLoRA, DPO, RLHF, RLAIF, and distillation
- Large-scale corpus filtering, deduplication, subsampling, and benchmark contamination avoidance
- Architecture design under strict parameter-count or size constraints
- Modifying pretrained architectures, including attention patterns, pooling heads, or training objectives
- Contrastive training for embedding or retrieval models
- Generative vision or video modeling
- Multilingual or low-resource language experience
- Image or video data pipelines at scale
- Experience balancing competing model objectives such as safety and capability
- Prior work as an ML evaluator, red-teamer, or baseliner
## Required Domain Expertise
Candidates must have strong practical experience in **at least one** of the following:
- **Pretraining**: training transformer language models from scratch
- **Reinforcement learning**: training agents in custom or existing environments
- **Post-training**: fine-tuning and aligning LLMs
- **Dataset curation**: building and cleaning large text corpora for LLM training
- **Model architecture**: designing and modifying neural network architectures
## Logistics (work trial requirements)
- One baseline attempt per contractor per task; a contractor may not attempt the same task twice
- All work is confidential and covered by NDA
- Compute and environment are provided; no personal GPU is required
How it works: apply here and we will connect you with our hiring partner for this role. By continuing, you agree that we may forward your application.