Member of Technical Staff, Post-Training, RL

Mirendil

United States, California

$350,000 - $500,000 / year

Overview

Schedule: Full-time
Career level: Senior-level
Work arrangement: On-site
Compensation: $350,000-$500,000/year
Benefits: Paid Vacation

Job Description

Mirendil

Mirendil is a tech-first company focused on solving core bottlenecks that unlock step-change acceleration across science and technology. Our first goal is to democratize frontier AI R&D across scientific disciplines. We believe accelerating scientific discovery is one of the most powerful ways to improve the future of humanity, and that AI will play a central role in making that possible.

We are building a frontier AI research company and training our own models end-to-end. Our work spans areas such as model training, reinforcement learning, reasoning systems, and infrastructure for large-scale experiments. Our team includes researchers and engineers from Anthropic, Google DeepMind, xAI, OpenAI, Microsoft, Apple, and MIT.

The Role

We are looking for research engineers to help build the post-training stack for frontier reasoning models.

This role sits at the point where model capability, training dynamics, data, verification, and infrastructure all meet. You will design and run the experiments that turn a strong base model into a model that can solve difficult tasks reliably: choosing training objectives, shaping data mixtures, building verifiers, debugging reward signals, scaling runs, and understanding why a recipe works or fails.

Researchers are also expected to have strong engineering skills. The best work here will involve both: forming hypotheses about training behavior, implementing them in real systems, running large-scale experiments, reading the resulting traces carefully, and turning the lessons into the next training run.

Some areas you may work on include:

  • Post-training recipes: Develop and iterate on RL, SFT, and distillation recipes. Understand how choices in objectives, data mixtures, hyperparameters, rollout generation, and filtering affect efficiency, stability, capability, and final model behavior.

  • Scaling RL: Make post-training work at larger scales: more tokens, longer trajectories, larger models, more steps, and larger compute budgets. This includes identifying the bottlenecks that appear only when an approach leaves the small-run regime.

  • Long-horizon reasoning: Train models on tasks where success depends on many intermediate decisions. Develop methods for assigning useful feedback across long trajectories, where sparse rewards, credit assignment, exploration, and verification all become harder.

  • Off-policy and asynchronous training: Work on training regimes where data is generated by older policies, different policies, or partially filtered policies. Build intuition and tooling for when off-policy data helps, when it hurts, and how to control the resulting instabilities.

  • Verification and reward quality: Build robust verification pipelines for tasks where correctness can be checked automatically or semi-automatically. Detect and reduce reward hacking, false positives, brittle verifiers, and other failure modes that make RL look better than it really is.

  • Multi-task post-training: Scale recipes across different task families and domains. Study the tradeoffs between specialization and generality, and design training mixtures that improve all capabilities together.

  • Experiment analysis and debugging: Develop a deep empirical understanding of training runs. Diagnose regressions, separate real improvements from noise, design better ablations, and build the probes and analyses needed to make post-training less opaque.

  • End-to-end execution: Work closely with systems, infrastructure, and data teams to get experiments from idea to production-scale runs. This includes making training pipelines reliable, ensuring data and verifier quality, and turning successful experiments into repeatable and scalable recipes.

If you're excited about building the infrastructure that makes frontier RL research possible at scale, we'd love to hear from you.

We offer a base salary of $350,000–$500,000 USD and a meaningful equity grant, depending on experience and background, along with competitive benefits.

FAQs About Member of Technical Staff, Post-Training, RL Jobs at Mirendil

What is the work location for this position at Mirendil?
This job at Mirendil is located in California, United States, according to the details provided by the employer. Some roles may also include multiple work locations, depending on requirements.
What pay range can candidates expect for this role at Mirendil?
Candidates can expect a pay range of $350,000 to $500,000 per year.
What employment type applies to this position at Mirendil?
Mirendil lists this role as a Full-time position.
What experience level is required for this role at Mirendil?
Mirendil is looking for a candidate with senior-level experience.
What benefits are offered by Mirendil for this role?
Mirendil offers Paid Vacation for this position. Actual benefits may vary depending on the employer's policies and employment terms.
What is the process to apply for this position at Mirendil?
You can apply for this role at Mirendil either through Sonara's automated application system or manually, using the direct link on the job page.