
Director of Engineering, Production Infrastructure

Job Description
Klaviyo's mission is to help businesses own their growth. To do that at scale, our engineers need rock‑solid platform primitives: clear, reliable building blocks that make it simple to run, store, observe, and ship in production. As Director, Production Infrastructure, you will lead and build a product‑quality platform that accelerates every R&D team's path from idea to customer value. You and your team will define what is (and isn't) a platform primitive, set strong ownership boundaries, and deliver "golden paths" that answer questions like: "I want to run X, where do I run it?", "My product needs a single table to store a bit of data, where should I put it?", and "How do I get data from my service to the frontend?"
This is a hands‑on, execution‑first leadership role for a platform‑minded builder who measures success in developer velocity, system reliability, and business impact.
What You'll Do
- Own the Production Infrastructure charter. Define the platform primitives Klaviyo provides (compute runtimes, data storage options, messaging/eventing, service networking, observability) and the clear "contract" for each: APIs, SLIs/SLOs, support model, and runbooks, ensuring consistency with our company‑wide operational excellence best practices.
- Publish golden paths and decision trees that make default choices obvious (e.g., "run X here," "store a bit here," "expose data to frontend via Y"), minimizing one‑off work and increasing self‑service.
- Raise reliability and safety bars across production: incident prevention and response (blameless postmortems, on‑call health), change management, capacity planning, and resilient multi‑tenant patterns.
- Accelerate developer velocity by improving time‑to‑first‑service, deployment lead time, and mean time to recovery; partner with product teams to remove infrastructure bottlenecks and reduce cognitive load.
- Engineer for cost‑effectiveness at scale. Establish clear cost guardrails, usage quotas, and right‑sizing policies; partner with Finance and Security to balance spend, risk, and speed.
- Lead and grow high‑performing teams of managers and senior ICs; set crisp goals, coach for impact, and cultivate an inclusive, ownership‑driven culture.
- Partner cross‑functionally with engineering leaders, security, and others to sequence investments, clarify ownership boundaries, and land platform changes safely.
- Measure what matters. Define and report a concise scorecard (e.g., SLO coverage, incident frequency/severity, lead time for changes, MTTR, developer NPS for platform, infra cost‑to‑serve).
- Transform workflows by putting AI at the center, building smarter systems and ways of working from the ground up; continuously experiment with AI tools and share learnings to keep the org ahead of the curve.
Who You Are
- Platform‑minded, execution‑oriented leader with a track record of building and operating production platforms at scale (e.g., multi‑tenant compute, storage, networking, CI/CD, observability). You prioritize measurable outcomes such as reliability, efficiency, and developer productivity.
- Experienced people leader: 10+ years in infrastructure/SRE/platform engineering, including 5+ years managing managers and senior ICs; you set high bars, coach well, and build inclusive teams.
- Reliability first. Deep familiarity with SRE practices, SLO/SLI design, incident management, capacity planning, and operational readiness.
- Great system thinker & communicator. You reduce ambiguity, create clarity in docs and diagrams, and influence across product, data, and security to land org‑wide changes.
- Outcome‑driven and accountable. You set crisp goals, instrument the work, and hold teams to impact, not just activity. You're comfortable saying "no" and narrowing scope to ship.
- AI‑curious and hands‑on. You've already experimented with AI in work or personal projects and are eager to learn fast, using AI responsibly to make your team's work smarter and more efficient.
- Technical stack familiarity (mix of): public cloud (AWS/GCP), container orchestration, service meshes/ingress, data stores (SQL/NoSQL/object), eventing/streaming, IaC, and modern observability.
Nice to Haves
- Experience productizing internal platforms (treating infra as a product with SLAs, roadmaps, and developer experience metrics).
- Background in data or event‑driven architectures at scale; prior partnership with a centralized data platform (e.g., KDP) to define clean ownership boundaries.
- Prior success improving cost‑to‑serve and reliability in a high‑growth SaaS environment.
We use Covey as part of our hiring and/or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate‑submitted applications. We began using Covey Scout for Inbound on April 3, 2025.
Please see the independent bias audit report covering our use of Covey here
