Director, Engineering Operations And Site Reliability Engineering - Datacenter Server Systems

Nvidia, Santa Clara, CA

$292,000 - $442,750 / year

Overview

Schedule: Full-time
Career level: Executive
Remote: On-site
Compensation: $292,000-$442,750/year
Benefits: Paid Vacation

Job Description

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

NVIDIA is seeking a strong technology leader to head Engineering Operations and Site Reliability Engineering for our next-generation datacenter server systems. This role sits at the intersection of execution, reliability, automation, and large-scale system operations, where we keep NVIDIA's rack-scale systems healthy, observable, and highly available for internal engineering users. These systems bring together the full power of NVIDIA CPUs, GPUs, NVLink, InfiniBand/Spectrum-X networking, cluster management technologies, and our optimized AI/HPC software stack. We enable fast product development by ensuring large internal racks, clusters, and lab infrastructure are reliable, well-instrumented, and operated with scalable engineering practices. This is a technical leadership role focused on execution excellence for large-scale internal datacenter systems. The ideal candidate has strong engineering judgment, experience operating complex distributed infrastructure, and the ability to build teams that combine focused operations with automation-first software engineering.

What you will be doing:

  • Lead teams that help us ensure NVIDIA's internal rack-scale server systems, clusters, and lab facilities remain available, healthy, and reliable.

  • Drive execution across fleet operations, incident response, roadmap planning, change management, operational readiness, and reliability metrics.

  • Build automation, telemetry, alerting, and dashboards that improve visibility and help teams resolve issues faster.

  • Partner with hardware, firmware, software, networking, validation, and infrastructure teams to deploy, sustain, and debug complex systems.

  • Create feedback loops into NPI and sustaining teams to improve product quality, serviceability, and development velocity.

  • Grow and mentor a high-performing technical team with a culture of ownership, learning, and automation-first execution.

What we need to see:

  • BS or MS in Computer Science, Electrical Engineering, Computer Engineering, or related field (or equivalent experience).

  • 12+ years of overall experience in infrastructure, systems engineering, reliability, datacenter operations, distributed systems, or related areas, including 7+ years of people management experience.

  • Strong understanding of server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure.

  • Experience operating complex systems with high availability expectations, including monitoring, incident management, automation, and fleet-health practices.

  • Proven track record of driving execution across multiple teams, priorities, and technical domains, including close partnership with hardware, firmware, software, networking, validation, and infrastructure organizations.

  • Clear written and verbal communication skills, including executive-level reporting on operational health, risks, and priorities.

  • Track record of building cohesive teams and developing technical leaders who improve reliability and execution.

Ways to stand out from the crowd:

  • Prior Director or Senior Manager experience leading infrastructure, reliability, platform engineering, or large-scale lab operations teams.

  • Experience operating GPU, AI, HPC, cloud, or hyperscale datacenter infrastructure.

  • Broad knowledge of rack-scale systems, including server management, networking, storage, power, thermal, and RAS concepts.

  • Experience building automation, telemetry, fleet health, or dashboarding systems that improve product quality, serviceability, or engineering velocity.

Do you enjoy making complex AI infrastructure reliable at scale while enabling engineering teams to move faster? Come join our datacenter server systems team and help build the reliable, token-efficient computing platforms driving NVIDIA's success in this exciting and rapidly growing field.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 292,000 USD - 442,750 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 30, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

FAQs About Director, Engineering Operations And Site Reliability Engineering - Datacenter Server Systems Jobs at Nvidia

What is the work location for this position at Nvidia?
This job at Nvidia is located in Santa Clara, CA, according to the details provided by the employer. Some roles may also include multiple work locations depending on the requirement.
What pay range can candidates expect for this role at Nvidia?
Candidates can expect a pay range of $292,000 to $442,750 per year.
What employment type applies to this position at Nvidia?
Nvidia lists this role as a Full-time position.
What experience level is required for this role at Nvidia?
Nvidia is looking for a candidate at the "Executive" experience level.
What benefits are offered by Nvidia for this role?
Nvidia offers Paid Vacation for this position. Actual benefits may vary depending on the employer's policies and employment terms.