Software Engineering – Inference Engineer

Automate your job search with Sonara.

Submit 10x as many applications with less effort than one manual application.

Reclaim your time by letting our AI handle the grunt work of job searching.

We continuously scan millions of openings to find your top matches.


Overview

Schedule
Full-time
Career level
Senior-level
Remote
Hybrid remote

Job Description

Location: San Francisco, CA (Onsite | Remote)

About Virtue AI

Virtue AI sets the standard for advanced AI security platforms. Built on decades of foundational and award-winning research in AI security, its AI-native architecture unifies automated red-teaming, real-time multimodal guardrails, and systematic governance for enterprise apps and agents. Deploy in minutes—across any environment—to keep your AI protected and compliant. We are a well-funded, early-stage startup founded by industry veterans, and we're looking for passionate builders to join our core team.

What You’ll Do

As an Inference Engineer, you will own how models are served in production. Your job is to make inference fast, stable, observable, and cost-efficient, even under unpredictable workloads.

You will:

  • Serve and optimize inference for LLMs, embedding models, and other ML models across multiple model families

  • Design and operate inference APIs with clear contracts, versioning, and backward compatibility

  • Build routing and load-balancing logic for inference traffic

    • Multi-model routing

    • Fallback and degradation strategies

    • Serving engines such as vLLM or SGLang

  • Package inference services into production-ready Docker images

  • Implement logging and metrics for inference systems

    • Latency, throughput, token counts, GPU utilization

    • Prometheus-based metrics

  • Analyze server uptime and failure modes

    • GPU OOMs, hangs, slowdowns, fragmentation

    • Recovery and restart strategies

  • Design GPU and model placement strategies

    • Model sharding, replication, and batching

    • Tradeoffs between latency, cost, and availability

  • Work closely with backend, platform (Cloud, DevOps), and ML teams to align inference behavior with product requirements
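The routing and fallback responsibilities above can be sketched minimally. This is an illustrative toy, not the company's actual router: the backend registry, health map, and names like `gpu-pool-a` are all hypothetical.

```python
# Hypothetical registry: each logical model name maps to an ordered list
# of serving backends; earlier entries are preferred, later ones are fallbacks.
BACKENDS = {
    "chat-large": ["gpu-pool-a", "gpu-pool-b"],
    "embed-small": ["gpu-pool-c"],
}

# Health state would normally come from probes; hard-coded here for illustration.
HEALTHY = {"gpu-pool-a": False, "gpu-pool-b": True, "gpu-pool-c": True}

def route(model: str) -> str:
    """Return the first healthy backend for `model`, falling back in order."""
    for backend in BACKENDS.get(model, []):
        if HEALTHY.get(backend, False):
            return backend
    raise RuntimeError(f"no healthy backend for {model!r}")
```

With `gpu-pool-a` marked unhealthy, `route("chat-large")` degrades to `gpu-pool-b`; a production router would layer load balancing, retries, and circuit breaking on top of this ordering.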

What Makes You a Great Fit

You understand that inference is a systems problem, not just a model problem. You think in QPS, p99 latency, GPU memory, and failure domains.
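As a concrete illustration of the p99 mindset, here is a nearest-rank percentile over a window of latency samples; the numbers are made up, and the point is how a single outlier dominates the tail while leaving the median untouched.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p percent of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Ten illustrative request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]
p50 = percentile(latencies_ms, 50)   # typical request: 13 ms
p99 = percentile(latencies_ms, 99)   # tail dominated by the outlier: 250 ms
```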

Required Qualifications

  • Bachelor’s degree or higher in CS, CE, or related field

  • Strong experience serving LLMs and embedding models in production

  • Hands-on experience designing:

    • Inference APIs

    • Load balancing and routing logic

  • Experience with SGLang, vLLM, TensorRT, or similar inference frameworks

  • Strong understanding of GPU behavior

    • Memory limits, batching, fragmentation, utilization

  • Experience with:

    • Docker

    • Prometheus metrics

    • Structured logging

  • Ability to debug and fix real inference failures in production

  • Experience with autoscaling inference services

  • Familiarity with Kubernetes GPU scheduling

  • Experience supporting production systems with real SLAs

  • Comfortable operating in a fast-paced startup environment with high ownership
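For the structured-logging requirement above, one common pattern is emitting one JSON object per inference event through the standard `logging` module. The field names below are illustrative, not a prescribed schema.

```python
import json
import logging

logger = logging.getLogger("inference")

def inference_record(model: str, latency_ms: float, output_tokens: int) -> str:
    """Serialize one inference event as a single JSON log line."""
    return json.dumps(
        {
            "event": "inference_complete",
            "model": model,
            "latency_ms": latency_ms,
            "output_tokens": output_tokens,
        },
        sort_keys=True,
    )

def log_inference(model: str, latency_ms: float, output_tokens: int) -> None:
    # One JSON object per line keeps logs machine-parseable downstream.
    logger.info(inference_record(model, latency_ms, output_tokens))
```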

Preferred Qualifications

  • Experience with GPU-level optimization

    • Memory planning and reuse

    • Kernel launch efficiency

    • Reducing fragmentation and allocator overhead

  • Experience with kernel- or runtime-level optimization

    • CUDA kernels, Triton kernels, or custom ops

  • Experience with model-level inference optimization

    • Quantization (FP8 / INT8 / BF16)

    • KV-cache optimization

    • Speculative decoding or batching strategies

  • Experience pushing inference efficiency boundaries (latency, throughput, or cost)
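As a back-of-the-envelope illustration of why KV-cache optimization matters, the per-sequence cache size follows directly from the model shape. The Llama-style numbers below are illustrative, not tied to any model this role would serve.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   dtype_bytes: int, tokens: int) -> int:
    """Per-sequence KV-cache size: K and V (factor 2) stored per layer,
    per KV head, per head dimension, per token, at the dtype's width."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# Illustrative 32-layer model with 8 KV heads (GQA), head_dim 128,
# FP16 weights for the cache (2 bytes), and a 4096-token context:
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                      dtype_bytes=2, tokens=4096)
size_mib = size / 2**20   # 512 MiB per sequence at full context
```

Halving the dtype width (e.g. FP8) or reducing KV heads cuts this linearly, which is exactly the lever that quantization and KV-cache optimization pull.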

Why Join Virtue AI

  • Competitive salary + equity

  • Direct ownership of inference reliability and performance

  • Hard problems at the intersection of systems, GPUs, and AI

  • Production impact – Your work directly affects latency, cost, and uptime

  • Strong technical culture – Engineers who debug and optimize, not just prototype


FAQs About Software Engineering – Inference Engineer Jobs at Virtue AI

What is the work location for this position at Virtue AI?
This job at Virtue AI is located in San Francisco, California, according to the details provided by the employer. Some roles may include multiple work locations depending on requirements.
What pay range can candidates expect for this role at Virtue AI?
Employer has not shared pay details for this role.
What employment type applies to this position at Virtue AI?
Virtue AI lists this role as a Full-time position.
What experience level is required for this role at Virtue AI?
Virtue AI is looking for candidates at the "Senior-level" experience level.
What is the process to apply for this position at Virtue AI?
You can apply for this role at Virtue AI either through Sonara's automated application system, which helps you submit applications 10X faster with minimal effort, or by applying manually using the direct link on the job page.