Site Reliability Engineer (SRE) — Full-Time (Remote US)

By info.lockedinai | June 10, 2026

About LockedIn AI
LockedIn AI is the #1 real-time AI interview and meeting copilot platform, trusted by over 1 million users worldwide. We build high-performance AI systems that power live interview assistance, coding support, and real-time communication tools used in mission-critical career moments.

Role Overview
We are hiring a proactive, systems-minded Site Reliability Engineer to ensure that all production systems at LockedIn AI are reliable, scalable, performant, and resilient.

This is a high-impact role where uptime and latency directly define user experience. Our users rely on real-time AI during live interviews — every millisecond and every failure matters.

You will own reliability across cloud infrastructure, APIs, AI inference systems, and real-time data pipelines serving 1M+ users.

Key Responsibilities
Reliability, Availability & Performance
Own system reliability across APIs, AI inference services, and real-time applications
Define and manage SLIs, SLOs, and error budgets aligned with product goals
Design fault-tolerant and self-healing architectures for production systems
Continuously optimize latency, throughput, and system performance

Infrastructure & Cloud Engineering
Build and manage cloud infrastructure (AWS, GCP, or Azure)
Design and operate Kubernetes-based production environments
Implement Infrastructure as Code using Terraform, Pulumi, or CloudFormation
Optimize infrastructure cost while maintaining high availability

Observability & Monitoring
Build observability systems using Prometheus, Grafana, Datadog, ELK, or similar tools
Design dashboards, alerting systems, and distributed tracing pipelines
Reduce alert noise while ensuring critical issues are detected instantly
Monitor AI-specific metrics such as latency, throughput, and GPU utilization

Incident Response & Reliability Engineering
Lead incident response for outages, performance issues, and service degradation
Participate in on-call rotations and coordinate cross-team resolution efforts
Conduct blameless postmortems and drive systemic improvements
Develop runbooks, playbooks, and escalation procedures

CI/CD & Deployment Engineering
Build and maintain CI/CD pipelines for fast and safe deployments
Implement canary releases, blue-green deployments, and automated rollback systems
Ensure safe deployment of application code, AI models, and configuration changes
Improve deployment velocity without compromising reliability

Security & Compliance
Implement infrastructure security best practices including IAM, encryption, and secrets management
Maintain secure, privacy-first infrastructure aligned with company standards
Perform vulnerability scanning, patching, and system hardening
Ensure compliance with internal security policies and best practices

Required Qualifications
Experience
3+ years in Site Reliability Engineering, DevOps, or infrastructure roles
Experience managing production systems at scale
Strong incident response and postmortem experience
Startup or high-growth environment experience preferred

Technical Skills
Strong programming skills (Python, Go, or similar)
Deep experience with cloud platforms (AWS, GCP, or Azure)
Kubernetes, Docker, and container orchestration expertise
Infrastructure as Code (Terraform, Pulumi, CloudFormation, or similar)
CI/CD systems (GitHub Actions, GitLab CI, ArgoCD, Jenkins, etc.)
Observability tools (Prometheus, Grafana, Datadog, ELK/OpenSearch)

Preferred Qualifications
Experience with AI/ML infrastructure or model serving systems
Background in real-time systems or low-latency architectures
Familiarity with chaos engineering and fault injection testing
Experience monitoring AI systems (latency, tokens, GPU usage, drift)
Multi-cloud or hybrid-cloud experience
Contributions to SRE or infrastructure open-source projects

What We Offer
Early-stage equity ownership in LockedIn AI
Direct impact on a platform used by 1M+ users worldwide
Remote-first flexibility with optional NYC collaboration
High-ownership role over production reliability systems
Fast-paced AI-native engineering environment
Strong autonomy in system design and reliability strategy

Why Join
Own reliability for a real-time AI system used in critical interview moments
Ensure low latency and high uptime for 1M+ users
Work on distributed systems powering AI inference at scale
Shape the reliability culture of a fast-growing AI company
Build systems where performance directly impacts user success

How to Apply
Please submit:

Resume or CV
Short note covering why you want to join LockedIn AI, whether you have used the platform, and what improvements you would suggest
GitHub, incident writeups, or infrastructure projects (optional but encouraged)

Equal Opportunity
LockedIn AI is committed to building an inclusive workplace. All hiring decisions are based on merit, skills, and business needs.

To apply for this job, email your details to info.lockedinai@gmail.com