Senior Site Reliability Engineer

General Repair

San Francisco, CA

February 3, 2026

Full Time

Application ends: July 30, 2026

Job Description

Join Our Team As

Position: Senior Site Reliability Engineer
Company: Citizen Health

Job Brief

About the Job

Who We Are

Citizen Health was founded on the belief, shaped by firsthand lived experiences navigating healthcare, that having the right advocate is the single most important factor in achieving better care and outcomes. By uniquely combining data, AI, and community, Citizen is building a personalized AI advocate powered by patients’ complete medical histories and data from thousands of other patients to generate personalized insights for clinical decisions and day-to-day challenges. Starting in rare and complex conditions, patients share their data in exchange for value, giving biopharma and researchers seamless access to patients and regulatory-grade data and shaving years off drug development for much-needed treatments.

Citizen is founded by experienced entrepreneurs with multiple successes under their belts and is funded by top-tier investors including 8VC, Transformation Capital, and Headline Ventures, among others. We are a mission-driven team excited to be building the future of consumer healthcare.

The Role

Citizen Health is seeking a Senior Site Reliability Engineer (SRE) to ensure the resilience, performance, and availability of our AI-powered, patient-centric healthcare platform.

In this hands-on, high-impact role, you will apply software engineering principles to operational challenges: designing and maintaining reliable systems that scale, fail gracefully, and recover quickly. You’ll work cross-functionally to establish SLOs/SLIs, implement robust observability, establish a metrics-driven approach to service performance, and drive improvements in incident response, fault tolerance, and service reliability.

If you’re passionate about building systems that stay up, scale well, and recover fast, and you thrive on solving reliability challenges in modern cloud-native environments, we’d love to talk to you.

Experience

5 years of work experience required

Job Detail

Responsibilities

Reliability Engineering & Observability

Define and measure service reliability through SLIs, SLOs, and error budgets
Implement and operate observability tooling (e.g., New Relic, Prometheus) across cloud and Kubernetes environments
Analyze logs, traces, and metrics to surface actionable insights and improve system health
Perform capacity planning, load testing, and performance profiling and tuning to support scale, reliability, and system performance

Resilience & Automation

Design and maintain resilient, self-healing infrastructure in AWS and Kubernetes (EKS)
Conduct chaos engineering experiments, failure mode analysis, and disaster recovery drills to proactively identify and fix weaknesses
Build automation to reduce toil, improve reliability metrics (latency, uptime, error rates, MTTD, and MTTR), and prevent recurrence of incidents
Engineer infrastructure for fault tolerance, auto-scaling, and graceful degradation

Incident Response & Operations

Drive incident response efforts, manage on-call rotations, and coordinate resolution of production outages
Conduct root cause analyses and blameless postmortems to drive learning and resilience
Continuously improve key reliability metrics such as latency, uptime, error rates, and availability
Collaborate with security, platform, and DevOps teams to ensure high availability and production readiness of services

Cross-Team Collaboration & Culture

Collaborate with engineering teams during design and architecture phases to assess and mitigate reliability risks
Support progressive delivery strategies including feature flags and canary deployments
Champion SRE principles and practices, helping build a culture of resilience and shared ownership
Stay current with emerging practices in cloud reliability, observability, and SRE tooling

Who You Are

You are a hands-on individual who thrives in fast-paced, high-stakes environments where reliability is mission-critical
You bring deep experience operating distributed systems at scale, with a strong foundation in cloud-native infrastructure, Kubernetes, and observability
You think like a software engineer but focus like an operator, using code to solve operational challenges
You’re driven by making systems more resilient, reducing downtime, and building fault-tolerant architectures that scale with user demand
You value data-driven decision making, and see SLIs/SLOs, incident postmortems, and continuous improvement as essential tools, not checkboxes
You have a strong sense of ownership, thrive under pressure, and believe the best systems are the ones that heal themselves
You collaborate closely across teams, care deeply about the end-user experience, and are always looking for better ways to keep complex systems running smoothly

Must-Have Skills

5 years in Site Reliability, DevOps, or Infrastructure Engineering roles
Strong software engineering skills in languages such as Python, Go, or Bash
Deep expertise operating production systems in AWS, with additional experience in GCP or Azure
Proven experience operating and scaling Kubernetes (EKS) in production
Experience implementing GitOps with FluxCD (or similar)
Hands-on experience implementing observability, auto-scaling, and self-healing systems
Strong foundation in networking, load balancing, CDN, and container orchestration
Solid knowledge of performance optimization techniques (e.g., profiling, caching, tuning)
Strong incident response background, including postmortems and SLO/SLI development, driving improvements through data and analysis
Passion for reliability, automation, operational excellence, and building systems patients and clinicians can trust
Excellent communication and collaboration skills
A methodical approach to problem-solving and system design

Preferred Skills

Deep understanding of distributed systems, fault tolerance, and failure modes
Experience implementing chaos engineering practices (Gremlin, Chaos Mesh, etc.)
Familiarity with multi-region, multi-cloud reliability strategies
Experience with service meshes (e.g., Istio) and resilience patterns
Solid background in security, compliance, and operational hardening (HIPAA, SOC 2)
Experience in capacity planning, scaling, and disaster recovery design

Why Join Us?

At Citizen Health, you will be part of an ambitious mission to build innovative AI-driven solutions that redefine the consumer healthcare experience and improve outcomes for millions. We value curiosity, creativity, and collaboration, and we believe in empowering every team member to have an impact. Here, you’ll find:

The chance to work on pioneering AI technologies and solve complex problems that push the boundaries of what’s possible in healthcare
The freedom to lead your projects and shape the company’s direction in a mission-driven environment where your work makes a real impact
A fast-paced, high-growth setting that brings expanding opportunities and clear paths for career progression
A supportive culture of learning, experimentation, and innovation, where new ideas are encouraged and explored
A commitment to empathy and diversity, recognizing that our greatest strengths come from our varied perspectives and experiences
Regular team activities and knowledge sharing, all within a culture that prioritizes well-being and connection

Additional Perks

Competitive salary and equity package
Comprehensive health, dental, and vision insurance
Unlimited paid time off, including generous parental leave
Flexible hybrid work environment

Don’t meet every qualification? No worries! We believe that passion, curiosity, and the right mindset are just as important as a checklist of skills. If you’re excited about what we’re building, we encourage you to apply.

Our Commitment to Diversity & Inclusion: Citizen Health is proud to be an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We welcome applicants of all backgrounds, identities, and experiences. Everyone deserves an equal chance to contribute and grow here, regardless of race, gender identity, sexual orientation, religion, national origin, age, disability, or veteran status.

Compensation Range: $160K – $190K
