Senior Site Reliability Engineer

Application ends: July 30, 2026

Job Description

Join Our Team As
Position: Senior Site Reliability Engineer
Company: Citizen Health

Job Brief

About the Job

Who We Are

Citizen Health was founded on the belief, shaped by firsthand lived experiences navigating healthcare, that having the right advocate is the single most important factor in achieving better care and outcomes. By uniquely combining data, AI, and community, Citizen is building a personalized AI advocate powered by patients’ complete medical histories and data from thousands of other patients to generate personalized insights for clinical decisions and day-to-day challenges. Starting in rare and complex conditions, patients share their data in exchange for value, giving biopharma and researchers seamless access to patients and regulatory-grade data, shaving years off drug development for much-needed treatments.

Citizen is founded by experienced entrepreneurs with multiple successes under their belts and is funded by top-tier investors including 8VC, Transformation Capital, and Headline Ventures, among others. We are a mission-driven team excited to be building the future of consumer healthcare.

The Role

Citizen Health is seeking a Senior Site Reliability Engineer (SRE) to ensure the resilience, performance, and availability of our AI-powered, patient-centric healthcare platform.

In this hands-on, high-impact role, you will apply software engineering principles to operational challenges: designing and maintaining reliable systems that scale, fail gracefully, and recover quickly. You’ll work cross-functionally to establish SLOs/SLIs, implement robust observability, establish a metrics-driven approach to service performance, and drive improvements in incident response, fault tolerance, and service reliability.

If you’re passionate about building systems that stay up, scale well, and recover fast, and you thrive on solving reliability challenges in modern cloud-native environments, we’d love to talk to you.

Responsibilities

Reliability Engineering & Observability

Experience
5 years of work experience is required

Job Detail
  • Define and measure service reliability through SLIs, SLOs, and error budgets
  • Implement and operate observability tooling (e.g., New Relic, Prometheus) across cloud and Kubernetes environments
  • Analyze logs, traces, and metrics to surface actionable insights and improve system health
  • Perform capacity planning, load testing, and performance profiling and tuning to support scale, reliability, and system performance


Resilience & Automation

  • Design and maintain resilient, self-healing infrastructure in AWS and Kubernetes (EKS)
  • Conduct chaos engineering experiments, failure mode analysis, and disaster recovery drills to proactively identify and fix weaknesses
  • Build automation to reduce toil, improve reliability metrics (latency, uptime, error rates, MTTD, and MTTR), and prevent recurrence of incidents
  • Engineer infrastructure for fault tolerance, auto-scaling, and graceful degradation


Incident Response & Operations

  • Drive incident response efforts, manage on-call rotations, and coordinate resolution of production outages
  • Conduct root cause analyses and blameless postmortems to drive learning and resilience
  • Continuously improve key reliability metrics such as latency, uptime, error rates, and availability
  • Collaborate with security, platform, and DevOps teams to ensure high-availability and production-readiness of services


Cross-Team Collaboration & Culture

  • Collaborate with engineering teams during design and architecture phases to assess and mitigate reliability risks
  • Support progressive delivery strategies including feature flags and canary deployments
  • Champion SRE principles and practices, helping build a culture of resilience and shared ownership
  • Stay current with emerging practices in cloud reliability, observability, and SRE tooling


Who You Are

  • You are a hands-on individual who thrives in fast-paced, high-stakes environments where reliability is mission-critical
  • You bring deep experience operating distributed systems at scale, with a strong foundation in cloud-native infrastructure, Kubernetes, and observability
  • You think like a software engineer but focus like an operator, using code to solve operational challenges
  • You’re driven by making systems more resilient, reducing downtime, and building fault-tolerant architectures that scale with user demand
  • You value data-driven decision making, and see SLIs/SLOs, incident postmortems, and continuous improvement as essential tools, not checkboxes
  • You have a strong sense of ownership, thrive under pressure, and believe the best systems are the ones that heal themselves
  • You collaborate closely across teams, care deeply about the end-user experience, and are always looking for better ways to keep complex systems running smoothly


Must-Have Skills

  • 5 years in Site Reliability, DevOps, or Infrastructure Engineering roles
  • Strong software engineering skills in languages such as Python, Go, or Bash
  • Deep expertise operating production systems in AWS, with additional experience in GCP or Azure
  • Proven experience operating and scaling Kubernetes (EKS) in production
  • Experience implementing GitOps with FluxCD (or similar)
  • Hands-on experience implementing observability, auto-scaling, and self-healing systems
  • Strong foundation in networking, load balancing, CDN, and container orchestration
  • Solid knowledge of performance optimization techniques (e.g., profiling, caching, tuning)
  • Strong incident response background, including postmortems and SLO/SLI development, driving improvements through data and analysis
  • Passion for reliability, automation, operational excellence, and building systems patients and clinicians can trust
  • Excellent communication and collaboration skills
  • A methodical approach to problem-solving and system design

Preferred Skills

  • Deep understanding of distributed systems, fault tolerance, and failure modes
  • Experience implementing chaos engineering practices (Gremlin, Chaos Mesh, etc.)
  • Familiarity with multi-region, multi-cloud reliability strategies
  • Experience with service meshes (e.g., Istio) and resilience patterns
  • Solid background in security, compliance, and operational hardening (HIPAA, SOC 2)
  • Experience in capacity planning, scaling, and disaster recovery design

Why Join Us?

At Citizen Health, you will be part of an ambitious mission to build innovative AI-driven solutions that redefine the consumer healthcare experience and improve outcomes for millions. We value curiosity, creativity, and collaboration, and we believe in empowering every team member to have an impact. Here, you’ll find:

  • The chance to work on pioneering AI technologies and solve complex problems that push the boundaries of what’s possible in healthcare
  • The freedom to lead your projects and shape the company’s direction in a mission-driven environment where your work makes a real impact
  • A fast-paced, high-growth setting that brings expanding opportunities and clear paths for career progression
  • A supportive culture of learning, experimentation, and innovation, where new ideas are encouraged and explored
  • A commitment to empathy and diversity, recognizing that our greatest strengths come from our varied perspectives and experiences
  • Regular team activities and knowledge sharing, all within a culture that prioritizes well-being and connection

Additional Perks

  • Competitive salary and equity package
  • Comprehensive health, dental, and vision insurance
  • Unlimited paid time off, including generous parental leave
  • Flexible hybrid work environment

Don’t meet every qualification? No worries! We believe that passion, curiosity, and the right mindset are just as important as a checklist of skills. If you’re excited about what we’re building, we encourage you to apply.


Our Commitment to Diversity & Inclusion: Citizen Health is proud to be an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We welcome applicants of all backgrounds, identities, and experiences. Everyone deserves an equal chance to contribute and grow here, regardless of race, gender identity, sexual orientation, religion, national origin, age, disability, or veteran status.


Compensation Range: $160K – $190K