Job Description
Company: Citizen Health
About the Job
Who We Are
Citizen Health was founded on the belief, shaped by firsthand lived experiences navigating healthcare, that having the right advocate is the single most important factor in achieving better care and outcomes. By uniquely combining data, AI, and community, Citizen is building a personalized AI advocate powered by patients’ complete medical histories and data from thousands of other patients to generate personalized insights for clinical decisions and day-to-day challenges. Starting in rare and complex conditions, patients share their data in exchange for value, enabling biopharma and researchers with seamless access to patients and regulatory-grade data, shaving years off drug development for much-needed treatments.
Citizen is founded by experienced entrepreneurs with multiple successes under their belts and is funded by top-tier investors including 8VC, Transformation Capital, and Headline Ventures, among others. We are a mission-driven team excited to be building the future of consumer healthcare.
The Role
Citizen Health is seeking a Senior Site Reliability Engineer (SRE) to ensure the resilience, performance, and availability of our AI-powered, patient-centric healthcare platform.
In this hands-on, high-impact role, you will apply software engineering principles to operational challenges, designing and maintaining reliable systems that scale, fail gracefully, and recover quickly. You’ll work cross-functionally to establish SLOs/SLIs, implement robust observability, establish a metrics-driven approach to service performance, and drive improvements in incident response, fault tolerance, and service reliability.
If you’re passionate about building systems that stay up, scale well, and recover fast, and you thrive on solving reliability challenges in modern cloud-native environments, we’d love to talk to you.
Responsibilities
Reliability Engineering & Observability
- Define and measure service reliability through SLIs, SLOs, and error budgets
- Implement and operate observability tooling (e.g., NewRelic, Prometheus) across cloud and Kubernetes environments
- Analyze logs, traces, and metrics to surface actionable insights and improve system health
- Perform capacity planning, load testing, and performance profiling and tuning to support scale and reliability, and to optimize system performance
Resilience & Automation
- Design and maintain resilient, self-healing infrastructure in AWS and Kubernetes (EKS)
- Conduct chaos engineering experiments, failure mode analysis, and disaster recovery drills to proactively identify and fix weaknesses
- Build automation to reduce toil, improve reliability metrics (latency, uptime, error rates, MTTD, and MTTR), and prevent recurrence of incidents
- Engineer infrastructure for fault tolerance, auto-scaling, and graceful degradation
Incident Response & Operations
- Drive incident response efforts, manage on-call rotations, and coordinate resolution of production outages
- Conduct root cause analyses and blameless postmortems to drive learning and resilience
- Continuously improve key reliability metrics such as latency, uptime, error rates, and availability
- Collaborate with security, platform, and DevOps teams to ensure high-availability and production-readiness of services
Cross-Team Collaboration & Culture
- Collaborate with engineering teams during design and architecture phases to assess and mitigate reliability risks
- Support progressive delivery strategies including feature flags and canary deployments
- Champion SRE principles and practices, helping build a culture of resilience and shared ownership
- Stay current with emerging practices in cloud reliability, observability, and SRE tooling
Who You Are
- You are a hands-on individual who thrives in fast-paced, high-stakes environments where reliability is mission-critical
- You bring deep experience operating distributed systems at scale, with a strong foundation in cloud-native infrastructure, Kubernetes, and observability
- You think like a software engineer but focus like an operator, using code to solve operational challenges
- You’re driven by making systems more resilient, reducing downtime, and building fault-tolerant architectures that scale with user demand
- You value data-driven decision making, and see SLIs/SLOs, incident postmortems, and continuous improvement as essential tools, not checkboxes
- You have a strong sense of ownership, thrive under pressure, and believe the best systems are the ones that heal themselves
- You collaborate closely across teams, care deeply about the end-user experience, and are always looking for better ways to keep complex systems running smoothly
Must-Have Skills
- 5 years in Site Reliability, DevOps, or Infrastructure Engineering roles
- Strong software engineering skills in languages such as Python, Go, or Bash
- Deep expertise operating production systems in AWS, with additional experience in GCP or Azure
- Proven experience operating and scaling Kubernetes (EKS) in production
- Experience implementing GitOps with FluxCD (or similar)
- Hands-on experience implementing observability, auto-scaling, and self-healing systems
- Strong foundation in networking, load balancing, CDN, and container orchestration
- Solid knowledge of performance optimization techniques (e.g., profiling, caching, tuning)
- Strong incident response background, including postmortems and SLO/SLI development, driving improvements through data and analysis
- Passion for reliability, automation, operational excellence, and building systems patients and clinicians can trust
- Excellent communication and collaboration skills
- A methodical approach to problem-solving and system design
Preferred Skills
- Deep understanding of distributed systems, fault tolerance, and failure modes
- Experience implementing chaos engineering practices (Gremlin, Chaos Mesh, etc.)
- Familiarity with multi-region, multi-cloud reliability strategies
- Experience with service meshes (e.g., Istio) and resilience patterns
- Solid background in security, compliance, and operational hardening (HIPAA, SOC 2)
- Experience in capacity planning, scaling, and disaster recovery design
Why Join Us?
At Citizen Health, you will be part of an ambitious mission to build innovative AI-driven solutions that redefine the consumer healthcare experience and improve outcomes for millions. We value curiosity, creativity, and collaboration, and we believe in empowering every team member to have an impact. Here, you’ll find:
- The chance to work on pioneering AI technologies and solve complex problems that push the boundaries of what’s possible in healthcare
- The freedom to lead your projects and shape the company’s direction in a mission-driven environment where your work makes a real impact
- A fast-paced, high-growth setting that brings expanding opportunities and clear paths for career progression
- A supportive culture of learning, experimentation, and innovation, where new ideas are encouraged and explored
- A commitment to empathy and diversity, recognizing that our greatest strengths come from our varied perspectives and experiences
- Regular team activities and knowledge sharing, all within a culture that prioritizes well-being and connection
Additional Perks
- Competitive salary and equity package
- Comprehensive health, dental, and vision insurance
- Unlimited paid time off, including a generous parental leave
- Flexible hybrid work environment
Don’t meet every qualification? No worries! We believe that passion, curiosity, and the right mindset are just as important as a checklist of skills. If you’re excited about what we’re building, we encourage you to apply.
Our Commitment to Diversity & Inclusion: Citizen Health is proud to be an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We welcome applicants of all backgrounds, identities, and experiences. Everyone deserves an equal chance to contribute and grow here, regardless of race, gender identity, sexual orientation, religion, national origin, age, disability, or veteran status.
Compensation Range: $160K – $190K