Job Description
Company: TalentXM
About the Job
MLOps Engineer (Contract Role)
Job Title: MLOps Engineer
Contract Duration: 6–12 months (extension possible)
Location: Remote (U.S.-based)
Reports To: DevOps & Service Delivery Leadership

Company Overview
Our client is a rapidly growing technology services company specializing in AI, data, and quality engineering solutions. With a global delivery model spanning North America, the United Kingdom, and an offshore development center in India, our client consistently delivers cutting-edge solutions that surpass expectations. The company has achieved triple-digit year-over-year growth and earned prestigious industry accolades, including making the Inc. 5000 list of fastest-growing companies and being recognized as Best IT Services Company, Best Data Technology Company, and Partner of the Year by Tricentis. They uphold a bold vision to be experts in AI, data, and quality engineering transformations, encouraging innovative thinking and authentic relationships in all endeavors.

Role Overview
Our client is seeking an experienced MLOps Engineer to develop and operationalize a comprehensive AI cost tracking and observability framework across multiple cloud platforms. In this role, you will be instrumental in ensuring visibility into AI/ML model performance, usage, and cost metrics across Azure, Google Cloud, and Snowflake environments. You will collaborate closely with cross-functional teams (DevOps, FinOps, and others) to optimize model deployments for both performance and cost-efficiency.

Key Responsibilities
- Cost & Observability Framework: Build a common AI cost tracking and observability framework spanning Azure ML, Google Vertex AI (Gemini), and Snowflake platforms
- Cloud Billing Integration: Integrate cloud billing and usage APIs (Azure ML, OpenAI, Google Vertex AI/Gemini) to aggregate and monitor AI service costs
- Metadata Tagging: Develop model-level metadata tagging processes for cost attribution and trend analysis, enabling granular visibility into costs per model or project
- Monitoring & Alerting: Implement and manage Datadog dashboards (or similar observability tools) with alerts for model performance issues, including latency spikes, model drift, and anomalies in predictions or usage
- Collaboration for Optimization: Work closely with DevOps and FinOps teams to visualize model costs and identify optimization opportunities (e.g., rightsizing resources, adjusting usage patterns)
- Documentation & Knowledge Transfer: Deliver comprehensive documentation and conduct knowledge transfer sessions to internal teams at project closure, ensuring they can maintain and extend the cost tracking framework

Required Skills & Experience
- MLOps/DevOps Experience: 5 years of hands-on experience in MLOps, DevOps, or Cloud Engineering roles focused on AI/ML systems deployment and operations
- Cloud AI Platforms: Strong experience working with Azure ML, Google Vertex AI (Gemini), and OpenAI platforms/services, including deploying and managing models on these services
- Observability Tools: Expertise in Datadog (or equivalent monitoring/observability tools) for tracking application performance, logs, and metrics
- Programming & Automation: Advanced proficiency in Python and SQL for building automation scripts, data analysis, and integration of monitoring pipelines
- CI/CD & Monitoring Integration: Proven experience integrating cost and performance monitoring steps into CI/CD pipelines, ensuring that model deployments are coupled with automated observability and cost checks
- FinOps & Cost Management: Solid understanding of FinOps principles, cloud billing APIs, and strategies for cloud cost optimization in an engineering context (e.g., optimizing compute/storage for AI workloads)

Preferred Qualifications
- Generative AI Frameworks: Experience with GenAI/Agentic AI frameworks such as LangChain or building RAG (Retrieval-Augmented Generation) pipelines, especially in production environments
- Regulated Environment Experience: Familiarity with implementing cost tracking and ML monitoring in regulated environments (e.g., ensuring compliance with ISO, SOC 2, HITRUST, or similar standards)