MLOps Lead, Central Technology
About the position
Responsibilities
• Provide technical leadership for a team of MLOps Engineers, managing the team in operating AI training and inference systems.
• Drive the application of MLOps and DevOps principles across multiple platforms, ensuring peak operational efficiency.
• Define an end-to-end metrics program, including full proactive monitoring and alerting systems, for the MLOps team.
• Facilitate model training through collaboration with AI Researchers to ensure best practices in machine learning and deep learning.
• Optimize the Kubernetes-based AI lifecycle platform through infrastructure-as-code (IaC) practices and integrate it with on-prem HPC systems.
• Collaborate with the Data Infrastructure Engineering team and Science data teams on data systems for AI model training.
• Lead the MLOps team in supporting the on-call rotation, with a focus on automation and proactive alerting.
Requirements
• BS, MS, or PhD degree in Computer Science or a related technical discipline or equivalent experience.
• 7+ years of relevant coding and systems experience.
• 5+ years of systems architecture and design experience, with a broad range of MLOps experience.
• Proven technical leadership experience in SRE and MLOps.
• Strong experience scaling containerized applications on Kubernetes or Mesos.
• Cloud Platform proficiency with AWS, GCP, or Microsoft Azure.
• MLOps experience working with medium- to large-scale GPU clusters in Kubernetes.
• Working knowledge of NVIDIA CUDA and AI/ML custom libraries.
• Knowledge of Linux systems optimization and administration.
• Solid coding experience with a systems language such as Rust, C/C++, C#, Go, Java, or Scala.
• Expertise with a scripting language such as Python, PHP, or Ruby.
• Experience integrating data with the AI lifecycle.
• AI/ML platform operations experience in an environment with demanding data and systems platform challenges.
• Large-scale streaming data systems integration experience.
• Experience with Hadoop, Spark, and/or Kafka deployments.
• Experience with workflow scheduling tools such as Apache Airflow, Dagster, or Apache Beam.
• Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms.
Nice-to-haves
• Experience with PyTorch, Keras, or TensorFlow.
• Experience with HPC and Slurm.
Benefits
• Generous employer match on employee 401(k) contributions.
• Annual benefit for employees that can be used for housing, student loan repayment, childcare, commuter costs, or other life needs.
• CZI Life of Service Gifts awarded to employees to support causes closest to them.
• Paid time off to volunteer at an organization of your choice.
• Funding for select family-forming benefits.
• Relocation support for employees moving to the Bay Area.