Machine Learning Engineering Manager – LLM Serving, Infrastructure
• Lead a high-performing engineering team to design, build, and deploy a high-scale, low-latency LLM Serving Infrastructure.
• Drive the implementation of a unified serving layer that supports multiple LLM models and inference types: batch, offline evaluation flows, and real-time/streaming (a hypothetical interface sketch follows this list).
• Lead development of the Model Registry for deploying, versioning, and running LLMs across production environments.
• Ensure successful integration with the core Personalization and Recommendation systems to deliver LLM-powered features.
• Define and champion standardized technical interfaces and protocols for efficient model deployment and scaling.
• Establish and monitor performance, cost, and reliability targets for the serving infrastructure, covering load balancing, autoscaling, and failure recovery.
• Collaborate closely with data science, machine learning research, and feature teams (Autoplay, Home, Search, etc.) to drive the active adoption of the serving infrastructure.
• Scale up the serving architecture to handle hundreds of millions of users and high-volume inference requests for internal domain-specific LLMs.
• Drive Latency and Cost Optimization: partner with SRE and ML teams to implement techniques such as quantization, pruning, and efficient batching to minimize serving latency and cloud compute costs (see the batching sketch after this list).
• Develop Observability and Monitoring: build dashboards and alerting for service health, tracing, A/B test traffic, and latency trends to ensure adherence to defined SLAs (see the SLA-check sketch after this list).
• Contribute to Core LPM Serving: focus on the technical strategy for deploying and maintaining the core Large Personalization Model (LPM).
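To make the unified serving layer concrete, below is a minimal Python sketch of what a standardized interface covering batch and real-time/streaming inference might look like. All names (InferenceRequest, ServingBackend, and so on) are illustrative assumptions for this posting, not an actual internal API.

    # Hypothetical sketch of a unified serving interface spanning batch,
    # offline-eval, and real-time/streaming inference. All names are
    # illustrative assumptions, not a real internal API.
    from abc import ABC, abstractmethod
    from dataclasses import dataclass
    from typing import AsyncIterator

    @dataclass
    class InferenceRequest:
        model_id: str        # resolved against the Model Registry
        model_version: str
        prompt: str

    @dataclass
    class InferenceResponse:
        text: str
        latency_ms: float

    class ServingBackend(ABC):
        @abstractmethod
        def infer_batch(self, requests: list[InferenceRequest]) -> list[InferenceResponse]:
            """Synchronous batched path, e.g. for offline evaluation flows."""

        @abstractmethod
        async def infer_stream(self, request: InferenceRequest) -> AsyncIterator[str]:
            """Token-by-token streaming path for real-time callers."""

A single interface like this lets feature teams swap backends (different models, hardware, or inference engines) without changing caller code, which is the point of the standardized protocols described above.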
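For the efficient batching mentioned under latency and cost optimization, here is a sketch of server-side dynamic batching: incoming requests are buffered briefly, then executed as one batched forward pass. The queue protocol, batch size, and wait threshold are assumptions chosen for illustration.

    # Sketch of dynamic batching. Each queue item is a (request, future)
    # pair; callers await the future. Batch size and wait time are
    # illustrative assumptions.
    import asyncio

    MAX_BATCH_SIZE = 32
    MAX_WAIT_MS = 10   # a small wait trades a little latency for better GPU utilization

    async def batching_loop(queue: asyncio.Queue, run_model_batch):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await queue.get()]                 # block until work arrives
            deadline = loop.time() + MAX_WAIT_MS / 1000
            while len(batch) < MAX_BATCH_SIZE:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = run_model_batch([req for req, _ in batch])   # one batched forward pass
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)                  # unblock each waiting caller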
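And for SLA monitoring, a small sketch of a p99 latency check over a sliding window, the kind of rule an alerting pipeline might evaluate. The 500 ms threshold and window size are illustrative assumptions, not defined SLAs.

    # Sketch of a p99 latency SLA check over a sliding window.
    # Threshold and window size are illustrative assumptions.
    from collections import deque

    SLA_P99_MS = 500       # assumed SLA threshold
    WINDOW = 10_000        # most recent requests to consider

    latencies = deque(maxlen=WINDOW)

    def record(latency_ms: float) -> None:
        latencies.append(latency_ms)

    def p99_violated() -> bool:
        if len(latencies) < 100:
            return False   # too few samples to judge
        ranked = sorted(latencies)
        p99 = ranked[int(0.99 * (len(ranked) - 1))]
        return p99 > SLA_P99_MS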
Apply to this job