Senior Cloud Architect, SRE - DGX Cloud: Shaping the Future of Cloud Computing and AI Infrastructure
Join the Ranks of the World's Most Innovative Technology Company
NVIDIA is at the forefront of technological advancements, driving innovations in AI, computing, and beyond. We're seeking a highly skilled and experienced Senior Cloud Architect to join our DGX Cloud Site Reliability Engineering (SRE) team. As a Senior Cloud Architect, SRE - DGX Cloud, you will play a pivotal role in designing, building, and maintaining large-scale production systems that power NVIDIA's GPU cloud services. This is an exceptional opportunity to leverage your technical expertise, creativity, and passion for cloud computing to shape the future of AI infrastructure.
About the Role
The Senior Cloud Architect, SRE - DGX Cloud role is a key position within NVIDIA's SRE team, responsible for ensuring the reliability, efficiency, and scalability of our DGX Cloud solutions. As a Senior Cloud Architect, you will lead the technical architecture for DGX Cloud solutions built on top of cloud service providers such as AWS, GCP, Azure, and OCI. You will work closely with cross-functional teams to design, implement, and support the operational and reliability aspects of large-scale GPU training clusters.
Key Responsibilities
- Lead technical architecture for DGX Cloud solutions built on top of cloud service providers such as AWS, GCP, Azure, and OCI.
- Provide fast and creative solutions for complex problems and write effective, clear, and reliable architecture specifications.
- Design, implement, and support operational and reliability aspects of large-scale GPU training clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
Requirements and Qualifications
To be successful in this role, you should possess a strong technical background with a focus on cloud computing, distributed systems, and site reliability engineering. The ideal candidate will have:
Essential Qualifications
- B.Sc./M.Sc./Ph.D. degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- 8+ years of proven experience in cloud computing, distributed systems, or a related field.
- Experience with infrastructure automation and distributed systems design, and with designing and developing tools for running large-scale private or public cloud systems in production.
- Experience in one or more of the following: Python, Go.
- In-depth knowledge of Linux, networking, and cloud-native technologies.
Preferred Qualifications
- Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Ability to debug and optimize code and automate routine tasks.
- Experience in using or running large private and public cloud systems based on Kubernetes or Slurm.
What We Offer
NVIDIA is committed to providing a comprehensive compensation and benefits package that reflects our employees' skills, experience, and contributions. The base salary range for this role is $220,000 - $419,750 USD. You will also be eligible for equity and benefits. We accept applications on an ongoing basis, so we encourage you to apply as soon as possible.
Our Culture and Work Environment
At NVIDIA, we pride ourselves on fostering a diverse and inclusive work environment that encourages creativity, innovation, and collaboration. Our SRE team is no exception, with a culture that values intellectual curiosity, problem-solving, and openness. We promote self-direction, allowing our engineers to work on meaningful projects while providing the support and mentorship needed to learn and grow.
As a remote team, we offer the flexibility to work from anywhere, at any time, as long as you're committed to delivering exceptional results. We're committed to building a community that is diverse, inclusive, and respectful, where everyone can thrive and grow.
Career Growth and Development
At NVIDIA, we're committed to helping our employees grow and develop their careers. As a Senior Cloud Architect, SRE - DGX Cloud, you will have the opportunity to work on complex, challenging projects that will help you develop your technical skills and expertise. You will also have access to our comprehensive training and development programs, designed to help you stay up-to-date with the latest technologies and trends.
Join Our Team!
If you're a motivated, talented, and experienced Senior Cloud Architect looking to shape the future of cloud computing and AI infrastructure, we want to hear from you! Apply today to join our team and be part of a community that is driving innovation and excellence in the tech industry.
NVIDIA is an equal opportunity employer and welcomes applications from diverse candidates. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.