Lead Site Reliability Engineer, Observability - Remote
About the position
Responsibilities
• Design, deploy and scale our Prometheus architecture to handle 100+ million active series and beyond.
• Deploy and operate large, high-performance Elasticsearch clusters holding 2,000+ TB of data.
• Deploy and grow high-throughput data pipelines built on Kafka, handling hundreds of thousands of events per second.
• Design and build an alerting system that allows engineering teams to construct alerts from multiple data sources and alerting workflows.
• Write libraries and APIs that give engineers self-service access to our monitoring, logging, and other observability systems.
• Use Terraform to deploy public and private cloud infrastructure.
Requirements
• 5+ years of experience designing, deploying and operating mid- to large-sized distributed systems on VMs or bare metal machines running Linux (we run Debian and Ubuntu).
• 2+ years of experience developing with languages like Ruby, Python, Go, Scala, or Bash.
• Excited by the challenge of solving difficult problems in large distributed systems that deal with huge amounts of data.
• Desire to work on a highly autonomous team that cares deeply about quality and customer experience.
• Curious, quick to learn, and comfortable diving into unfamiliar code and systems to solve problems.
• An understanding of the value of observability and the ability to work with other teams to help them better monitor their services.
• Willing to be part of a production on-call rotation.
• Direct experience with technologies such as the Elasticsearch, Logstash, Kibana (ELK) stack, Kafka, Prometheus/Thanos/Cortex, Graphite, Ansible, Terraform, and Consul.
• Strong experience in building out solutions based on software engineering best practices.
Benefits
• Quality medical, dental and vision insurance.
• 401(k) plan with a Cisco matching contribution.
• Short and long-term disability coverage.
• Basic life insurance.
• Numerous wellbeing offerings.
• Up to twelve paid holidays per calendar year, including one floating holiday.
• Paid time off for birthdays.
• Vacation time off policy with flexible limits for exempt employees.
• Sick time off policy with 80 hours provided on hire date and annually thereafter.
• Paid time to volunteer and give back to the community.