Site Reliability Engineer (SRE)
Job Description:
• Serve as the first responder for production incidents during U.S. operating hours (within ±2 hours of EST).
• Lead triage during outages, analyzing logs, metrics, and traces to identify root causes.
• Drive incident postmortems and follow-ups to prevent recurrence.
• Communicate clearly and quickly during incidents to internal stakeholders.
• Own reliability outcomes across all OpenFX systems, with a focus on uptime, latency, and error budgets.
• Enhance observability through logging, metrics, alerting, and dashboards.
• Optimize on-call processes and ensure smooth handoffs across IST, EST, and PST coverage.
• Partner with DevOps and engineering pods to implement fixes or approve production changes.
• Proactively identify systemic reliability risks and propose improvements.
• Contribute automation and tooling to reduce manual incident handling.
• Champion best practices in reliability engineering and operational excellence.
Requirements:
• 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
• Proven experience leading incident response, running postmortems, and communicating during outages.
• Strong background with cloud infrastructure (AWS preferred), container orchestration (Kubernetes, ECS), and Infrastructure-as-Code (Terraform, CloudFormation).
• Familiarity with observability stacks (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
• Ability to triage errors at both the infrastructure and application levels, and to escalate effectively when deeper intervention is required.
• Ownership mindset with strong communication skills in high-pressure situations.
Benefits:
• Competitive salary and benefits package.
• Equity in a rapidly growing company.
• Opportunity to work on mission-critical infrastructure in fintech.
• A collaborative team culture with a bias toward ownership and outcomes.
• The chance to make a direct impact on the resilience of global financial infrastructure.