Job Description
Sr. Site Reliability Engineer
About The Role
You will be the team’s go-to person for infrastructure, monitoring, and production health. You’ll manage Kubernetes-based systems, build and improve observability tooling, and use data to surface problems before they become incidents. When code changes are needed to make systems more observable, you’ll make them yourself.
What You'll Do
Own and improve our monitoring, alerting, and observability systems
Build dashboards and metrics that give the team real insight into production health
Manage Kubernetes infrastructure — resource allocation, diagnostics, and keeping things running well
Query data with SQL to understand system behavior, spot trends, and investigate anomalies
Design alerting that is actionable and sustainable — no fatigue, no noise
Use AI to accelerate incident response and root cause analysis, and find ways to improve observability workflows for the whole team
Ins...