Sr Observability SRE to enhance the availability and scalability of the current observability stack (Prometheus/OpenTelemetry) for our large fleet management client
Location – Fully Remote
Duration – 6-12 Months
# Of Positions – 1
Observability Contractor Overview
The contractor will be responsible for enhancing the availability and scalability of our current production observability stack, which is based on Prometheus, OpenTelemetry (OTel), and Google BigQuery. The goal is to ensure the stack can handle the current and projected volume of data effectively and reliably, provide capacity planning and stress testing, develop policies and processes for the integration of new metrics, handle feature requests to meet user observability needs, and perform operational management of the stack.
Key Skills
Expertise designing and maintaining a multi-pillared observability platform for logs, metrics, and traces:
  Prometheus-compatible metrics stacks such as VictoriaMetrics, including Alertmanager, exporters, distributed queries, and other interfaces.
  Tracing tooling using the OpenTelemetry standards, such as the OTel Collector, Grafana Tempo, Jaeger, etc.
  Log collection and storage systems such as Fluent Bit, Loki, Elasticsearch, etc.
  Grafana.
Experience designing systems able to ingest terabyte-scale observability data, including distributed storage and querying.
Expertise with Docker and Kubernetes.
Software development experience for cloud native workloads using Python, Go, or similar languages.
Responsibilities
Collaborate with various internal teams including Cloud Automation, IT, SRE and Development to understand the current stack and scope out requirements for future improvements.
Oversee the entire lifecycle of the observability stack, including design, configuration, deployment, and monitoring.
Develop and maintain comprehensive knowledge articles and troubleshooting documentation.
Facilitate knowledge transfer to internal teams to ensure continuity post-contract.
Outcome
The success of the observability stack will be measured by:
  The system’s availability and uptime.
  The stack’s ability to scale in response to increased metric volume.
  Clearly documented capacity planning, including the upper bounds of the system.
  Ease of integration with the observability stack by our various application development teams.
SOW Term: 6-12 Months, beginning in July and ending between January and June