Site Reliability Engineer

Cloud Space LLC

New York, NY, USA

Published: 6/14/2022

Technology

Full Time

Job Description

Job Title: Site Reliability Engineer (SRE)
Location: New Jersey / New York
Tenure: 3 Years

Position Overview:

We are seeking an experienced Site Reliability Engineer (SRE) to join our team supporting Goldman Sachs. This role will focus on developing automation tools, improving operational efficiency, and ensuring infrastructure reliability. Key areas of responsibility include capacity management, SDLC support, observability, and incident management.

Key Responsibilities:

Infrastructure & Capacity Management

· Forecast demand and conduct capacity planning across application infrastructure.

· Continuously optimize resource utilization.

· Maintain production environments and manage Business Continuity Planning (BCP).

· Define acceptable downtime or failure thresholds to ensure high availability and resiliency.

· Build and maintain SRE infrastructure including tools, scripts, and integration with core engineering platforms (e.g., Prometheus, Grafana).

Automation & Process Improvement

· Develop and maintain automation tools to streamline infrastructure management.

· Reduce manual intervention through process automation.

Observability & Monitoring

· Define metrics and thresholds; build frameworks to capture metrics and trends and to generate alerts.

· Monitor alerts generated by the observability framework and coordinate remediation based on pre-agreed schedules.

Incident Management

· Serve as a bridge between support and engineering teams to improve incident response.

· Manage incident follow-ups and post-mortems, and implement corrective actions.

Performance & Reliability

· Work with engineering teams to optimize system performance based on continuous feedback.

· Ensure adherence to Service Level Objectives (SLOs) and Service Level Indicators (SLIs) as defined by application teams.

Qualifications:

· 5+ years of SRE experience for the senior role; 3+ years of SRE experience is acceptable for supporting roles.

· Strong expertise in infrastructure management, observability tools (Prometheus, Grafana), and automation scripting.

· Hands-on experience in capacity planning, performance optimization, and incident response.

· Proven track record in building and maintaining high-availability systems.

· Strong collaboration skills with cross-functional teams.