We are looking for a proactive and detail-oriented Sr. Site Reliability Engineer (SRE) to ensure the reliability, performance, and availability of our applications. The role involves monitoring production systems, troubleshooting issues, and collaborating with cross-functional teams to drive faster resolution and continuous improvement. You will play a key role in maintaining system stability and enhancing observability across our microservices-based platform.

Sr. Site Reliability Engineer (SRE)

Skills & Experience

Strong understanding of Linux/Unix systems for application support
Hands-on experience in troubleshooting applications in staging and production environments
Ability to monitor system performance and identify root causes using logs and metrics
Experience working with Kubernetes and microservices-based architectures
Proficiency in observability and monitoring tools such as Grafana, Loki, and ELK (Elasticsearch, Logstash, Kibana)
Familiarity with CI/CD and GitOps practices and related tools (e.g., Jenkins)
Experience in API testing and validation using tools like Postman and Swagger/OpenAPI
Hands-on experience with PostgreSQL and MongoDB for troubleshooting and ad-hoc reporting
Experience with ticketing and documentation tools such as Jira and Confluence
Minimum 4 years of experience in application support or reliability engineering
Bachelor’s degree in Computer Science, Information Technology, or a related field
Relevant certifications (Cloud, Kubernetes, Microservices) are a plus

Key Responsibilities

Handle MFS application issues by investigating, troubleshooting, and escalating to engineering teams when needed
Perform initial root cause analysis (RCA) and support resolution of recurring or moderately complex issues
Ensure timely incident resolution in line with SLAs, including proper documentation of fixes and workarounds
Identify and analyze system bottlenecks, and assist in deploying fixes via change management processes
Collaborate with cross-functional teams (Development, SRE/DevOps, QA, Business) to resolve incidents and improve systems
Use observability tools (Grafana, Loki, ELK) to monitor system health, availability, performance, and resiliency
Participate in incident/severity calls, ensuring clear communication and coordination
Develop and maintain knowledge bases, SOPs, and runbooks for standardized operations and troubleshooting

Work Schedule

Willingness to work in a 24x7 environment, including weekends and on-call rotations