6 to 8 Years of Relevant Experience
- Participate in the on-call rotation supporting Digital platform services and solutions.
- Improve the reliability of our systems and processes with a keen focus on ensuring built-in quality.
- Partner with Product, Architects, and Engineering to help define and measure KPIs, SLIs, and SLOs.
- Strive to reduce toil through automation initiatives.
- Create and maintain operational documentation.
- Run the daily handover-takeover (HoTo) call with SRE team members.
- Attend the daily connect call with the SRE Manager and SRE Lead.
- Attend the daily connect call with the customer, SRE Lead, and SRE Shift Lead to review day-to-day progress.
- Ensure SLAs are met for Incidents and Service Requests.
- Inform the SRE Lead, SRE Manager, and onsite stakeholders of any priority tickets or escalated issues.
- Manage a team of SREs to proactively ensure the stability, resilience, and scale of our services through automation, testing, and engineering. Simplify and automate highly complex manual processes.
- Provide coaching and mentoring to the SRE team to improve their skill sets.
- Perform daily alert analysis and create SOPs for escalated alerts.
- Perform incident post-mortems to analyze system failures, identify areas for improvement, and minimize downtime and disruptions.
- Ensure adequate staffing across all shifts to meet offshore deliverables.
Job Overview:
- We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team.
- The ideal candidate will have a strong background in Java, with a deep understanding of performance monitoring and observability tools such as AppDynamics and Splunk.
- This role also requires hands-on experience with incident management tools like ServiceNow, proficiency in SQL, and a solid understanding of automation frameworks.
- The SRE will be responsible for improving system reliability, automating manual tasks, and ensuring seamless system performance.
Key Responsibilities:
- System Reliability & Monitoring: Ensure the reliability, availability, and performance of critical applications using monitoring tools like AppDynamics and Splunk.
- Incident Management: Collaborate with cross-functional teams to manage and resolve incidents effectively using ServiceNow.
- Automation: Identify manual processes and develop automation scripts/tools to streamline workflows and improve system efficiency.
- Performance Tuning & Optimization: Analyse and optimise Java applications and databases for enhanced performance.
- Root Cause Analysis: Perform deep-dive analysis into system outages and failures, identify root causes, and implement preventive measures.
- Collaboration with Development Teams: Work closely with developers to ensure the design and deployment of resilient systems.
- SQL & Database Management: Write complex SQL queries to investigate system issues, optimise performance, and ensure data consistency.
- Capacity Planning & Scaling: Plan for future growth by managing system capacity and ensuring scalability.
- Documentation: Maintain thorough documentation of system architecture, incident handling, and operational runbooks.
Key Skills and Qualifications:
- Programming: Proficiency in Java and experience with object-oriented programming and debugging.
- Monitoring Tools: Strong hands-on experience with AppDynamics, Splunk, and other monitoring/observability platforms.
- Incident Management: Experience working with ServiceNow for incident tracking, reporting, and resolution.
- Automation: Proficiency in scripting languages (e.g., Python, Shell) for automation of repetitive tasks.
- Database Management: Strong knowledge of SQL and experience in database query optimization and troubleshooting.
- DevOps Tools: Familiarity with CI/CD tools, containerization (Docker/Kubernetes), and infrastructure-as-code practices.
- Analytical Skills: Ability to analyse and interpret complex data sets, with strong problem-solving skills.
- Communication: Excellent communication skills to work across teams and resolve critical issues under pressure.
Preferred Qualifications:
- Experience with Cloud Platforms: AWS, Azure, or Google Cloud.
- Infrastructure Automation: Experience with tools such as Ansible, Terraform, or Puppet.
- Experience in Distributed Systems: Understanding of microservices architecture and distributed systems reliability.