Manage customer issues related to the installation, configuration, and implementation of products in a timely manner, communicating clearly and effectively and setting appropriate expectations with clients.
Automate repetitive tasks to improve operational efficiency and reduce manual intervention.
Provide primary operational support and engineering for large-scale distributed software applications.
Monitor and analyze system performance to keep applications responsive and scalable.
Respond to incidents, perform root cause analysis, and implement preventive measures.
Implement and maintain a comprehensive monitoring and alerting system to ensure early detection of anomalies and issues.
Design, build, and manage deployment pipelines to facilitate seamless and reliable application releases.
Conduct regular performance testing and capacity planning to identify and address bottlenecks in the infrastructure.
Participate in on-call rotation and handle production incidents as necessary.
Represent customers effectively to the Product Management and Engineering teams by writing detailed, actionable defect reports and enhancement requests in Jira.
Skills and Experience:
Proven experience as a Site Reliability Engineer, or in a similar role, in a large-scale production environment.
Strong expertise in scripting and automation using languages such as Python or Bash.
Strong Linux skills, including command-line tools, shell scripting, and system diagnostics.
Proficiency with cloud platforms (e.g., AWS, Azure, GCP) and container technologies (Docker, Kubernetes).
Excellent customer service skills, empathy, and a sense of urgency.
Deep understanding of monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Knowledge of networking, security, and system administration.