Assist in defining requirements, designing and building data center technology components, and supporting testing efforts.
Must have skills: Infrastructure Automation
Good to have skills: Python (Programming Language), Work Load Automation Architecture and Design, Automation Architecture
Minimum 12 years of experience is required
Educational Qualification: 15 years full time education
Summary: The Server SRE is responsible for ensuring the reliability, scalability, and performance of server infrastructure. This role combines software development and systems engineering to automate operations, manage incidents, and achieve a noise-free environment. The candidate will own automation development and work closely with infrastructure teams to implement observability and automation solutions.
Must Have Skills
- Strong experience in Linux/Unix server administration at L3 level
- Expert in at least one scripting language, e.g., Python, Bash, or shell scripting
- Hands-on experience with monitoring tools such as Prometheus, Grafana, Nagios
- Ability to analyze the environment, incidents, and problems in the server area, and to develop a roadmap to automate workloads, optimize performance, and eliminate unnecessary work
- Strong CMDB knowledge
- Experience with CI/CD pipelines and DevOps practices
- Familiarity with server performance metrics and observability tools
Good to Have Skills
- Experience with cloud platforms (AWS, Azure, GCP)
- Knowledge of container orchestration (e.g., Kubernetes)
- Familiarity with infrastructure as code tools (e.g., Terraform, Ansible)
- Exposure to incident management frameworks (e.g., ITIL, SRE principles)
Job Requirements
Minimum of 12 years of experience in server administration and reliability engineering. Strong analytical skills and ability to work in a fast-paced environment. Must be able to implement automation and monitoring solutions and analyze incidents to maintain system stability.
Key Responsibilities
- Monitor and maintain server health across environments
- Automate operational tasks and reduce manual interventions
- Implement observability solutions including metrics, logging, and tracing
- Analyze incidents and perform root cause analysis
- Collaborate with teams to improve system reliability and reduce alert noise
- Design scalable server architectures for high availability
- Conduct capacity planning and performance tuning
Technical Experience
Hands-on experience with server monitoring and automation tools. Strong scripting skills and familiarity with observability platforms. Experience in analyzing incidents and implementing solutions to reduce noise and improve reliability.
Professional Attributes
Excellent problem-solving and analytical skills. Strong communication and collaboration abilities. Proactive mindset with a focus on continuous improvement and operational excellence.
Educational Qualification and Certification
Bachelor's Degree in Computer Science, Information Technology, or related field. Certifications in Linux administration, cloud platforms, or SRE practices are a plus.