to support and enhance our infrastructure, applications, and deployment pipelines. This role is crucial in maintaining system reliability, ensuring platform stability, and optimizing our cloud-native environments. You will collaborate closely with cross-functional teams to monitor, automate, and improve the efficiency and performance of our systems.
Monitor infrastructure, cloud services, CI/CD pipelines, and application performance using tools like Prometheus, Grafana, ELK Stack, CloudWatch, or Datadog.
Respond to incidents and alerts promptly to minimize downtime and maintain business continuity.
Perform detailed root cause analysis (RCA) and maintain clear incident documentation.
Ensure SLA adherence and timely escalation to L2/L3 teams when required.
Track and report on system health, performance metrics, and incident trends.
Automation & Reliability Engineering
Develop and maintain automation scripts for deployment, monitoring, and maintenance using Bash, Python, or Ansible.
Implement Infrastructure as Code (IaC) practices using Terraform, CloudFormation, or Ansible for consistent deployments.
Continuously improve system reliability, scalability, and observability through proactive optimization.
Implement automated remediation for common issues to reduce manual intervention.
Cloud & Infrastructure Management
Manage and support production systems on AWS, Azure, or GCP cloud platforms.
Handle routine operational tasks including resource scaling, patching, backup management, and log analysis.
Troubleshoot complex infrastructure, networking, and application-level issues across distributed environments.
Support and maintain Kubernetes clusters and Docker-based containerized environments.
Manage networking components including load balancers, DNS configurations, and SSL/TLS certificates.
CI/CD & Application Deployment Support
Maintain, troubleshoot, and optimize CI/CD pipelines using Jenkins, GitLab CI, ArgoCD, or GitHub Actions.
Support seamless deployments across development, staging, and production environments.
Collaborate with development teams to ensure smooth delivery cycles and rapid feedback loops.
Implement deployment best practices and maintain deployment documentation.
Required Qualifications:
3-7 years of hands-on experience in DevOps, Site Reliability Engineering (SRE), or Cloud Operations roles.
Strong proficiency in Linux/Unix system administration and command-line operations.
Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
Solid understanding of CI/CD principles, Git version control, and automation frameworks.
Experience with Kubernetes and Docker for container orchestration and management.
Proficiency with monitoring and observability tools (Prometheus, Grafana, ELK Stack, CloudWatch, Datadog, etc.).
Knowledge of networking fundamentals including load balancers, DNS, firewalls, and SSL/TLS.
Strong scripting skills in Python, Bash (or another Unix shell), or PowerShell.
Experience with incident response procedures and problem management.
Excellent communication and collaboration skills.
Strong analytical and problem-solving mindset with attention to detail.
Preferred Qualifications:
Professional certifications such as AWS Certified SysOps Administrator, Certified Kubernetes Administrator (CKA), Azure DevOps Engineer, or similar credentials.
Experience with production-grade systems such as Kafka, Redis, PostgreSQL, MongoDB, NGINX, or Apache.
Familiarity with ITIL processes and incident management frameworks.
Exposure to regulated or enterprise environments (e.g., manufacturing, IoT, finance, or healthcare).
Experience with security tools and practices including vulnerability scanning, secrets management, and compliance monitoring.
Knowledge of GitOps practices and declarative infrastructure management.
What We Offer:
Opportunity to work with cutting-edge cloud technologies and modern DevOps practices.
Exposure to large-scale, mission-critical infrastructure.
Health insurance coverage.
Performance-based bonuses.
Professional development opportunities and certification support.
Collaborative and supportive team environment.
Work Model:
Work Hours: Day shift
Location: On-site at Baner, Pune, Maharashtra.
Work Mode: Reliable commute or relocation to Pune required before joining.
Preference: Immediate joiners will be given priority.
Key Performance Indicators:
System uptime and SLA compliance.
Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR).
Rate of incidents resolved without escalation.
Deployment success rate and pipeline stability.
Automation coverage and operational efficiency improvements.
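As an illustration only, the MTTA and MTTR indicators above could be computed from incident timestamps along the following lines. This is a minimal sketch: the record fields (`opened`, `acknowledged`, `resolved`) and the sample data are hypothetical, and it assumes MTTR is measured from the moment the incident opens. Some teams measure it from acknowledgment instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (field names and values are illustrative only).
incidents = [
    {
        "opened": datetime(2024, 1, 1, 10, 0),
        "acknowledged": datetime(2024, 1, 1, 10, 4),
        "resolved": datetime(2024, 1, 1, 10, 50),
    },
    {
        "opened": datetime(2024, 1, 2, 14, 0),
        "acknowledged": datetime(2024, 1, 2, 14, 6),
        "resolved": datetime(2024, 1, 2, 15, 10),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# MTTA: average time from incident open to acknowledgment.
mtta = mean_minutes(incidents, "opened", "acknowledged")
# MTTR: average time from incident open to resolution (assumed definition).
mttr = mean_minutes(incidents, "opened", "resolved")

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

In practice these timestamps would come from the incident management or alerting tool in use rather than a hard-coded list.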
Job Details:
Job Type: Full-time, Permanent
Location: Baner, Pune, Maharashtra
Schedule: Day shift
Benefits:
Health insurance
Paid sick time
Paid time off
Ability to commute/relocate:
Baner, Pune, Maharashtra: Reliably commute or plan to relocate before starting work (Required)
Application Question(s):
Are you available to join immediately? If not, please mention your notice period.
Experience:
DevOps: 3 years (Required)
Work Location: In person