- Automating Tasks: Design, maintain, and manage tools that automate operational processes. Write code to automate repetitive tasks, such as provisioning new servers or managing configurations.
- Troubleshooting Outages: When incidents occur, troubleshoot them, identify root causes, and resolve issues promptly.
- On-Call Responsibilities: Participate in on-call rotations to ensure 24/7 availability and rapid response to incidents.
- Monitoring and Observability: Set up monitoring systems, track key metrics, and respond proactively to anomalies.
- Capacity Planning: Analyze system capacity, predict resource needs, and optimize infrastructure.
- Deployment and Release Management: Deploy, automate, manage, configure, and maintain AWS cloud-based production systems.
Process Capabilities:
- Change Management: Oversee how code is deployed, configured, and monitored.
- Availability and Latency: Focus on maintaining high availability and low latency for services.
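
As a rough illustration of how the availability and latency goals above might be quantified, the following minimal sketch computes a request-based availability ratio and a nearest-rank latency percentile. The function names `availability` and `percentile`, and the sample figures, are hypothetical and chosen only for illustration; real services would normally pull these numbers from their monitoring system.

```python
import math


def availability(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests served successfully (0.0 to 1.0)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return (total_requests - failed_requests) / total_requests


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    latencies_ms = [120.0, 80.0, 95.0, 300.0, 110.0]
    print(f"availability: {availability(1000, 1):.3%}")
    print(f"p50 latency:  {percentile(latencies_ms, 50)} ms")
    print(f"p99 latency:  {percentile(latencies_ms, 99)} ms")
```

Tail percentiles such as p99 are often tracked alongside the median because a small share of slow requests can be invisible in an average.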