Your mission
You will join our Infrastructure Operations Team and help Tupl monitor and maintain our big data automation platform within specific customer projects. Tupl is currently focused on building use cases for the telecom segment, so telecom experience is a plus. Your core duty is the initial response to and resolution of infrastructure-related issues, with responsibilities centered on monitoring, basic troubleshooting, and escalation.
1. Monitoring and Incident Detection
- Continuous Monitoring: Use monitoring tools, like Prometheus, to track system performance, network health, and server availability.
- Alert Response: Acknowledge and respond to alerts for servers, storage, network devices, or cloud services.
- Log Reviews: Check system logs to identify anomalies and document findings.
- System Reboots: Restart services, servers, or systems as required.
- Password Resets: Assist users with account and access issues.
- Disk Management: Monitor disk usage and clear space where applicable.
- Service Restarts: Restart applications or services based on documented procedures.
- Ticket Handling: Manage and respond to tickets in Jira/Opsgenie.
- User Requests: Fulfill simple requests, such as provisioning a user or granting permissions.
- Incident Documentation: Maintain logs of incidents, resolutions, and steps taken.
- Daily Reports: Provide regular reports on system health, resolved incidents, and escalations.
- Determine Severity: Assess incidents to decide if they need to be escalated to L2 or L3 teams.
- Follow Escalation Path: Work closely with higher-level teams and provide all relevant information for effective troubleshooting.
- Scheduled Checks: Perform routine checks on backups, disk health, and CPU/memory utilization.
- Patching: Maintain and patch infrastructure for different clients and for our own datacenter.
- Cloud Monitoring: Track uptime and performance in AWS, Azure, or GCP environments.
- Virtualization Support: Monitor hypervisors (e.g., VMware, Hyper-V) for performance issues.
- Networking: Respond to basic network outages or connectivity problems.
- Status Updates: Provide regular updates to stakeholders or clients on the status of incidents.
- Handovers: Ensure smooth transition of responsibilities between shifts with clear documentation.