Description

  • Design and implement scalable infrastructure to support growing systems, ensuring high availability and performance;
  • Enhance system reliability and security by identifying potential risks and proactively implementing solutions;
  • Automate manual processes and repetitive tasks to improve operational efficiency and reduce human error;
  • Lead incident response efforts, including troubleshooting, root cause analysis, and post-incident reviews;
  • Maintain high uptime and availability of services, including monitoring, alerting, and ensuring swift recovery from outages;
  • Optimize system performance by identifying bottlenecks and fine-tuning infrastructure for scalability and efficiency;
  • Drive Continuous Integration and Continuous Deployment (CI/CD) processes to enable rapid, safe, and automated code releases;
  • Implement and manage robust monitoring and alerting systems to proactively detect issues before they impact end users;
  • Collaborate closely with development teams to ensure smooth application deployment and operational excellence;
  • Champion DevOps best practices, fostering a culture of collaboration, automation, and continuous improvement across teams.

Responsibilities

  • Technical Expertise:
  • Programming and Scripting: Strong proficiency in scripting languages (e.g., Python, Bash) for automation and orchestration, and in programming languages (e.g., Go, Java) for building and maintaining systems;
  • Containerization and Orchestration: Experience with Docker, Docker Swarm, and Kubernetes for container management, deployment, and orchestration; knowledge of Helm for managing Kubernetes applications;
  • Cloud Platforms: Familiarity with cloud services (AWS, GCP, Azure) and cloud-native design patterns, including serverless architecture, cloud storage, and network design in cloud environments;
  • Networking Fundamentals: Comprehensive understanding of network protocols (TCP/IP, HTTP/S), load balancing, DNS, VPN, and firewall management to ensure secure, high-performance network operations;


  • Systems Architecture:
  • Infrastructure Design: Deep understanding of scalable, reliable, and cost-effective infrastructure design, including experience with microservices architecture and distributed systems;
  • Operating Systems: Strong expertise in Linux (various distributions) and Windows, with a focus on system performance tuning, security hardening, and troubleshooting;
  • Resilience and High Availability: Experience in designing fault-tolerant systems with high availability configurations (e.g., clustering, replication, failover), ensuring minimal downtime;


  • Networking and Security:
  • Security Best Practices: Understanding of security protocols (SSL/TLS, SSH, VPN) and IAM policies; experience implementing zero-trust architecture and robust access controls;
  • Vulnerability Management: Experience conducting vulnerability assessments, identifying security gaps, and deploying patches or mitigations to strengthen security posture;
  • Network Security: Ability to configure network firewalls, intrusion detection/prevention systems (IDS/IPS), and DDoS protection;


  • Communication:
  • Collaboration: Strong interpersonal skills for collaborating effectively across cross-functional teams, including product, engineering, and leadership;
  • Technical Documentation: Ability to articulate complex technical topics through clear documentation, diagrams, and presentations for both technical and non-technical stakeholders;


  • Problem-solving and Troubleshooting:
  • Incident Management: Analyze and resolve incidents quickly and effectively under pressure; provide insights into root causes and proactive solutions to prevent recurrences;
  • Diagnostic Skills: Advanced diagnostic skills for analyzing logs, metrics, and traces to troubleshoot complex distributed systems and optimize their performance;


  • DevOps Practices:
  • CI/CD Pipelines: Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) to automate testing, integration, and deployment; understanding of blue-green, canary, and rolling deployment strategies;
  • Infrastructure as Code (IaC): Proficiency with IaC tools like Terraform, Ansible, or CloudFormation to automate infrastructure provisioning, configuration, and management;


  • Experience:
  • Managing Large-Scale Systems: Proven track record in managing large-scale distributed systems, with a focus on scalability, reliability, and performance optimization;
  • Infrastructure Automation: Ability to design, implement, and improve infrastructure automation, configuration management, and self-healing systems;
  • Monitoring and Observability: Extensive experience with monitoring, alerting, and logging tools (e.g., Prometheus, VictoriaMetrics, Grafana, ELK Stack, Datadog, Dynatrace, New Relic); ability to define and monitor SLOs and SLAs;
  • Incident Response and RCA: Lead incident response, conduct root cause analysis (RCA), and implement corrective actions to reduce future incidents and increase system resilience;
  • Performance Optimization: Regularly analyze and optimize system and application performance, ensuring efficient resource usage and improved end-user experience;
  • Disaster Recovery and Business Continuity: Develop and execute disaster recovery strategies, including backups, failover procedures, and regular testing to ensure data integrity and business continuity;
  • Security Compliance: Implement and enforce security policies and standards, conduct periodic audits and vulnerability scans, and ensure compliance with industry regulations (e.g., GDPR, HIPAA, PCI-DSS);
  • Documentation and Knowledge Sharing: Create and maintain runbooks, architecture diagrams, and training materials; provide guidance and mentorship on SRE best practices within the organization;


  • End-to-End SDLC Expertise:
  • Full Lifecycle Experience: Expertise across all SDLC phases, including requirements analysis, system design, development, testing, deployment, monitoring, feedback, and continuous optimization.

About this role

Apply Before

December 24, 2024

Job Posted On

November 14, 2024

Job Type

Full-time

Category

Science, Technology, Engineering