
Senior Site Reliability Engineer

Remote, Poland
EPAM
Partner
Salary negotiable
Full-time • Remote work • IT and telecommunications

Key job details

  • Min. 5 years of experience

  • Server-side: Java / .NET / Node / Python

  • DevOps / Cloud: AWS, Azure, Docker, Kubernetes

  • Full-time

  • Remote work - no travel

Description

We are seeking a Senior Site Reliability Engineer to join our SRE/DevOps organization. The successful candidate will design and implement automated provisioning, deployment, management, and monitoring solutions for a large-scale, rapidly evolving portfolio of SaaS services. Working closely with architecture and development teams, the Senior Engineer will contribute to engineering standards and best practices, drive CI/CD and observability improvements, and support the team in delivering reliable, scalable infrastructure.

Requirements

  • Minimum 7 years of professional experience in software development, DevOps, and/or Site Reliability Engineering

  • Minimum 3 years of experience with building and maintaining CI/CD pipelines and SRE automation within cloud environments at scale

  • Experience with monitoring and alerting platforms (e.g., PagerDuty, Prometheus, Grafana)

  • Hands-on experience with deployment and management of cloud infrastructure

  • Experience with Amazon Web Services (e.g., EC2, Elasticsearch, Lambda, CloudFormation)

  • Working experience with at least one additional cloud provider (GCP, Azure, or OCI)

  • Experience with CI/CD toolchains (Jenkins, Kubernetes, Flux)

  • Proficiency in one or more of the following languages: Python, Go, Java, or C

  • English proficiency at B2 level or higher

Responsibilities

  • Design and implementation of CI/CD pipelines leveraging Kubernetes, Flux, and related cloud-native technologies

  • Implementation and maintenance of monitoring and management solutions for cloud-based products using a combination of commercial off-the-shelf (COTS) and in-house tooling

  • Collaboration with architects and development teams on standardized, scalable approaches for log management, service components, and infrastructure elements

  • Evaluation and integration of AI-enabled tooling across observability, pipeline efficiency, and SRE troubleshooting workflows

  • Development of DevOps tooling that reduces manual toil, strengthens security posture, and minimizes human error

  • Build and maintenance of resilient, self-scaling systems that minimize customer impact while supporting a sustainable operational environment

  • Participation in incident response, root cause analysis, and post-incident review processes

Seniority

  • Senior

Nice to have

  • Experience with application of Generative AI or ML-based tooling within an operations context

  • Experience with DevSecOps practices and security automation

  • Background in architecture of monitoring and management systems for enterprise SaaS products

Keywords / Skills

Site Reliability Engineering
Amazon Web Services
Grafana
Jenkins
Kubernetes
PagerDuty
Prometheus
Flux
Go Language
Google Cloud Platform
Java
Microsoft Azure
This listing was imported from an external portal.