Senior Site Reliability Engineer

Remote, Poland

EPAM

Posted 6 days ago

Salary negotiable

Full-time • Remote work • IT and telecommunications

Key job details

Min. 5 years of experience
Backend: Java / .NET / Node / Python
DevOps / Cloud: AWS, Azure, Docker, Kubernetes
Full-time
Remote work - no travel

Description

We are seeking a Senior Site Reliability Engineer to join our SRE/DevOps organization. The successful candidate will design and implement automated provisioning, deployment, management, and monitoring solutions for a large-scale, rapidly evolving portfolio of SaaS services. Working closely with architecture and development teams, the Senior Engineer will contribute to engineering standards and best practices, drive CI/CD and observability improvements, and support the team in delivering reliable, scalable infrastructure.

Requirements

Minimum 7 years of professional experience in software development, DevOps, and/or Site Reliability Engineering
Minimum 3 years of experience with building and maintaining CI/CD pipelines and SRE automation within cloud environments at scale
Experience with monitoring and alerting platforms (e.g., PagerDuty, Prometheus, Grafana)
Hands-on experience with deployment and management of cloud infrastructure
Experience with Amazon Web Services (e.g., EC2, Elasticsearch, Lambda, CloudFormation)
Working experience with at least one additional cloud provider (GCP, Azure, or OCI)
Experience with CI/CD toolchains (Jenkins, Kubernetes, Flux)
Proficiency in one or more of the following languages: Python, Go, Java, or C
English proficiency at B2 level or higher

Responsibilities

Design and implementation of CI/CD pipelines leveraging Kubernetes, Flux, and related cloud-native technologies
Implementation and maintenance of monitoring and management solutions for cloud-based products using a combination of commercial off-the-shelf (COTS) and in-house tooling
Collaboration with architects and development teams on standardized, scalable approaches for log management, service components, and infrastructure elements
Evaluation and integration of AI-enabled tooling across observability, pipeline efficiency, and SRE troubleshooting workflows
Development of DevOps tooling that reduces manual toil, strengthens security posture, and minimizes human error
Build and maintenance of resilient, self-scaling systems that minimize customer impact while supporting a sustainable operational environment
Participation in incident response, root cause analysis, and post-incident review processes

Seniority

Senior

Nice to have

Experience with application of Generative AI or ML-based tooling within an operations context
Experience with DevSecOps practices and security automation
Background in architecture of monitoring and management systems for enterprise SaaS products

Keywords / Skills

Site Reliability Engineering

Amazon Web Services

Grafana

Jenkins

Kubernetes

PagerDuty

Prometheus

Flux

Go Language

Google Cloud Platform

Java

Microsoft Azure

This listing was imported from an external portal.