Senior DevOps Engineer (HPC)
Remote, Poland | EPAM
Salary to be agreed
Requirements
3+ years of experience with DevOps processes and automation using Infrastructure as Code tools such as Terraform
Hands-on experience operating or engineering large-scale HPC or similar computing environments
Proven expertise in Linux system administration including TCP/IP networking and storage subsystems
Experience administering large-scale cluster management software such as Slurm, LSF, or Grid Engine
Knowledge of configuration management tools like Ansible, Salt, or Puppet
Experience working in agile DevOps teams
Ability to develop and maintain monitoring solutions using tools such as Grafana and Prometheus
Experience with scripting languages such as Bash and Python for automation and tool development
Strong experience managing virtualized private cloud environments like OpenStack
Scientific degree or equivalent experience in computationally intensive scientific data analysis
Proven ability to manage relationships with third-party suppliers
Upper-intermediate proficiency in English (B2+)
Responsibilities
Design, implement, and maintain robust platform infrastructure using Infrastructure as Code tools such as Terraform
Develop, deliver, and operate research computing services and applications
Apply Site Reliability Engineering principles to manage HPC service deployment, monitoring, and incident response
Solve complex technical problems related to HPC services and user applications
Manage large-scale HPC, HTC, or BC computing environments for optimal performance
Collaborate with scientific users to tailor HPC resources to research needs
Automate deployment processes to ensure consistency across HPC infrastructure
Maintain and administer large-scale cluster and server computing software such as Slurm, LSF, or Grid Engine
Develop and maintain monitoring dashboards using tools such as Grafana and Prometheus
Work within a DevOps team environment following agile methodologies
Operate and utilize virtualized private cloud resources such as OpenStack
Administer large-scale parallel filesystems including Weka, GPFS, or Lustre
Use configuration management tools like Ansible, Salt, or Puppet to manage IT operations
Develop scripts and tools for HPC and DevOps platform operations using Bash and Python
Seniority
Senior
Nice to have
Experience with container technologies such as LXD, Singularity, Docker, or Kubernetes
Operation and configuration experience with public cloud platforms like AWS, Azure, or GCP
Experience with HashiCorp tools such as Vault, Consul, and Nomad
Development experience with programming languages such as Java, C++, Python, Ruby, or Perl
Experience with parallel filesystems like Weka, GPFS, or Lustre
Description
We are seeking a Senior DevOps Engineer to enhance our high-performance computing services and collaborate closely with the scientific community to optimize research computing. Join our team to build and operate cutting-edge HPC capabilities using automation and infrastructure-as-code. Apply now to contribute to innovative computational solutions in a dynamic environment.