Senior HPC AI Cluster Engineer

3 weeks ago


Spain | NVIDIA | Full-time

NVIDIA is looking for an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. We are focused on building supercomputers and HPC clusters based on groundbreaking technologies. We are looking for an outstanding architect for a senior HPC role who will be a key player in working with the most exciting computing hardware and software and will contribute to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers, to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.
What you will be doing:

Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
Manage Linux job/workload schedulers and orchestration tools.
Develop and maintain continuous integration and delivery pipelines.
Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
Deploy monitoring solutions for the servers, network, and storage.
Troubleshoot across the bare-metal, operating system, software stack, and application levels.
As a technical resource, develop, refine, and document standard methodologies to share with internal teams.
Support Research & Development activities and engage in POCs/POVs for future improvements.

What we need to see:

A degree in Computer Science, Engineering, or a related field and 5+ years of experience.
Knowledge of HPC and AI solution technologies from CPUs and GPUs to high-speed interconnects and supporting software.
Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s).
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, ACLs and OS-level security protection, and common protocols such as TCP, DHCP, and DNS.
Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS. Familiarity with newer and emerging storage technologies.
Python programming and Bash scripting experience.
Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet/Chef.
Deep knowledge of networking technologies such as InfiniBand and Ethernet.
Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix).
Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud).

Ways to stand out from the crowd:

Knowledge of CPU and/or GPU architecture.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Background with RDMA (InfiniBand or RoCE) fabrics.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.




  • Spain | Aitopics | Full-time

    NVIDIA is looking for an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. We are focused on building supercomputers and HPC clusters based on groundbreaking technologies. We are looking for an outstanding architect for a senior HPC role, to be a key player in the most exciting computing hardware and software and...


  • Spain | AVANSEL SELECCIÓN | Full-time

    AVANSEL SELECCIÓN – Palma de Mallorca, Balearic Islands. We are looking for an HPC (High Performance Computing) Systems Administrator for the Balearic Islands Coastal Observing and Forecasting System (SOCIB), located in Palma de Mallorca (Parc Bit). WHAT WE OFFER: Permanent full-time contract. Job category: III-B-3. Gross salary...

  • HPC Workflows Engineer

    3 weeks ago


    Spain | European Geosciences Union | Full-time

    Job Title: HPC Workflows Engineer (RE1). Type: Full time. Level: Experienced. Preferred Education: PhD. Posted: 28 October 2024. Deadline: The vacancy will remain open until a suitable candidate has been hired. About BSC: The Barcelona Supercomputing Center – Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses...


  • Spain | European Geosciences Union | Full-time

    HPC Engineer for Earth Sciences applications (RE2). Employer: Barcelona Supercomputing Center. Location: Spain. Sector: Atmospheric Sciences (AS), Climate: Past, Present & Future (CL), Earth and Space Science Informatics (ESSI). Type: Full time. Level: Experienced. Preferred education: PhD. Posted: 28 October 2024. Reference: 763_24_ES_HPC_RE2. About BSC: The Barcelona...


  • Spain | Iagservices | Full-time

    Lead AI Engineer / Senior Software Developer. We are recruiting an ambitious Lead AI Engineer to join our IAG technology team. You'll be instrumental in shaping our tech infrastructure, defining the strategic direction of our platforms, and overseeing their execution. This hands-on role involves leading the design and implementation of AI systems with a keen...


  • Spain | Barcelona Supercomputing Center (BSC) | Full-time

    Job Reference 757_24_CS_BPPP_RE1 Position HPC Parallel Performance Engineer (RE1) Closing Date Friday, 15 November, 2024 Reference: 757_24_CS_BPPP_RE1 Job title: HPC Parallel Performance Engineer (RE1) About BSC The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses...


  • Spain | Center for Genomic Regulation | Full-time

    The Institute The Centre for Genomic Regulation (CRG) is an international biomedical research institute of excellence, based in Barcelona, Spain, with more than 400 scientists from 44 countries. The CRG is composed of an interdisciplinary, motivated and creative scientific team supported by a flexible and efficient administration and high-end innovative...


  • Spain | European Geosciences Union e.V. | Full-time

    Job Title: HPC Engineer for Earth Sciences applications (RE2). Employer: Barcelona Supercomputing Center. Location: Spain. Sector: Atmospheric Sciences (AS), Climate: Past, Present & Future (CL), Earth and Space Science Informatics (ESSI). Type: Full time. Level: Experienced. Preferred Education: PhD. Posted: 28 October 2024. Reference: 763_24_ES_HPC_RE2. About BSC: The...

  • AI Solutions Engineer

    3 weeks ago


    Spain | Zurich 56 Company Ltd | Full-time

    AI Solutions Engineer. Our opportunity: We are seeking a highly skilled AI Solutions Engineer to join our team. The ideal candidate will have extensive experience in machine learning (ML) and Generative AI, focusing on both the strategic and tactical development of AI solutions. You will architect and deploy advanced AI systems, particularly around Generative...

  • AI Solutions Engineer

    3 weeks ago


    Spain | Zurich Australian Insurance Ltd. | Full-time

    We are seeking a highly skilled AI Solutions Engineer to join our team. The ideal candidate will have extensive experience in machine learning (ML) and Generative AI, focusing on both the strategic and tactical development of AI solutions. You will architect and deploy advanced AI systems, particularly around Generative AI, leveraging your deep technical...

  • HPC Workflows Engineer

    3 weeks ago


    Spain | Somma | Full-time

    Reference: 761_24_ES_EMW_RE1. Job title: HPC Workflows Engineer (RE1). About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses MareNostrum, one of the most powerful supercomputers in Europe, was a founding and hosting member of the former European HPC infrastructure...


  • Spain | Nielseniq | Full-time

    NIQ is seeking a highly skilled and experienced Senior ML Engineer to join our dynamic team. As a Senior ML Engineer at NIQ, you will play a crucial role in developing and implementing advanced AI/GenAI models and algorithms to solve complex business problems. You will collaborate closely with cross-functional teams to design, build, and deploy scalable...


  • Spain | Barcelona Supercomputing Center (BSC) | Full-time

    Job Reference 763_24_ES_HPC_RE2 Position HPC Engineer for Earth Sciences applications (RE2) Closing Date Thursday, 28 November, 2024 Reference: 763_24_ES_HPC_RE2 Job title: HPC Engineer for Earth Sciences applications (RE2) About BSC The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing...

  • Junior Research Engineer

    3 weeks ago


    Spain | Barcelona Supercomputing Center (BSC) | Full-time

    Job Reference 786_24_CS_AIR_RE1 Position Junior Research Engineer - Support on AI research (RE1) Closing Date Saturday, 16 November, 2024 Reference: 786_24_CS_AIR_RE1 Job title: Junior Research Engineer - Support on AI research (RE1) About BSC The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading...


  • Spain | Barcelona Supercomputing Center (BSC) | Full-time

    Job Reference 788_24_CS_CAOS_RE3 Position Research Engineer on AI libraries and tools (RE3) Closing Date Saturday, 30 November, 2024 About BSC The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses MareNostrum, one of the most powerful...


  • Spain | Barcelona Supercomputing Center (BSC) | Full-time

    Job Reference 721_24_CS_CAOS_RE3 Position Research Engineer on AI HW/SW (RE3) Closing Date Thursday, 31 October, 2024 Reference: 721_24_CS_CAOS_RE3 Job title: Research Engineer on AI HW/SW (RE3) About BSC The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses Mare...

  • AI Institute Coordinator

    3 weeks ago


    Spain | Barcelona Supercomputing Center (BSC) | Full-time

    Job Reference 783_24_DIR_DIR_AIC Position AI Institute Coordinator - AI4S Closing Date Wednesday, 20 November, 2024 About BSC The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses MareNostrum, one of the most powerful...


  • Spain | AstraZeneca GmbH | Full-time

    Onsite in Barcelona role - 3 days in the office and 2 days at home. The Senior AI Engineer will develop and deploy key AI products, generating business and scientific insights through advanced data science techniques. This role involves building models using both foundational and cutting-edge methods, processing structured and unstructured data, and...

  • AI Engineer

    3 weeks ago


    Spain | Top Remote Talent | Full-time

    Our client is a leader in the single-family rental (SFR) investment market, offering a comprehensive platform designed to make real estate investing more accessible, cost-effective, and straightforward. They combine a deep passion for helping investors build wealth through real estate with cutting-edge technology that redefines the investment process. With a...