Senior HPC AI Cluster Engineer
3 weeks ago
NVIDIA is looking for an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. We are focused on building supercomputers and HPC clusters based on groundbreaking technologies. We are looking for an outstanding architect for a senior HPC role, to be a key player in the most exciting computing hardware and software and contribute to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and deep learning software and hardware platforms, and with many scientific researchers, developers, and customers to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.
What you will be doing:
Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
Manage Linux job/workload schedulers and orchestration tools.
Develop and maintain continuous integration and delivery pipelines.
Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
Deploy monitoring solutions for the servers, network, and storage.
Troubleshoot issues at every level: bare metal, operating system, software stack, and application.
As a technical resource, develop, redefine, and document standard methodologies to share with internal teams.
Support Research & Development activities and engage in POCs/POVs for future improvements.
What we need to see:
A degree in Computer Science, Engineering, or a related field and 5+ years of experience.
Knowledge of HPC and AI solution technologies from CPUs and GPUs to high-speed interconnects and supporting software.
Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s).
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, ACLs and OS-level security protection, and common protocols such as TCP, DHCP, and DNS.
Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS. Familiarity with newer and emerging storage technologies.
Python programming and Bash scripting experience.
Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet/Chef.
Deep knowledge of networking technologies such as InfiniBand and Ethernet.
Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix).
Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud).
Ways to stand out from the crowd:
Knowledge of CPU and/or GPU architecture.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Background with RDMA (InfiniBand or RoCE) fabrics.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
-
HPC Systems Administrator
3 weeks ago
Spain, AVANSEL SELECCIÓN, Full-time
AVANSEL SELECCIÓN – Palma de Mallorca, Balearic Islands. We are looking for an HPC (High Performance Computing) Systems Administrator for the Balearic Islands Coastal Observing and Forecasting System (SOCIB), located in Palma de Mallorca (Parc Bit). What is offered: Permanent full-time contract. Job category: III-B-3. Gross salary...
-
HPC Workflows Engineer
3 weeks ago
Spain, European Geosciences Union, Full-time
Job Title: HPC Workflows Engineer (RE1). Type: Full time. Level: Experienced. Preferred Education: PhD. Posted: 28 October 2024. Deadline: The vacancy will remain open until a suitable candidate has been hired. About BSC: The Barcelona Supercomputing Center – Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses...
-
HPC Engineer for Earth Sciences applications
2 weeks ago
Spain, European Geosciences Union, Full-time
HPC Engineer for Earth Sciences applications (RE2). Employer: Barcelona Supercomputing Center. Location: Spain. Sector: Atmospheric Sciences (AS); Climate: Past, Present & Future (CL); Earth and Space Science Informatics (ESSI). Type: Full time. Level: Experienced. Preferred education: PhD. Posted: 28 October 2024. Reference: 763_24_ES_HPC_RE2. About BSC: The Barcelona...
-
Lead AI Engineer / Senior Software Developer
3 weeks ago
Spain, Iagservices, Full-time
Lead AI Engineer / Senior Software Developer. We are recruiting an ambitious Lead AI Engineer to join our IAG technology team. You'll be instrumental in shaping our tech infrastructure, defining the strategic direction of our platforms, and overseeing their execution. This hands-on role involves leading the design and implementation of AI systems with a keen...
-
HPC Parallel Performance Engineer
3 weeks ago
Spain, Barcelona Supercomputing Center (BSC), Full-time
Job Reference: 757_24_CS_BPPP_RE1. Position: HPC Parallel Performance Engineer (RE1). Closing Date: Friday, 15 November, 2024. Reference: 757_24_CS_BPPP_RE1. Job title: HPC Parallel Performance Engineer (RE1). About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses...
-
Scientific DevOps Engineer
2 weeks ago
Spain, Center for Genomic Regulation, Full-time
The Institute: The Centre for Genomic Regulation (CRG) is an international biomedical research institute of excellence, based in Barcelona, Spain, with more than 400 scientists from 44 countries. The CRG is composed of an interdisciplinary, motivated and creative scientific team supported by a flexible and efficient administration and high-end innovative...
-
HPC Engineer for Earth Sciences applications
3 weeks ago
Spain, European Geosciences Union e.V., Full-time
Job Title: HPC Engineer for Earth Sciences applications (RE2). Employer: Barcelona Supercomputing Center. Location: Spain. Sector: Atmospheric Sciences (AS), Climate: Past, Present & Future (CL), Earth and Space Science Informatics (ESSI). Type: Full time. Level: Experienced. Preferred Education: PhD. Posted: 28 October 2024. Reference: 763_24_ES_HPC_RE2. About BSC: The...
-
AI Solutions Engineer
3 weeks ago
Spain, Zurich 56 Company Ltd, Full-time
AI Solutions Engineer. Our opportunity: We are seeking a highly skilled AI Solutions Engineer to join our team. The ideal candidate will have extensive experience in machine learning (ML) and Generative AI, focusing on both the strategic and tactical development of AI solutions. You will architect and deploy advanced AI systems, particularly around Generative...
-
AI Solutions Engineer
3 weeks ago
Spain, Zurich Australian Insurance Ltd., Full-time
We are seeking a highly skilled AI Solutions Engineer to join our team. The ideal candidate will have extensive experience in machine learning (ML) and Generative AI, focusing on both the strategic and tactical development of AI solutions. You will architect and deploy advanced AI systems, particularly around Generative AI, leveraging your deep technical...
-
HPC Workflows Engineer
3 weeks ago
Spain, Somma, Full-time
Reference: 761_24_ES_EMW_RE1. Job title: HPC Workflows Engineer (RE1). About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses MareNostrum, one of the most powerful supercomputers in Europe, was a founding and hosting member of the former European HPC infrastructure...
-
Senior Machine Learning Engineer
3 weeks ago
Spain, Nielseniq, Full-time
NIQ is seeking a highly skilled and experienced Senior ML Engineer to join our dynamic team. As a Senior ML Engineer at NIQ, you will play a crucial role in developing and implementing advanced AI/GenAI models and algorithms to solve complex business problems. You will collaborate closely with cross-functional teams to design, build, and deploy scalable...
-
HPC Engineer for Earth Sciences applications
3 weeks ago
Spain, Barcelona Supercomputing Center (BSC), Full-time
Job Reference: 763_24_ES_HPC_RE2. Position: HPC Engineer for Earth Sciences applications (RE2). Closing Date: Thursday, 28 November, 2024. Reference: 763_24_ES_HPC_RE2. Job title: HPC Engineer for Earth Sciences applications (RE2). About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing...
-
Junior Research Engineer
3 weeks ago
Spain, Barcelona Supercomputing Center (BSC), Full-time
Job Reference: 786_24_CS_AIR_RE1. Position: Junior Research Engineer - Support on AI research (RE1). Closing Date: Saturday, 16 November, 2024. Reference: 786_24_CS_AIR_RE1. Job title: Junior Research Engineer - Support on AI research (RE1). About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading...
-
Research Engineer on AI libraries and tools
3 weeks ago
Spain, Barcelona Supercomputing Center (BSC), Full-time
Job Reference: 788_24_CS_CAOS_RE3. Position: Research Engineer on AI libraries and tools (RE3). Closing Date: Saturday, 30 November, 2024. About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses MareNostrum, one of the most powerful...
-
Research Engineer on AI HW/SW
3 weeks ago
Spain, Barcelona Supercomputing Center (BSC), Full-time
Job Reference: 721_24_CS_CAOS_RE3. Position: Research Engineer on AI HW/SW (RE3). Closing Date: Thursday, 31 October, 2024. Reference: 721_24_CS_CAOS_RE3. Job title: Research Engineer on AI HW/SW (RE3). About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses Mare...
-
AI Institute Coordinator
3 weeks ago
Spain, Barcelona Supercomputing Center (BSC), Full-time
Job Reference: 783_24_DIR_DIR_AIC. Position: AI Institute Coordinator - AI4S. Closing Date: Wednesday, 20 November, 2024. About BSC: The Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) is the leading supercomputing center in Spain. It houses MareNostrum, one of the most powerful...
-
Associate Principal AI Engineer
3 weeks ago
Spain, AstraZeneca GmbH, Full-time
Onsite in Barcelona role - 3 days in the office and 2 days at home. The Senior AI Engineer will develop and deploy key AI products, generating business and scientific insights through advanced data science techniques. This role involves building models using both foundational and cutting-edge methods, processing structured and unstructured data, and...
-
AI Engineer
3 weeks ago
Spain, Top Remote Talent, Full-time
Our client is a leader in the single-family rental (SFR) investment market, offering a comprehensive platform designed to make real estate investing more accessible, cost-effective, and straightforward. They combine a deep passion for helping investors build wealth through real estate with cutting-edge technology that redefines the investment process. With a...