Sr. HPC Architect - Hybrid
Caris Life Sciences
Position Summary
The Senior HPC Architect designs and optimizes high-performance computing (HPC) systems, applying expertise in parallel programming, performance analysis, and hardware architecture to build scalable, efficient solutions for demanding computational workloads. The role often involves collaborating with software developers and hardware engineers to achieve optimal performance across complex scientific or data-intensive applications.
Job Responsibilities
System Design and Implementation:
Architecting and designing high-performance computing clusters, selecting appropriate hardware components like CPUs, GPUs, storage systems, and networking infrastructure.
Installing and configuring operating systems (typically Linux) on cluster nodes.
Setting up and managing distributed file systems (e.g., Lustre, Ceph, GPFS) for large-scale data storage and access.
Implementing job scheduling systems (e.g., LSF, Slurm, PBS) to manage workload distribution across the cluster.
Performance Optimization:
Monitoring system performance metrics (CPU utilization, memory usage, network bandwidth) to identify bottlenecks and optimize resource allocation.
Benchmarking applications and performing performance analysis to identify areas for improvement.
Tuning application code for parallel processing to leverage the power of the HPC cluster.
User Support:
Providing technical support to researchers and users on how to access and utilize the HPC system.
Training users on best practices for submitting jobs and optimizing their applications for the HPC environment.
Troubleshooting user issues related to application execution, data management, and system access.
System Administration:
Managing system updates, patching, and security configurations to maintain a stable and secure HPC environment.
Implementing backup and disaster recovery procedures for critical data and system configurations.
Monitoring system health and proactively addressing potential issues through alerts and logging systems.
Required Qualifications
Minimum of five years’ experience in Linux systems administration.
Bachelor's degree in computer science, engineering, math, or a scientific discipline with 2+ years of systems engineering experience; or 6 years’ experience in HPC architecture.
Hands-on HPC architecture design experience, including storage, file systems, InfiniBand, security, authentication, and compute architecture.
Experience using Git to manage shared software configuration code bases.
Hands-on experience with cloud-based services (e.g., Azure, AWS, GCP).
Good understanding of storage administration and optimization, such as performing upgrades and defining RAID configurations.
Deep understanding of parallel computing concepts and programming paradigms (MPI, OpenMP, CUDA).
Expertise in performance analysis tools and techniques to identify and address performance bottlenecks.
Knowledge of HPC hardware architectures, including processors, memory subsystems, network fabrics, and interconnects.
Familiarity with HPC software stack components such as compilers, runtime systems, job schedulers, and scientific libraries.
Strong programming skills in languages commonly used in HPC (C, C++, Fortran).
Strong skills with scripting languages (e.g., Bash, ksh, Perl, Python) for automation.
Experience with system administration and cluster management tools (e.g., LSF, Slurm, PBS).
Experience with distributed file systems (Lustre, Ceph, GPFS).
Excellent communication and problem-solving abilities to effectively collaborate with cross-functional teams.
Preferred Qualifications
Experience in life sciences, healthcare, and/or research institutions highly preferred.
Experience building and installing scientific software and other third-party software applications on HPC systems.
Experience with HPC schedulers and resource managers.
Experience executing scientific software on HPC systems.
Experience writing user documentation.
Strong technical and analytical skills.
Strong verbal and written communication skills.
Always maintains the highest level of professionalism when interacting with internal and external customers.
Demonstrates a high-energy, positive attitude and commitment to quality customer service.
Contributes to a positive team environment within the center by demonstrating a strong work ethic, effectively communicating with others, and proactively anticipating center and user needs.
Experience coordinating and running support teams.
Related industry certifications preferred.
Physical Demands
Ability to lift, move and install HPC data center hardware and supplies.
Standing for extended periods while performing data center related tasks.
Training
All job-specific, safety, and compliance training is assigned based on the employee's job functions.
Other
This position requires periodic travel and some evenings, weekends, and/or holidays.
Job may require after-hours response to emergency issues.
Periodically scheduled on-call duty may require after-hours response to technical emergencies not explicitly related to assigned job responsibilities.
Conditions of Employment: Individual must successfully complete the pre-employment process, which includes a criminal background check, drug screening, credit check (applicable for certain positions), and reference verification.
This job description reflects management’s assignment of essential functions. Nothing in this job description restricts management’s right to assign or reassign duties and responsibilities to this job at any time.
Caris Life Sciences is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, status as a protected veteran, among other things, or status as a qualified individual with disability.