Professional Services | Full-time | Fully remote
Job Overview:
The Cambridge Computer Services HPC AI Network Technologist is a field-based consultant who builds end-to-end research computing system solutions focusing on networking design and implementation. The HPC AI Network Technologist is responsible for designing, implementing, and optimizing the network infrastructure that supports high-performance computing (HPC) and artificial intelligence (AI) workloads. This role ensures that network solutions are scalable, reliable, secure, and capable of meeting the demanding performance needs of advanced computational environments. You will leverage your expertise in scientific computing and knowledge of the technology landscape to drive outcomes that exceed client expectations.
Responsibilities and Duties:
- Gather client networking requirements and design optimized solutions, sometimes using a single vendor's portfolio but more often drawing on a broad variety of vendors and technologies.
- Design, implement, and maintain high-throughput, low-latency network architectures for HPC and AI clusters, including technologies such as InfiniBand, Ethernet (100/400GbE), and advanced routing and switching.
- Deploy a new HPC/AI solution from the ground up or augment an existing one.
- Maintain industry knowledge of multiple Network Operating Systems (e.g., Cumulus Linux, SONiC).
- Configure and optimize network devices (switches, routers, firewalls) to ensure maximum performance and reliability for HPC and AI workloads.
- Monitor network performance, troubleshoot issues, and resolve connectivity, latency, and security problems.
- Implement and maintain network security measures, including firewalls, intrusion detection systems, and secure access protocols.
- Document network configurations, procedures, and troubleshooting steps for operational continuity.
- Consult on and assist with day-to-day management of clients' research computing infrastructure environments.
- Maintain HPC and AI infrastructure in Linux-based environments for new and existing clients.
- Lead technical discussions and be the face of Cambridge Computer to the client in preparation for and during engagements.
- Validate that solution designs meet client requirements and confirm that each system is technically feasible and deployable.
- Ensure solutions are simple and easy to understand while considering the client’s overall capabilities / skills.
- Scope out and detail professional services deliverables, setting clear client expectations.
- Build documentation and provide knowledge transfer required for clients to support their environments.
- Display expertise not only in networking but in storage, data protection, digital archiving, and other infrastructure technologies.
- Gain advanced expertise in, and certifications from, the vendors Cambridge Computer uses in our solution stack.
Qualifications:
- Candidates must have at least 5 years of experience providing networking deployment services and/or cluster administration.
- University undergraduate degree in Computer Science, Computer Engineering, or a science-related field required.
- In-depth knowledge of networking protocols (TCP/IP, DNS, VLANs, routing protocols) and hands-on experience with network equipment.
- Networking certifications from Cisco, NVIDIA Networking (Mellanox), Juniper, Arista, HPE Aruba, and other manufacturers preferred.
- Candidates must also display solid knowledge of GPU-focused hardware/software and Linux system administration (package management, IP networking, troubleshooting, etc.). They must also have solid fundamentals in cluster design/management technologies (Bright, Warewulf, xCAT, etc.), a background with storage technologies and parallel filesystems (Lustre, GPFS, BeeGFS, etc.), experience with networking and configuring network switches (Ethernet and InfiniBand), acquaintance with HPC schedulers (SLURM, UGE, LSF, etc.) and programming / libraries (MPI, CUDA, etc.), and proficiency with scripting (Bash, Python, etc.).
- Have deep knowledge of tech industry leaders including AMD, Cisco, DDN, Dell, HPE, IBM, Intel, Juniper, Lenovo, Microsoft, NVIDIA, Oracle, VAST, VMware, WEKA, and others.
- Familiarity with containerized environments (e.g. Kubernetes, Docker) and their networking requirements.
- As this is a field-based role, the employee must be able to work remotely, independently, and unsupervised. Travel is expected approximately 50% of the time, including short day trips.
- Candidates must have impeccable communication skills, an ability to multitask, and high attention to detail. They must be effective problem solvers who are organized, creative, intellectually curious, comfortable with ambiguity, and able to work with different types of personalities.
- Authorization to work in the United States on a full-time basis required.
Application Materials:
- Cover letter
- Resume
Benefits:
- Competitive salary
- Multiple health insurance options
- Medical FSA and Dependent Care FSA
- Dental insurance
- Vision insurance
- 401(k) savings plan with employer matching
- Employer-sponsored long-term disability
- Paid holidays and PTO that increases with longevity at the company
- Discounted health club membership
- Convenient parking
- Opportunities for growth!