Director of AI Cluster Delivery
2 weeks ago
PaleBlueDot AI

1. Strategic Design and Architecture Planning
Lead the overall architecture design of overseas AI compute clusters, encompassing compute, network, storage, and liquid cooling systems.
Develop a deep understanding of customers’ AI workload requirements and translate them into advanced, reliable, and scalable technical solutions.

2. End-to-End Construction and Delivery Management
Take full responsibility for the entire lifecycle of large-scale overseas AI clusters (on the scale of tens of thousands of GPUs), from planning, equipment procurement, incoming inspection, rack installation, cabling, and tuning to final go-live.
Lead and continuously optimize cluster deployment processes to ensure on-time and on-budget delivery.
Manage seamless collaboration among data center facility teams, hardware vendors, and liquid cooling system suppliers.

3. Operations System Development and Incident Management
Build and lead a high-performing, multicultural overseas operations team; establish a 24/7 operations framework, SOPs, and emergency response plans.
Design the cluster monitoring, alerting, logging, and performance analysis platforms to enable comprehensive observability and health management.
Serve as the top technical escalation point, taking charge of complex, high-impact incidents; perform root cause analysis and drive systematic improvements.

4. Customer and Technical Engagement
Act as a technical authority, directly interfacing with customers’ technical teams to deliver solution presentations, POC support, and in-depth technical discussions.
Ensure delivered cluster services meet or exceed customers’ SLA requirements, improving customer satisfaction and retention.

5. Operations and Cost Optimization
Continuously optimize cluster operational efficiency, focusing on KPIs such as PUE, WUE, and compute utilization.
Develop and control the operations budget, identifying cost optimization opportunities while maintaining superior service quality.

Qualifications
Experience: Bachelor’s degree or above in Computer Science, Electrical Engineering, or related fields; minimum of 10 years of experience in large-scale data center or HPC/AI cluster operations and management.
Overseas Project Expertise: Demonstrated success in the construction and delivery of advanced overseas AI compute clusters or hyperscale data centers; thorough understanding of overseas project operations, compliance, and cultural contexts.
Architecture Design: Deep expertise in AI cluster architectures (e.g., NVIDIA DGX/SuperPOD, GPU-as-a-Service); strong understanding of InfiniBand/RoCE networks and distributed storage systems.
Liquid Cooling Technologies: Hands-on experience with immersion or cold-plate liquid-cooled cluster deployment and operations; familiarity with system mechanics, maintenance challenges, and risk management.
Systems Operations: Proficient in Linux OS, cluster scheduling systems (Slurm/Kubernetes), monitoring tools (Prometheus/Grafana), and automation (Ansible/Python).
Leadership: Minimum of 5 years of technical team management experience; proven ability to build, lead, and motivate high-caliber engineering teams in cross-cultural environments.
Customer Orientation: Excellent communication and presentation skills; capable of engaging effectively with internal and external customers in technical discussions and solution briefings.
Language Skills: Fluency in English and Mandarin for both written and verbal communication is preferred.

Seniority level: Director
Employment type: Full-time
Job function: Information Technology and Project Management
Industries: Software Development
-
Cluster Manager
5 days ago
Hong Kong, Hong Kong SAR China DayOne Full time
Join DayOne – Shaping the Future of Data Infrastructure. DayOne is a global leader in the development and operation of high-performance data centers. As one of the fastest-growing companies in the industry, we’ve built a robust presence across Asia and Europe, and we’re just getting started. As we...
-
Principal Cloud Architect – HPC/GPU
2 weeks ago
Hong Kong, Hong Kong SAR China Oracle Full time
Principal Cloud Architect – HPC/GPU & AI Platform Solutions. As a Principal Cloud Architect, you will be at the forefront of designing and implementing next-generation accelerated computing and AI solutions on Oracle Cloud Infrastructure (OCI). You will engage directly with startup to strategic customers, helping them architect and deploy complex HPC and...
-
AI Solutions Project Manager
7 days ago
Hong Kong Island, Hong Kong SAR China Datax AI Solutions Full time
Rooted in Hong Kong, Datax is at the forefront of developing and commercializing cutting-edge AI solutions for local enterprises. Our vibrant team of 20 young and driven professionals has delivered over 40 successful projects across diverse sectors including finance, construction, and government. We are an...
-
Sourcing Director
3 days ago
Hong Kong Island, Hong Kong SAR China GMI Cloud Full time
About GMI Cloud: GMI Cloud is a fast-growing AI infrastructure company backed by Headline VC and one of only six cloud providers worldwide to earn NVIDIA’s prestigious Reference Platform Cloud Partner designation. We operate 8 of our own GPU clusters...
-
AI Infrastructure Administrator – HPC
1 day ago
Hong Kong, Hong Kong SAR China The Chinese University of Hong Kong Full time
A leading educational institution in Hong Kong is seeking an Assistant Computer Officer to manage and support its AI infrastructure. Responsibilities include maintaining HPC clusters, installing software stacks, enforcing security standards, and providing technical support. Ideal candidates will hold a degree in Computer Science, possess relevant IT...
-
AVP, Cluster Manager
5 days ago
Hong Kong Island, Hong Kong SAR China OCBC Full time
AVP, Cluster Manager - Branch Operations. Lead and contribute to projects for improving branch operation efficiency via centralization, automation and process simplification. Support branches in various daily operational and administration matters....
-
AI Ops Manager
3 days ago
Hong Kong Island, Hong Kong SAR China Leadingnation Full time
Job Responsibilities: Assist the ITOC leaders in overseeing and executing work related to AI servers, AI systems, AI applications, and other technical platforms. Responsible for the construction and optimization of the monitoring system in the AI field within the ITOC. Develop and implement operational strategies for AI servers, AI systems, AI applications,...
-
Hong Kong Island, Hong Kong SAR China The University of Hong Kong Full time
Applications are invited for appointment as Director/Associate Director (Advancement Intelligence) (at the rank of Assistant Registrar) in the Development & Alumni Affairs Office (DAAO) (Ref.: ) (to commence as soon as possible, on a two-year fixed-term basis with the possibility of renewal subject to satisfactory performance). The University of Hong Kong...
-
Hong Kong Island, Hong Kong SAR China Fano (Fano Labs) Full time
Director of Product – AI Solutions for Financial Services. Position Summary: As a Director of Product – AI Solutions for Financial Services in a cutting-edge AI company, you will own the vision, strategy, and execution of innovative B2B technology products designed to transform the financial services industry. This role requires a rare blend of deep...