Lead Platform/Site Reliability Engineer
2 days ago
As a Lead SRE, you'll be instrumental in shaping our systems' future. Your responsibilities will include:

System Reliability Leadership: Develop and execute strategies to achieve unparalleled service reliability and availability. You'll implement cutting-edge best practices, design resilient monitoring solutions, and conduct comprehensive failure injection and failover testing.
Advanced Automation: Spearhead automation initiatives to streamline complex operational tasks, enhancing efficiency and reducing manual interventions. You'll advocate for treating "operations as a software problem" throughout the organization.
Comprehensive Monitoring & Performance: Design and maintain advanced monitoring and alerting systems to assess system health, performance, and user experience. You'll conduct in-depth analysis of metrics and logs to proactively identify and resolve complex issues.
Incident Management & Prevention: Lead during critical incidents, ensuring rapid resolution and clear communication. You'll conduct thorough post-mortem analyses, implement sustainable solutions, and share insights to prevent recurrence. Expect to participate in on-call rotations as a primary escalation point.
Strategic Collaboration: Work closely with development and operations teams to embed reliability principles throughout the software development lifecycle. You'll provide expert guidance, promote SRE best practices, and foster a culture of shared ownership for system reliability.
Capacity Planning & Optimization: Monitor and analyze system capacity and performance data, forecast future demands, and lead efforts to scale infrastructure efficiently to meet growth.
Continuous Improvement & Innovation: Identify areas for systemic improvement in systems, tools, and processes. You'll lead the design and implementation of innovative solutions to enhance reliability, performance, and operational efficiency.
Mentorship & Leadership: Provide technical leadership and mentorship to SREs and other team members, fostering growth and skill development. You'll also contribute to hiring and onboarding processes for new team members.

What You'll Bring:

We're looking for a highly experienced and passionate SRE leader with:

12+ years of experience in Site Reliability Engineering, DevOps, or a related critical operations role, with a proven track record of leading significant reliability initiatives.
A Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent extensive practical experience.
Exceptional proficiency in scripting and programming languages (e.g., Python, Go, Java, Ruby, Bash) for developing advanced automation, tooling, and system integrations.
Extensive hands-on experience with major cloud platforms (e.g., AWS, Google Cloud Platform, Azure) and deep expertise in containerization technologies (Docker, Kubernetes).
Profound understanding of Linux/Unix systems internals, networking protocols, and distributed system architectures.
Expertise in designing and managing CI/CD pipelines and robust version control systems (e.g., Git), advocating for GitOps principles.
Mastery of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, OpenTelemetry).
Superior problem-solving skills, critical thinking, and meticulous attention to detail, especially under pressure.
Outstanding communication, interpersonal, and collaboration skills, with the ability to influence and lead cross-functional teams.
Proven ability to thrive and lead in a fast-paced, highly dynamic, and complex technical environment.
Expert-level debugging and root cause analysis capabilities across complex distributed systems.

Bonus Points For:

Extensive experience with infrastructure as code (IaC) tools (e.g., Terraform, Ansible, Pulumi).
Deep knowledge of various database systems (relational and NoSQL) and advanced data management strategies.
Significant experience designing, implementing, and operating microservices architectures.
Contributions to open-source projects related to SRE, operations, or cloud-native technologies.

This role offers a unique opportunity to make a significant impact on our core services and directly influence our engineering culture around reliability.
-
Site Reliability Engineer
2 days ago
Hong Kong, Hong Kong SAR China Ashford Benjamin Ltd Full time
We are exclusively representing a global investment firm that stands at the intersection of advanced mathematics, cutting-edge technology, and global finance. They operate a world-class, high-performance technology stack (including C++, Python, KDB+, and FPGA) to identify and execute on systematic opportunities. They are now seeking to add a strategic Site...
-
Site reliability engineer
4 days ago
Hong Kong, Hong Kong SAR China Tek Systems Full time
Fintech. Innovative project. Leading digital transformation project. Site Reliability Engineer (SRE). Experience: 2-3 years. Location: HK Island. About the Role: We are looking for a skilled and motivated Site Reliability Engineer (SRE) to join a dynamic, forward-thinking team working on high-impact projects that shape the future of technology in the financial...
-
Senior Platform Engineer/Squad Lead, Kafka
2 days ago
Hong Kong Island, Hong Kong SAR China Senior Platform Engineer/Squad Lead, Kafka Full time
Mox is built by and for the ones who aspire to live life to the fullest – we call them Generation Mox! The name Mox reflects the endless opportunities we can create. Why Mox: Everything at Mox – from our...
-
Lead Site Reliability Engineer
3 days ago
Hong Kong, Hong Kong SAR China IO TECH SOLUTIONS LIMITED Full time
My client is seeking a Site Reliability Engineer (SRE) to join their growing infrastructure team in Asia. As an SRE, you will play a critical role in ensuring the stability, scalability, and resilience of our production systems. You'll help our client scale our services, define and enforce reliability standards, and continuously improve our engineering...
-
Lead Site Reliability Engineer
4 days ago
Hong Kong Island, Hong Kong SAR China IO TECH SOLUTIONS LIMITED Full time
My client is seeking a Site Reliability Engineer (SRE) to join their growing infrastructure team in Asia. As an SRE, you will play a critical role in ensuring the stability, scalability, and resilience of our production systems. You'll help our client scale our services, define and enforce reliability standards, and continuously improve our engineering...
-
Site Reliability Engineer
4 days ago
Hong Kong Island, Hong Kong SAR China Flow Traders Full time
Flow Traders is looking for an experienced Site Reliability Engineer to join our team and play a key role in building, maintaining, and scaling our cloud-based platform. The ideal candidate possesses a great sense of ownership and passion for working with the latest technologies. This is a unique opportunity to join a leading proprietary trading firm with an...
-
Site Reliability Engineer
10 hours ago
Hong Kong Island, Hong Kong SAR China Goliath Partners Full time
A global financial technology leader is seeking Site Reliability Engineers to help scale and support the systems behind high-performance investment strategies. You’ll join a collaborative engineering team focused on automation, reliability, and solving complex technical problems at scale. What You’ll Do: Ensure the reliability and performance of...
-
Site Reliability Engineer
2 weeks ago
Hong Kong, Hong Kong SAR China Goliath Partners Full time
A global financial technology leader is seeking Site Reliability Engineers to help scale and support the systems behind high-performance investment strategies. You’ll join a collaborative engineering...
-
Site Reliability Engineer
1 week ago
Hong Kong, Hong Kong SAR China Qube Research & Technologies Full time
Qube Research & Technologies (QRT) is a global quantitative and systematic investment manager, operating in all liquid asset classes across the world. We are a technology and data driven group implementing a scientific approach to investing. Combining data, research, technology, and trading expertise has shaped our...
-
Senior Platform Engineer
10 hours ago
Hong Kong Island, Hong Kong SAR China Senior Platform Engineer/Squad Lead, Kafka Full time
A leading tech company in Hong Kong seeks a Senior Platform Engineer/Squad Lead to build and maintain a modern cloud-native data platform. Applicants should have over 12 years of experience in IT, focusing on database and messaging system management, with proficiency in AWS services, Kafka, and data governance. This role requires a bachelor's degree and...