This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Machine Learning Engineer - Training Platform in Australia.
You will join a high-impact AI Platform group focused on building the foundational systems that power large-scale model training across a global product ecosystem. In this role, you will design and evolve the infrastructure that enables distributed AI training workloads to run reliably, efficiently, and at scale. You will work on a Kubernetes-based training platform and contribute to the full training lifecycle, including orchestration, experiment management, and artifact handling. Your work will directly support research scientists, ML engineers, and product teams in deploying advanced AI capabilities. You will collaborate across infrastructure, cloud, and applied AI teams to solve complex distributed systems challenges. This is a highly cross-functional environment where platform engineering meets cutting-edge generative AI innovation.
Accountabilities:
- Design, build, and scale the core training platform infrastructure supporting distributed AI workloads across multiple teams and use cases.
- Improve reliability, observability, debugging, and operational performance of large-scale training systems.
- Develop and enhance scheduling capabilities, including resource allocation, workload prioritization, and quota management for AI training jobs.
- Collaborate with research scientists, ML engineers, and infrastructure teams to optimize training workflows and system performance.
- Contribute to architecture and system design decisions for scalable AI infrastructure.
- Identify user pain points and translate them into platform improvements and roadmap priorities.
- Mentor engineers and promote best practices in distributed systems and AI infrastructure development.
Requirements:
- Strong experience in machine learning infrastructure, distributed systems, or large-scale AI training pipelines.
- Hands-on expertise with containerized environments and orchestration using Kubernetes.
- Familiarity with distributed training frameworks such as Ray or PyTorch Distributed.
- Experience working with cloud infrastructure that supports high-performance workloads (e.g., storage systems, networking, HPC environments).
- Strong systems design skills with the ability to build scalable, reliable, and maintainable platforms.
- Excellent collaboration skills, with experience working alongside ML engineers, researchers, and infrastructure teams.
- Strong ownership mindset and ability to solve complex cross-functional engineering problems.
- Passion for improving developer experience and enabling AI at scale.
Benefits:
- Equity packages to share in long-term company success.
- Inclusive parental leave supporting all parents and carers.
- Annual wellbeing and lifestyle allowance to support personal and professional needs.
- Flexible leave options to encourage rest, recharge, and meaningful time away.
- Remote-friendly working model within Australia with flexible work arrangements.
- Opportunities to work on cutting-edge AI infrastructure at global scale.
- Collaboration with world-class engineers, researchers, and infrastructure experts.