
Machine Learning Infrastructure Engineer

Logic Software Solutions
Full-time
Remote
United States
AI and Machine Learning

Job Title: Machine Learning Infrastructure Engineer, GenAI Technology

Location: New York (Remote)
Job Type: Experienced Professional
Department: Software & System Engineering

About the Role

We are seeking a highly skilled Machine Learning Infrastructure Engineer to join our growing Generative AI Technology team. This team is at the vanguard of reimagining our industry, tasked with constantly evolving our IT infrastructure and engineering capabilities. We experiment with and harness the latest open-source solutions, modern cloud architectures, and sophisticated AI systems within an enterprise agile environment. Our commitment to innovation provides the framework for smarter decision-making and enhances how we build and operate all our platforms and applications.

In this role, you will be the technical backbone for our large-scale generative AI initiatives. You will architect, build, and maintain the critical infrastructure systems that enable our researchers and engineers to develop, train, and deploy models at scale. Your work will have a direct and measurable impact on the velocity and reliability of our AI-driven business solutions. We are deeply committed to your professional growth, supporting you in advancing your technical skills, contributing novel ideas, and satisfying your intellectual curiosity from day one.

Core Responsibilities

You will focus on enabling faster model iteration and reliable, scalable, cost-effective AI systems. Key responsibilities:

  • Design high-performance compute infrastructure: Architect and deploy scalable GPU/accelerator clusters (cloud and on-prem) to support large training and inference workloads.

  • Engineer distributed ML systems: Build and operate end-to-end workflows—from data preprocessing and large-scale training to hyperparameter optimization (HPO) and high-throughput inference—ensuring reliability and performance.

  • Optimize models for production: Work with researchers and engineers to productionize models, maximize GPU utilization, increase training throughput, and reduce inference latency.

  • Automate CI/CD and MLOps: Implement Infrastructure-as-Code, container orchestration, and repeatable CI/CD pipelines to enable rapid, safe deployments and self-service.

  • Implement observability & FinOps: Build monitoring, alerting, logging, and cost-management for heterogeneous GPU environments to deliver predictable performance and transparent spend.

  • Evaluate and integrate new tech: Lead benchmarking and adoption of emerging hardware (GPUs, interconnects) and software (frameworks, compilers) to improve scalability and throughput.

  • Own security and operations: Define security posture, secrets management, RBAC, compliance automation, and incident response for Generative AI infrastructure.

  • Profile and troubleshoot deep stacks: Diagnose and resolve performance issues across CUDA kernels, drivers, and orchestration layers to remove bottlenecks.

  • Document and mentor: Produce clear architecture decisions and runbooks, and mentor engineers to accelerate adoption of production-ready infrastructure best practices.

Required Technical Stack & Expertise

This role demands a unique blend of deep systems engineering and machine learning operations experience. The successful candidate's expertise will span the following specific tools and technologies:

Cloud & Infrastructure Core

  • Public Cloud Platforms: Deep architecture and implementation experience in at least one of AWS (e.g., EC2 P4d/P5, EKS, ParallelCluster), GCP (e.g., A3 VMs, GKE, Cloud TPU), or Azure (e.g., NDm A100 v4, AKS). Deep understanding of cloud networking, with a focus on high-performance interconnects (InfiniBand, RoCE) and parallel file systems (Lustre, Weka, FSx for Lustre).

  • Container Orchestration: Expert production experience with Kubernetes, including proficiency with operator patterns, custom resource definitions (CRDs), GPU device plugins, and multi-node topology-aware scheduling. Experience with service meshes (Istio, Linkerd) is a plus.

  • Infrastructure-as-Code (IaC): Fluency in Terraform or Pulumi for provisioning and managing the full technology stack, with a strong emphasis on immutable infrastructure principles and GitOps workflows. Experience with configuration management tools like Ansible is expected.

Machine Learning Platform & MLOps

  • Orchestration & Pipelines: Production-grade experience composing complex DAGs with Apache Airflow, Kubeflow Pipelines, or Argo Workflows.

  • Experiment Tracking & Model Registry: Hands-on use of MLflow, Weights & Biases, or Neptune for experiment tracking, model versioning, and lineage management.

  • Distributed Training & Hyperparameter Optimization: Deep operational knowledge of frameworks like Ray (Core, Train, Tune, Serve), PyTorch Distributed (DDP, FSDP), or DeepSpeed. Experience with HPO libraries such as Optuna or Ray Tune is required.

  • Model Serving Infrastructure: Experience building and managing high-performance inference systems using NVIDIA Triton Inference Server, Ray Serve, TorchServe, or BentoML with a focus on optimization techniques like dynamic batching, model quantization, and concurrent model execution.

Programming & Systems Development

  • Primary Language: Expert-level proficiency in Python for all aspects of infrastructure tooling, automation, and SDK development.

  • Systems Programming: Demonstrated ability in one or more systems languages such as Go, C++, or Rust, used for building high-performance infrastructure components, controllers, or operators.

  • Performance & Profiling: Strong debugging and profiling skills using tools like NVIDIA Nsight Systems, NVIDIA Nsight Compute, py-spy, or flamegraphs to analyze and optimize CPU/GPU interactions, memory bandwidth, and kernel performance.

Monitoring, Observability & FinOps

  • Telemetry Stack: Proven experience implementing monitoring and alerting stacks using Prometheus, Grafana, and the ELK (Elasticsearch, Logstash, Kibana) stack or Grafana Loki. Must know how to build custom metrics and dashboards for GPU-centric telemetry (e.g., SM utilization, NVLink bandwidth, GPU temperature and power draw).

  • GPU Cost Management: Familiarity with tools and strategies for GPU resource accounting and cost allocation, such as OpenCost, Kubecost, or cloud-native cost and usage reports, to provide chargeback or showback visibility.

Security & Compliance

  • Secrets & Access: Experience with secrets management tools like HashiCorp Vault, advanced Kubernetes RBAC, and workload identity frameworks (e.g., AWS IRSA, GCP Workload Identity).

  • Security Posture: Understanding of container vulnerability scanning, image signing (Cosign/Notary), and policy-as-code (OPA/Gatekeeper) in the context of securing ML pipelines and data.

What’s Required (Qualifications)

  • Bachelor's or master's degree in computer science, electrical engineering, or a closely related technical field.

  • 5–7 years of proven, hands-on experience building and maintaining scalable compute platforms or machine learning infrastructure systems.

  • Deep, demonstrable understanding of distributed systems theory and practice.

  • Strong understanding of reinforcement learning concepts and their specific infrastructure implications and complexities.

  • Excellent collaboration and communication skills with a pragmatic, systems-thinking mindset.

  • An unwavering commitment to the highest ethical standards.

Benefits & Culture

We invest in our people, their careers, and their total well-being. Our comprehensive package is designed to support you professionally and personally:

  • Fully paid health care benefits

  • Generous parental and family leave policies

  • Comprehensive mental and physical wellness programs

  • Paid time off for volunteer opportunities and a non-profit matching gift program

  • Support for employee-led affinity groups representing diverse communities

  • Tuition assistance for continued learning

  • 401(k) retirement savings program with a substantial employer match