Jobs at Vizcom

Senior Software Engineer, Distributed Systems & Infrastructure

at Vizcom • Full-time

Location: In-office (San Francisco, United States)

Experience: 5+ years

Compensation: $180k-$220k • 0.2%-0.5% equity

Posted 15d ago

by Kaelan Richards

Highlights

ESOP: Yes

About this Opportunity

About Us

At Vizcom, we empower designers at companies like Nike, General Motors, and Riot Games to turn ideas into reality faster and with more precision. Our tools integrate seamlessly into workflows, providing real-time feedback that bridges creativity and manufacturability.

We’re building a high-performance, reliable job-scheduling system that powers distributed AI/ML workflows and ephemeral jobs. Our platform must handle large-scale concurrency, orchestrate GPU workers, and provide seamless failover and retry. We value engineers who excel at designing robust infrastructure, implementing elegant distributed systems, and writing clean, maintainable code.


The Role

You will be the primary engineer designing and implementing a next-generation Job Scheduling & Distributed Computing platform. This includes everything from a fault-tolerant queue system to advanced load balancing, worker orchestration, real-time monitoring, and autoscaling. You’ll collaborate with product teams to ensure the platform can handle diverse workloads—such as ephemeral AI jobs, data processing, and high-priority tasks.

Key Responsibilities

  • Design & Build a job scheduling service:

    • Architect a robust queuing system (Redis, Postgres, or other) to track, schedule, and distribute jobs across multiple workers/GPUs.

    • Implement advanced features: priority scheduling, concurrency limits, retry logic, and timeouts.

  • Infrastructure & Reliability:

    • Ensure the system is highly available, fault tolerant, and horizontally scalable.

    • Introduce monitoring, alerting, and logging best practices for distributed workloads.

    • Automate provisioning, autoscaling, and failover in cloud environments (AWS, GCP, or similar).

  • Worker Orchestration:

    • Manage worker registration and capacity tracking.

    • Implement a load balancing strategy based on resource usage (GPU, CPU, memory).

    • Support ephemeral job “mailboxes,” streaming results to clients in real time.

  • System Integrations:

    • Collaborate with AI/ML teams to integrate inference workloads (e.g., GPU-intensive tasks) into the job scheduler.

    • Hook into existing deployment pipelines and internal tooling.

  • Performance & Observability:

    • Collect and analyze metrics for scheduling latency, queue lengths, job success/failure, and worker health.

    • Optimize throughput, minimize overhead, and detect performance bottlenecks early.
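As a rough illustration of the queuing behavior described above (priority scheduling, concurrency limits, retry logic), here is a minimal in-memory sketch in TypeScript. All names are illustrative, not Vizcom's actual API, and a production system would persist state in Redis or Postgres as noted above:

```typescript
// Minimal in-memory priority job queue with a concurrency limit and a
// per-job retry budget. Illustrative only -- no persistence or timeouts.

type Job = {
  id: string;
  priority: number;     // higher runs first
  attemptsLeft: number; // retry budget
  run: () => Promise<void>;
};

class PriorityJobQueue {
  private pending: Job[] = [];
  private running = 0;

  constructor(private maxConcurrency: number) {}

  enqueue(job: Job): void {
    this.pending.push(job);
    // Keep highest-priority jobs at the front.
    this.pending.sort((a, b) => b.priority - a.priority);
    this.drain();
  }

  private drain(): void {
    while (this.running < this.maxConcurrency && this.pending.length > 0) {
      const job = this.pending.shift()!;
      this.running++;
      job.run()
        .catch(() => {
          // On failure, re-enqueue while the retry budget lasts.
          if (--job.attemptsLeft > 0) this.enqueue(job);
        })
        .finally(() => {
          this.running--;
          this.drain();
        });
    }
  }
}
```

A real scheduler would add per-job timeouts, backoff between retries, and durable state so jobs survive a process restart.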
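The resource-based load balancing described under Worker Orchestration might, in its simplest form, score each worker by utilization and route the job to the one with the most headroom. The field names and weights below are assumptions for illustration, not a real Vizcom schema:

```typescript
// Pick a worker by resource headroom. Weights are illustrative; GPU is
// weighted heaviest on the assumption that it is the scarce resource
// for inference jobs.

type WorkerInfo = {
  id: string;
  gpuUtil: number; // fraction in use, 0..1
  cpuUtil: number; // 0..1
  memUtil: number; // 0..1
};

// Lower score = more headroom.
function loadScore(w: WorkerInfo): number {
  return 0.6 * w.gpuUtil + 0.25 * w.cpuUtil + 0.15 * w.memUtil;
}

function pickWorker(workers: WorkerInfo[]): WorkerInfo | undefined {
  return workers.reduce<WorkerInfo | undefined>(
    (best, w) =>
      best === undefined || loadScore(w) < loadScore(best) ? w : best,
    undefined
  );
}
```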

About You

  • 5+ years of experience in backend or infrastructure engineering with a focus on distributed systems or HPC (high-performance computing).

  • Deep knowledge of concurrency patterns, job queues, or pub/sub frameworks (e.g., BullMQ, RabbitMQ, Kafka, or custom solutions).

  • Cloud Expertise: Comfortable deploying containerized services (Docker/Kubernetes) on AWS, GCP, or Azure. Knowledge of IaC (Pulumi, Terraform, or CDK) is a plus.

  • Database & Caching: Skilled with SQL/NoSQL. Familiarity with in-memory datastores like Redis for real-time queueing.

  • Programming: Proficient in Node.js/TypeScript (or similar backend language). Strong coding skills, comfortable writing production-grade code, testable components, and microservices.

  • Scalable Infra: Track record of designing and running highly scalable, resilient backends. Experience with autoscaling GPU or HPC clusters is a huge bonus.

  • Monitoring & DevOps: Good grasp of logging, metrics (Datadog, Prometheus, Grafana), and CI/CD pipelines.
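As a concrete (hypothetical) example of the metrics work the role involves, scheduling latency percentiles and job success rate can be tracked in-process before being exported to Datadog or Prometheus. Class and method names here are illustrative:

```typescript
// Tiny in-process metrics tracker for a job scheduler. A real
// deployment would export these to Prometheus/Datadog rather than
// keep raw samples in memory.

class SchedulerMetrics {
  private latenciesMs: number[] = [];
  private succeeded = 0;
  private failed = 0;

  recordJob(latencyMs: number, ok: boolean): void {
    this.latenciesMs.push(latencyMs);
    ok ? this.succeeded++ : this.failed++;
  }

  // p-th percentile (0..100) of observed scheduling latency.
  percentile(p: number): number {
    if (this.latenciesMs.length === 0) return 0;
    const sorted = [...this.latenciesMs].sort((a, b) => a - b);
    const idx = Math.min(
      sorted.length - 1,
      Math.floor((p / 100) * sorted.length)
    );
    return sorted[idx];
  }

  successRate(): number {
    const total = this.succeeded + this.failed;
    return total === 0 ? 1 : this.succeeded / total;
  }
}
```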

Nice to Have

  • GPU / ML: Experience orchestrating GPU-intensive jobs, integrating with frameworks like PyTorch or TensorFlow.

  • Event-Driven: Familiarity with tRPC, GraphQL, or gRPC for real-time or streaming data flows.

  • Security & Networking: Knowledge of API token management, service-to-service security, TLS termination, etc.

  • Autoscaling: Practical experience building or tuning an autoscaler.
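A toy version of the autoscaling logic mentioned above: derive a desired worker count from queue depth. The thresholds, bounds, and function name are illustrative assumptions:

```typescript
// Sketch of a queue-depth-based autoscaling decision. A production
// autoscaler would also consider in-flight jobs, scale-down cooldowns,
// and GPU provisioning latency.

function desiredWorkers(
  queueDepth: number,
  jobsPerWorker = 10, // target backlog each worker can absorb
  min = 1,
  max = 100
): number {
  const target = Math.ceil(queueDepth / jobsPerWorker);
  // Clamp to the allowed fleet size.
  return Math.max(min, Math.min(max, target));
}
```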

What We Offer

  • Ownership & Impact: You’ll design a critical system used by the entire organization—your code is the backbone of large-scale AI/ML workflows.

  • Cutting-Edge Stack: Work with GPU clusters, ephemeral job management, real-time scheduling, and advanced cloud infra.

  • Flexible Work Environment: Remote-friendly culture, flexible hours, and supportive of personal development.

  • Compensation & Benefits: Competitive salary, equity, healthcare, and an allowance for home office or co-working space.

  • Growth Opportunities: Leadership track potential—help define the engineering culture and best practices for years to come.