Cloud Inference Engineer
Luminal · San Francisco, CA, US
$150k - $350k
On-site
Full-time
Mid
Job Description
Qualifications
- CUDA + GPU inference optimization
- vLLM, SGLang, or TensorRT-LLM experience
- KV caching, paged attention, batching, token streaming, etc.
- Distributed computing (experience with GPUs is a big plus)
- No degree required
Company
Luminal (YC S25) builds an AI compiler and serving stack that makes models 10x faster and production-ready with one line of code.
Role
Founding role, on-site in downtown SF. Ship low-latency, high-throughput model serving on Luminal Cloud.
Day-to-day responsibilities:
- Deploy and tune models with optimizations like KV caching, paged attention, sequence packing, etc.
- Conduct model performance reviews
- Improve the scheduler, batcher, and autoscaling; profile latency, cost, and utilization
- Occasionally write kernels and, yes, do some tasteful shitposting