About the role
You'll build and optimize the serving stack that makes Vikasit Inference fast, reliable, and affordable at India scale: kernels, batching, caching, and the OpenAI-compatible API surface developers love.
What you'll do
- Optimize throughput and latency across our model fleet (vLLM / SGLang / TensorRT-LLM)
- Build batching, KV-cache, quantization, and routing for MoE models
- Own reliability, autoscaling, and cost of the inference platform
- Keep the OpenAI-compatible API rock-solid for streaming and tool calls
What we're looking for
- Deep systems engineering (GPU, CUDA-adjacent, or high-performance serving)
- Experience with an LLM serving framework in production
- Comfort owning latency, throughput, and cost SLOs
Nice to have
- CUDA / Triton kernels
- Quantization (GGUF, AWQ, FP8)
- Multi-region infra
Sound like you?
We hire for skill over credentials. Tell us why you're a fit; links and projects are welcome.