Bundled vLLM
Srasta deploys a GPU-backed OpenAI-compatible vLLM runtime on the selected GPU host.
Inference & Model Routing
Srasta separates the operator decision from the runtime plumbing. Teams choose where inference runs and what each persona needs; Srasta turns that into governed LiteLLM routes, vLLM or Ollama runtime settings, fallback behavior, model-access controls, and smoke tests.
Routing Path
A caller can ask for coding, business, general, or fast.
Those are stable model aliases. The backend behind each alias can be local GPU inference, CPU fallback,
external self-hosted inference, or a hosted API.
Provider Classes
The installer treats inference as a deployment decision. Local inference keeps prompts and responses inside the Srasta cluster. External inference can be self-hosted or hosted, but it creates an explicit data-egress decision that operators must acknowledge.
Srasta deploys a GPU-backed OpenAI-compatible vLLM runtime on the selected GPU host.
CPU-only fallback for trials, dev, smoke tests, and low-resource installations.
Operator-provided vLLM, NVIDIA NIM, or generic OpenAI-compatible endpoint.
Provider APIs such as Anthropic, OpenAI, Hugging Face, Together, or Fireworks through LiteLLM.
Personas
Routes to coding-capable models and requires tool-call parser correctness for agentic workflows.
Optimized for multi-document reasoning, summaries, structured outputs, and decision support.
Absorbs everyday chat and general-purpose workloads with a quality/latency balance.
Lightweight route used for quick responses, fallback chains, and constrained installations.
Recommendation Engine
The installer does more than ask whether a model can load. It evaluates fit against the operator's environment: GPU and RAM, inference class, deployment intent, expected concurrency, latency and quality targets, and cost constraints for hosted providers.
Runtime Contracts
vLLM agentic tool calls require the right parser for the model family. Srasta tracks this per model so tool calls do not silently degrade into plain text.
Models with separate thinking output can declare the matching reasoning parser where supported.
TEI serves the default embedding path where supported. ARM or constrained deployments can fall back to Ollama embeddings.
Primary persona routes can fall back to a lightweight route such as fast when configured by the installer.
Governance
Whether a model is local, self-hosted, or hosted, access still enters through the same governed Srasta API path. Model routing is not a backdoor around identity, authorization, audit, rate limits, or license posture.
Roles can be granted explicit model access; unauthorized requests fail before execution.
Inference requests, failures, policy denials, and external-provider calls produce audit evidence.
External inference is called out because prompts and responses leave the Srasta cluster.
Recommendations are defaults; operators can override persona assignments with visible tradeoffs.
FAQ
Srasta supports local inference with bundled vLLM on GPU or Ollama on CPU, external self-hosted inference such as vLLM, NVIDIA NIM, or OpenAI-compatible endpoints, and hosted APIs such as Anthropic, OpenAI, Hugging Face, Together, or Fireworks.
LiteLLM is the unified inference gateway. Srasta generates model routes and persona aliases so callers can request models like coding, business, general, or fast while LiteLLM routes to the configured backend.
The installer uses hardware profile, intent, inference provider, personas, expected volume, and operator constraints to recommend models per persona. Operators can accept recommendations or override each persona.
Model requests still pass through Srasta API controls: identity, RBAC, per-role model whitelist, license posture, rate limits, audit, and downstream routing through LiteLLM or the configured provider.
Plan the Runtime
A useful Srasta deployment starts by mapping who will use the system, what work they do, where inference is allowed to run, and what latency, quality, cost, and data-boundary constraints matter.