Governed Deployment

Deploy the AI control plane, not another model server.

Srasta runs inside your infrastructure with installation plans, host inventory, runtime verification, model routing, memory boundaries, audit evidence, rollback, and sizing guidance for production workloads.

Request a Deployment Sizing Session → See Runtime Controls →

Use This Page in 4 Steps

If you are evaluating quickly, follow this sequence to move from governance requirements to a practical deployment recommendation.

Runtime Controls Before Hardware Specs

The first deployment question is not just GPU size. It is whether enterprise operators can verify, govern, evaluate, and recover the AI runtime.

Install and Placement Truth

Plan/run state, host inventory, topology, placement, preflight checks, and smoke verification make deployment state explicit.

Model and Memory Boundaries

Requests route through approved models and scoped memory collections so teams can control what context is available.
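As a rough illustration of that boundary, the sketch below checks a request against a per-team allowlist before any retrieval happens. It is not Srasta's API: `TeamPolicy`, `authorize`, the model name, and the collection names are all hypothetical, chosen only to show the shape of the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamPolicy:
    """Hypothetical per-team policy: allowed models and memory collections."""
    approved_models: frozenset
    memory_scopes: frozenset


class PolicyError(Exception):
    """Raised when a request asks for something outside its team's policy."""


def authorize(policy: TeamPolicy, model: str, collections: list) -> list:
    """Return the collections to search, or raise if the request is out of bounds."""
    if model not in policy.approved_models:
        raise PolicyError(f"model not approved: {model}")
    denied = [c for c in collections if c not in policy.memory_scopes]
    if denied:
        raise PolicyError(f"collections out of scope: {denied}")
    return list(collections)


# Hypothetical example policy for a finance team.
finance = TeamPolicy(
    approved_models=frozenset({"general-30b"}),
    memory_scopes=frozenset({"finance-policies", "company-handbook"}),
)
```

Calling `authorize(finance, "general-30b", ["hr-records"])` raises `PolicyError`, so the out-of-scope collection is never queried. Failing closed before retrieval is the design choice that matters here; the data model is incidental.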

Evaluated AI Behavior

Observe prompt quality, retrieval behavior, memory drift, policy decisions, and compliance-rule outcomes before scaling access.

Recovery and Evidence

Verification, reset, rollback, backup, release identity, and audit trails give operators a path to prove and recover the runtime.

Deployment Models

Srasta does not require public SaaS hosting. All components operate inside your controlled infrastructure.

On-Prem

GPU servers in your data centre. Full control, zero external dependencies. Recommended for regulated industries and air-gapped requirements.

Private Cloud

AWS, Azure, or GCP, deployed inside your VPC. No data leaves your cloud account. Supports GPU instance types on all three providers.

Self-Hosted Private Cloud

VMware, OpenStack, or Proxmox on your existing data centre infrastructure. Srasta deploys via Docker Compose or Kubernetes.

Hybrid

On-prem inference combined with cloud integrations. Run your models on owned hardware while connecting to cloud-based storage or services.

Reference Evaluation Deployment

A practical lab configuration for demonstrating private inference, governed retrieval, and operator workflows. Final production architecture depends on workload, concurrency, compliance scope, and availability targets.

Reference Hardware
Platform: NVIDIA DGX Spark
Architecture: ARM64 / aarch64
Model: 30B parameter class (FP8)
Inference engine: vLLM (production-grade)
Embeddings: Local embedding model
Vector store: Milvus (hybrid search)

Use this as a reference point for evaluation and demo planning, not a universal production recommendation. Srasta deployment scoping confirms the right hardware, topology, and controls for the customer environment.
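To see why precision matters for a 30B-class model, a back-of-envelope weight calculation helps. The sketch below assumes memory for weights is simply parameter count times bytes per parameter. It deliberately ignores KV cache, activations, and runtime overhead, so treat the result as a floor, not a sizing answer. The function name and the precision table are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for model weights alone, in GB (10^9 bytes).

    params_billions * 1e9 params * bytes_per_param / 1e9 bytes-per-GB
    simplifies to params_billions * bytes_per_param.
    """
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in ("fp16", "fp8"):
        print(precision, weight_memory_gb(30, precision), "GB")
```

Under this assumption a 30B model needs about 30 GB for weights at FP8 and about 60 GB at FP16, which is why the reference configuration pairs that model class with FP8.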

Hardware Requirements by Tier

Sizing scales with subscription tier and workload concurrency.

Foundation Tier
Knowledge assistant · Policy lookup · Internal document search · Low concurrency
CPU: 8–16 vCPU
RAM: 32–64 GB
Storage: 500 GB – 1 TB SSD
GPU: Recommended. CPU inference viable for low-concurrency evaluation only; expect significantly higher latency.
Cloud examples: AWS c6i / m6i · Azure D-series · GCP n2-standard

Enterprise Plus
Multi-team deployments · Compliance-sensitive environments · Dedicated model routing · High concurrency
CPU: 32+ cores
RAM: 128–256 GB
Storage: 2 TB+ NVMe (RAID for HA)
GPU: Multi-GPU cluster (A100 / H100 / L40S). Dedicated per-tenant model instances available.
Supports: Dedicated model instances · Tenant isolation · High-availability configuration · Kubernetes scaling
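A preflight check against these numbers can be sketched in a few lines. The example below compares reported host facts with the lower bounds of the Foundation Tier ranges. It is an assumption-laden illustration, not Srasta's preflight tooling: the function names and the `host` dictionary layout are hypothetical, and RAM is left to the inventory source because the Python standard library has no portable way to read it.

```python
import os
import shutil

# Lower bounds of the Foundation Tier ranges listed above.
FOUNDATION_MINIMUMS = {"vcpus": 8, "ram_gb": 32, "storage_gb": 500}


def preflight(host: dict, minimums: dict) -> list:
    """Return human-readable failures; an empty list means the host passes."""
    failures = []
    for key, required in minimums.items():
        actual = host.get(key)
        if actual is None:
            failures.append(f"{key}: not reported")
        elif actual < required:
            failures.append(f"{key}: have {actual}, need at least {required}")
    return failures


def local_host_facts(path: str = "/") -> dict:
    """Collect the facts the standard library can measure portably."""
    return {
        "vcpus": os.cpu_count(),  # may be None on unusual platforms
        "storage_gb": shutil.disk_usage(path).total // 10**9,
    }


if __name__ == "__main__":
    # ram_gb is absent here, so it is reported as a failure, not assumed.
    for failure in preflight(local_host_facts(), FOUNDATION_MINIMUMS):
        print("FAIL", failure)
```

Treating a missing fact as a failure keeps the check honest: a host that has not reported its RAM is not assumed to have enough.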

Model Sizing Guidance

7B – 14B
Foundation Tier

Lower VRAM requirements. Suitable for knowledge retrieval, document Q&A, and policy lookup. Runs on entry-level GPU hardware.

MoE
Enterprise Plus

Mixture-of-Experts architectures require higher VRAM and benefit from multi-GPU routing. Srasta supports model pooling and hybrid routing strategies.

Production Operating Practices

For production deployments, reliability and governance need to be designed together.

Separate inference from RAG ingestion

Running them separately prevents ingestion workloads from degrading active inference latency.

Dedicated vector database node

Isolate the vector store to its own node for reliable search performance at scale.

Evaluation and observability enabled from day one

Track prompt quality, retrieval behavior, memory drift, policy decisions, compliance rules, latency, and cost before scaling.

Multi-AZ for cloud deployments

Distribute across availability zones for resilience in AWS, Azure, and GCP environments.

Backup and restore configuration

Snapshot vector collections, knowledge ingestion pipelines, and governance configuration on a scheduled basis.

Kubernetes for horizontal scaling

Optional but recommended for Enterprise Plus deployments with unpredictable concurrency.

Hardware Sizing Estimator

Answer three questions to get an indicative configuration. For accurate sizing, schedule a deployment session with our team.

Interactive Calculator

Estimate Your Deployment Footprint

Select one option per step. Your estimated infrastructure appears after the third answer.

Step 1 of 3
What is your primary use case?
Step 2 of 3
Where will Srasta run?
Step 3 of 3
How many active users do you expect?
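For readers without access to the interactive calculator, the three answers can be mapped to a tier with logic along these lines. This is a guess at the shape of such an estimator, not the calculator's actual rules: the use-case labels and the 50-user cut-off are hypothetical placeholders, and only the tier specifications are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answers:
    use_case: str      # hypothetical labels, e.g. "policy-lookup" or "multi-team"
    environment: str   # e.g. "on-prem", "private-cloud", "hybrid"
    active_users: int


# Tier specifications as listed under Hardware Requirements by Tier.
TIERS = {
    "Foundation Tier": {"cpu": "8–16 vCPU", "ram": "32–64 GB", "storage": "500 GB – 1 TB SSD"},
    "Enterprise Plus": {"cpu": "32+ cores", "ram": "128–256 GB", "storage": "2 TB+ NVMe"},
}

# Hypothetical assumptions: this page does not state where the boundary sits.
FOUNDATION_USE_CASES = {"knowledge-assistant", "policy-lookup", "document-search"}
FOUNDATION_MAX_USERS = 50


def estimate(answers: Answers) -> dict:
    """Return an indicative tier and its published hardware ranges."""
    small = (
        answers.use_case in FOUNDATION_USE_CASES
        and answers.active_users <= FOUNDATION_MAX_USERS
    )
    tier = "Foundation Tier" if small else "Enterprise Plus"
    return {"tier": tier, "environment": answers.environment, **TIERS[tier]}
```

For example, `estimate(Answers("policy-lookup", "on-prem", 20))` selects the Foundation Tier under these assumptions. Real sizing should come from a deployment session.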

Deployment FAQ for Enterprise AI Teams

Common questions from infrastructure, security, and engineering leaders.

How do we deploy AI on-prem for enterprise use?

Start with private deployment, identity-aware policy gates, secure RAG, and full audit logging. Srasta is designed for this model from day one.

What is the best secure AI deployment model for regulated industries?

On-prem or private cloud with strict access controls, audit trails, and data residency constraints is typically the best fit for regulated organisations. This is why teams evaluate Srasta as an AI governance platform for regulated companies.

Can we run a private RAG platform on our own infrastructure?

Yes. Srasta supports private vector search, local embedding pipelines, and governed retrieval so enterprise knowledge never leaves your environment.

How should we size GPUs for enterprise AI workloads?

It depends on concurrency, model size, and use case. Use the sizing estimator for baseline planning, then confirm the result during readiness and pilot scoping.
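Concurrency matters because each in-flight request holds its own KV cache on the GPU, on top of the model weights. The sketch below uses the standard transformer estimate (keys plus values, per layer, per KV head) and assumes the worst case where every sequence fills its full context. The model shape in the example is hypothetical, not a statement about any specific model Srasta runs.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(
    per_token_bytes: int, context_tokens: int, concurrent_sequences: int
) -> float:
    """Worst-case KV cache in GB when every sequence uses its full context."""
    return per_token_bytes * context_tokens * concurrent_sequences / 1e9


if __name__ == "__main__":
    # Hypothetical shape: 48 layers, 8 KV heads, head dimension 128, FP16 cache.
    per_token = kv_cache_bytes_per_token(48, 8, 128, 2)
    for users in (1, 16, 64):
        print(users, "sequences:", round(kv_cache_gb(per_token, 8192, users), 2), "GB")
```

The cache grows linearly with both context length and concurrent sequences, so doubling active users can cost as much memory as a larger model would. Engines such as vLLM manage this memory in pages, so real usage is usually below this upper bound.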

Not sure which configuration fits your environment?

We scope every deployment around the controls that matter: model selection, memory boundaries, evaluation needs, compliance rules, GPU sizing, cloud vs on-prem trade-offs, and recovery expectations.

Request a Deployment Sizing Session →