When teams say they want "AI infrastructure," they usually mean one of three things: calling foundation models, running their own models, or building retrieval pipelines around company data.
On AWS, those are different architectural paths — and mixing them without a plan leads to duplicated effort, security gaps, and runaway costs. Here's the setup I use when a team is ready to move beyond a prototype.
Pick Your Inference Pattern First
Before choosing services, decide how inference will run:
- Managed APIs (Bedrock): Fastest path, lowest ops overhead, best for most app teams
- Self-hosted models (SageMaker, EKS with GPU nodes): More control, higher ops burden, needed for custom fine-tuned models at scale
- Hybrid: Bedrock for general tasks, self-hosted for specialized models
Most startups and SaaS teams should start with Bedrock and add self-hosted inference only when they hit a clear limitation — cost at scale, custom model requirements, or strict latency needs.
Reference Architecture I Deploy Most Often
For a production-ready baseline on AWS:
- App tier: ECS/Fargate or EKS service handling API requests
- Model access: Amazon Bedrock for foundation model calls
- Knowledge layer: S3 for documents + ingestion workers
- Vector search: OpenSearch Serverless or Aurora PostgreSQL with pgvector
- Async jobs: SQS + Lambda/ECS for embedding and reindexing
- Secrets: AWS Secrets Manager for third-party API keys
- Observability: CloudWatch metrics/logs + tracing via X-Ray or OpenTelemetry
This keeps the critical path simple: API → retrieval → Bedrock → response. Everything else runs asynchronously.
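As a rough illustration of that critical path, here is a minimal Python sketch. The names `Chunk`, `answer`, `retrieve`, and `generate` are hypothetical and not part of any AWS SDK. In a real service, `retrieve` would wrap the vector store query and `generate` would wrap the Bedrock call; here they are injected as plain callables so the flow can be tested without AWS.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Chunk:
    text: str
    source: str  # assumed to be an identifier such as the S3 key of the document


def answer(
    question: str,
    retrieve: Callable[[str, int], Sequence[Chunk]],
    generate: Callable[[str], str],
    top_k: int = 4,
) -> dict:
    """Synchronous critical path: retrieval -> prompt -> model -> response."""
    chunks = retrieve(question, top_k)

    # Label each chunk with its source so the answer can be traced back later.
    context = "\n\n".join(f"[{c.source}]\n{c.text}" for c in chunks)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    return {
        "answer": generate(prompt),
        "sources": [c.source for c in chunks],
    }
```

Nothing in this function embeds or indexes documents. That work belongs in the asynchronous jobs, which is what keeps request latency predictable.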
Networking and Security Foundations
AI workloads often process sensitive data — support tickets, internal docs, customer information. Treat the network accordingly:
- Run services in private subnets
- Use VPC endpoints for Bedrock, S3, and Secrets Manager, plus your vector store where it offers one (OpenSearch Serverless does; Aurora already sits inside your VPC)
- Encrypt data in transit and at rest (KMS keys per environment)
- Apply least-privilege IAM per service — no shared "AI admin" role
Security isn't optional for AI stacks. If you're handling customer data, pair this with the practices we outline in our Security & DevSecOps work — threat modeling, audit trails, and dependency scanning still apply.
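To make the least-privilege point concrete, here is a small sketch that builds a per-service IAM policy document allowing only specific foundation models. The helper name `bedrock_invoke_policy` and the model ID `example-model-id` are placeholders I made up for illustration. Substitute the model IDs your account actually uses, and treat this as a starting point rather than a complete policy.

```python
def bedrock_invoke_policy(region: str, model_ids: list[str]) -> dict:
    """Build an IAM policy document that allows invoking only the listed models."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                # Foundation model ARNs carry no account ID segment.
                "Resource": [
                    f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
                    for model_id in model_ids
                ],
            }
        ],
    }


# Placeholder model ID, for illustration only.
policy = bedrock_invoke_policy("us-east-1", ["example-model-id"])
```

Each service gets its own role with a policy like this attached. A service that streams responses would also need `bedrock:InvokeModelWithResponseStream`.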
Data Pipeline: The Part Everyone Underestimates
Good AI infrastructure is mostly a data pipeline problem. I design ingestion with:
- Source connectors: S3 uploads, CRM exports, Confluence/Notion sync, database snapshots
- Normalization: Convert PDFs, HTML, and docs into clean text
- Chunking strategy: Size and overlap tuned per content type
- Embedding jobs: Batch processing with retry and dead-letter queues
- Versioning: Track source document version in vector metadata
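The chunking and versioning steps above can be sketched roughly as follows. This version splits on characters for simplicity, which is an assumption on my part; production pipelines often split on tokens or document structure instead. The metadata field names (`source`, `source_version`, `chunk_index`, `char_start`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    text: str
    metadata: dict


def chunk_document(
    text: str,
    source: str,
    version: str,
    size: int = 800,
    overlap: int = 100,
) -> list[ChunkRecord]:
    """Split text into overlapping windows and tag each with its source version."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap
    records = []
    for index, start in enumerate(range(0, len(text), step)):
        records.append(
            ChunkRecord(
                text=text[start:start + size],
                metadata={
                    "source": source,
                    "source_version": version,
                    "chunk_index": index,
                    "char_start": start,
                },
            )
        )
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + size >= len(text):
            break
    return records
```

The `size` and `overlap` defaults are arbitrary. The point is that they are parameters, so each content type can get its own values.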
When someone asks why the AI gave a wrong answer, the first place I look is retrieval metadata — not the model.
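One way to run that check is sketched below. It assumes each retrieved hit carries a `metadata` dict with `source` and `source_version` fields, and that the current version of each document can be looked up separately (for example, from the ingestion job's records). Both the field names and the function name `find_stale_hits` are hypothetical.

```python
def find_stale_hits(hits: list[dict], current_versions: dict) -> list[dict]:
    """Return metadata for retrieved chunks whose source changed or disappeared."""
    stale = []
    for hit in hits:
        meta = hit["metadata"]
        latest = current_versions.get(meta["source"])
        # latest is None when the source document no longer exists.
        if latest is None or latest != meta["source_version"]:
            stale.append({**meta, "latest_version": latest})
    return stale
```

If this returns anything for a bad answer, the fix is usually a reindex, not a prompt or model change.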
When to Add GPUs and Kubernetes
Move to GPU-backed inference on EKS or SageMaker when you need:
- Custom fine-tuned models with strict latency targets
- On-premises-like control over model runtime
- High-volume inference where per-token API pricing stops making sense
GPU infrastructure adds real operational cost: node scaling, driver compatibility, model serving frameworks, and capacity planning. Don't jump there on day one unless requirements force it.
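A back-of-the-envelope break-even calculation helps decide when per-token pricing "stops making sense." The sketch below is only arithmetic. The function names are mine, and every number in the usage lines is a made-up placeholder, not an AWS price. Plug in your own rates, and include an honest figure for the engineering time GPU operations will take.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_api_cost(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """Cost of serving the volume through a per-token managed API."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def monthly_gpu_cost(hourly_rate: float, instances: int, ops_overhead: float = 0.0) -> float:
    """Cost of always-on GPU instances plus a flat monthly operations overhead."""
    return hourly_rate * instances * HOURS_PER_MONTH + ops_overhead


def break_even_tokens(gpu_monthly: float, price_per_1k_tokens: float) -> float:
    """Monthly token volume at which API spend equals the GPU bill."""
    if price_per_1k_tokens <= 0:
        raise ValueError("price_per_1k_tokens must be positive")
    return gpu_monthly / price_per_1k_tokens * 1000


# Placeholder inputs, for illustration only.
gpu = monthly_gpu_cost(hourly_rate=1.5, instances=2, ops_overhead=1000.0)
threshold = break_even_tokens(gpu, price_per_1k_tokens=0.002)
```

This ignores whether the chosen instances can actually serve that volume within your latency target, so treat the threshold as a lower bound on when to start evaluating, not a decision on its own.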
CI/CD for AI Infrastructure
AI systems need the same deployment discipline as any production service. In practice that means:
- Infrastructure changes through Terraform/CDK pipelines
- Application deploys through your existing CI/CD pipeline
- Prompt/template changes versioned in Git
- Pre-production evaluation tests (golden questions with expected quality thresholds)
- Canary releases for model or retrieval changes
Teams that treat prompt changes as "config tweaks" with no testing are the ones surprised by regressions on launch day.
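A golden-question check can start very small. The sketch below scores answers by required-phrase matching, which is a crude proxy I chose for illustration; real suites often add semantic similarity or model-graded scoring. `GoldenCase`, `score_answer`, `run_eval`, and the `ask` callable (which would wrap your RAG endpoint) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    question: str
    must_include: list[str]  # phrases a good answer has to contain


def score_answer(answer: str, case: GoldenCase) -> float:
    """Fraction of required phrases present in the answer (case-insensitive)."""
    if not case.must_include:
        return 1.0
    text = answer.lower()
    found = sum(1 for phrase in case.must_include if phrase.lower() in text)
    return found / len(case.must_include)


def run_eval(
    cases: list[GoldenCase],
    ask: Callable[[str], str],
    threshold: float = 0.8,
) -> dict:
    """Run every golden question and compare the mean score to a threshold."""
    scores = [score_answer(ask(c.question), c) for c in cases]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {
        "mean_score": mean,
        "passed": mean >= threshold,
        # Questions with any missing phrase, for a human to review.
        "failures": [c.question for c, s in zip(cases, scores) if s < 1.0],
    }
```

Wired into the pipeline, a prompt or retrieval change that drops `passed` to false blocks the deploy the same way a failing unit test would.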
Monitoring AI Systems in Production
Track both system health and answer quality:
- System: latency, error rate, queue depth, GPU utilization, token usage
- Quality: retrieval precision signals (for example, how often retrieved chunks are judged relevant), human feedback, escalation rate to support
- Cost: spend per feature, per customer, per model
Build dashboards for engineering and product teams. Engineers need latency and error data. Product needs quality and cost per successful task completion.
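Cost per successful task is simple to compute once each request is logged with its cost and outcome. A minimal sketch follows, assuming each event is a dict with hypothetical `feature`, `cost_usd`, and `success` fields; how "success" is defined is up to the product team.

```python
from collections import defaultdict


def cost_per_successful_task(events: list[dict]) -> dict:
    """Total spend per feature divided by its number of successful tasks."""
    spend = defaultdict(float)
    successes = defaultdict(int)
    for event in events:
        # Failed requests still cost money, so they count toward spend.
        spend[event["feature"]] += event["cost_usd"]
        if event["success"]:
            successes[event["feature"]] += 1

    # None marks features that spent money without a single success.
    return {
        feature: (spend[feature] / successes[feature] if successes[feature] else None)
        for feature in spend
    }
```

Dividing by successes rather than by requests is deliberate: a feature that fails half the time is twice as expensive per useful result as its raw token bill suggests.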
Common Mistakes I See
- Jumping to GPUs too early: Expensive and operationally heavy
- No retrieval versioning: Stale answers from outdated documents
- Shared credentials: One API key used by every service
- Missing fallbacks: Single model dependency with no backup route
- Ignoring costs until month-end: Token spend is measurable per request, so a surprise bill means nobody was watching it
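For the missing-fallback case, the fix can be as small as an ordered list of model routes. This is a minimal sketch with hypothetical names; each callable would wrap one model or provider, and a production version would likely add timeouts and catch narrower exception types.

```python
from typing import Callable


def generate_with_fallback(
    prompt: str,
    routes: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) route in order; return the first success."""
    errors = []
    for name, call in routes:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all model routes failed: " + "; ".join(errors))
```

Returning the route name alongside the text lets you log how often the backup is used, which is itself a useful health signal.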
Practical Rollout Plan (30/60/90 Days)
First 30 days: Bedrock prototype with IAM, VPC endpoints, basic RAG, and logging.
By 60 days: Guardrails, automated ingestion, cost alarms, staging environment parity.
By 90 days: CI/CD for infra and prompts, quality evaluation suite, on-call runbooks for inference failures.
AI infrastructure on AWS isn't one service — it's a system. Bedrock is often the right inference layer, but the surrounding platform (data, security, monitoring, deployment) is what makes it production-ready.
If you're building AI infrastructure and want a second pair of eyes on architecture, cost, or security, our AWS DevOps team helps teams design and operate production AI platforms. Drop us a line — happy to review your setup.