Every team I talk to has the same story with Amazon Bedrock: the demo works in five minutes, production takes five weeks.
That's not because Bedrock is hard to use. It's because production AI infrastructure needs the same discipline as any other production system — identity, networking, observability, guardrails, and cost controls. Most teams skip those until something breaks or the bill surprises them.
Here's the Bedrock production setup I recommend after helping teams move from prototype to real workloads.
Start With Model Access and Region Strategy
Before you write a single line of application code, decide:
- Which regions you'll run in (model availability varies)
- Which foundation models you actually need (don't enable everything)
- Whether inference stays in one account or is split by environment
I usually keep dev, staging, and prod in separate AWS accounts. Bedrock model access is enabled per account and per region. Document which models are approved for each environment so engineers don't accidentally call expensive models from a dev sandbox.
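One way to make that documentation enforceable is to keep the approved list in code and check it before any call goes out. This is a minimal sketch, not a definitive implementation: the model IDs are placeholders, and `APPROVED_MODELS`, `is_model_approved`, and `require_approved` are names I made up for illustration.

```python
# Hypothetical per-environment allowlist. The model IDs are placeholders,
# not real Bedrock model identifiers; substitute the ones your team approved.
APPROVED_MODELS = {
    "dev": {"example.small-model-v1"},
    "staging": {"example.small-model-v1", "example.large-model-v1"},
    "prod": {"example.small-model-v1", "example.large-model-v1"},
}


def is_model_approved(environment: str, model_id: str) -> bool:
    """Return True only if model_id is on the allowlist for environment."""
    return model_id in APPROVED_MODELS.get(environment, set())


def require_approved(environment: str, model_id: str) -> str:
    """Return model_id unchanged, or raise before any request is sent."""
    if not is_model_approved(environment, model_id):
        raise PermissionError(
            f"{model_id!r} is not approved for the {environment!r} environment"
        )
    return model_id
```

A check like this only catches honest mistakes in application code. The IAM policy is still what actually stops an unapproved call.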
IAM: Least Privilege, Not "BedrockFullAccess"
The fastest way to create a security incident is giving your app broad Bedrock permissions. Instead, scope access tightly:
- Application role can invoke only approved models
- Separate roles for admin tasks (model access requests, guardrail management)
- No long-lived access keys in application runtimes — use IAM roles for EC2, ECS, EKS, or Lambda
- CloudTrail enabled for all Bedrock API calls
For multi-tenant apps, add an abstraction layer so each tenant's requests are tagged and auditable. Bedrock won't do tenant isolation for you — your application and IAM design must.
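To make "invoke only approved models" concrete, here is a rough sketch of a helper that generates the application role's identity policy. It assumes on-demand invocation of foundation models only, so the resource list contains nothing but foundation-model ARNs. The function name and the model ID in the usage line are hypothetical.

```python
import json


def build_invoke_policy(region: str, model_ids: list[str]) -> dict:
    """Build an IAM policy document that allows invoking only the listed models."""
    # Foundation-model ARNs have an empty account field.
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in sorted(set(model_ids))
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": resources,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder model ID, for illustration only.
    policy = build_invoke_policy("us-east-1", ["example.small-model-v1"])
    print(json.dumps(policy, indent=2))
```

Generating the policy from the approved-model list keeps the two from drifting apart. There is deliberately no wildcard in `Resource`.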
Networking: Keep Inference Off the Public Internet
If your workloads run inside a VPC (and they should), use VPC endpoints for Bedrock and related services. This keeps traffic on the AWS network and avoids routing model prompts through the public internet.
My baseline pattern:
- Private subnets for app and worker tiers
- VPC interface endpoints for both Bedrock Runtime (inference calls) and Bedrock (control-plane calls)
- Restrict egress with security groups and network ACLs
- Centralized DNS (Route 53 Resolver or equivalent) for endpoint resolution
For teams using EKS, I run inference services in dedicated node groups with tight security group rules. For simpler setups, ECS on Fargate with private networking works well too.
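As a sketch of the endpoint part of that baseline, the snippet below derives the two interface endpoint service names for a region and builds an endpoint policy that only lets named roles invoke models through the runtime endpoint. The helper names and the role ARN in the usage note are illustrative assumptions. Whether you restrict by principal at the endpoint, in IAM, or both is a design choice.

```python
def bedrock_endpoint_service_names(region: str) -> dict[str, str]:
    """Return the interface endpoint service names for Bedrock in a region."""
    return {
        "bedrock": f"com.amazonaws.{region}.bedrock",
        "bedrock-runtime": f"com.amazonaws.{region}.bedrock-runtime",
    }


def build_endpoint_policy(allowed_role_arns: list[str]) -> dict:
    """Build a VPC endpoint policy that allows invocation only by listed roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeFromApprovedRolesOnly",
                "Effect": "Allow",
                "Principal": "*",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": "*",
                # The wildcard principal is narrowed by this condition.
                "Condition": {
                    "ArnEquals": {"aws:PrincipalArn": sorted(allowed_role_arns)}
                },
            }
        ],
    }
```

For example, `build_endpoint_policy(["arn:aws:iam::111122223333:role/example-inference-app"])` (a made-up role) yields a document you could attach to the runtime endpoint from your IaC tool.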
Guardrails, Prompt Safety, and Data Handling
Production Bedrock isn't just model invocation — it's risk management. At minimum:
- Enable Amazon Bedrock Guardrails for content filtering and PII handling
- Log prompts and responses with redaction for sensitive fields
- Define retention policies (don't store raw prompts forever)
- Block prompt injection paths in your application layer
I also recommend a human review queue for high-risk actions (financial decisions, account changes, external communications). AI can draft; humans approve when the blast radius is large.
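For the logging-with-redaction item, here is a deliberately simple sketch of what "redact before you log" can look like in the application layer. The two regular expressions (email addresses and long digit runs) are my own assumptions about what counts as sensitive. They are nowhere near a complete PII detector and do not replace Guardrails.

```python
import re

# Simplistic patterns chosen for illustration; real PII detection needs more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Nine or more digits, optionally separated by single spaces or hyphens
# (card numbers, account numbers, phone numbers).
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){8,}\b")


def redact(text: str) -> str:
    """Replace email addresses and long digit runs with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def build_log_record(request_id: str, prompt: str, response: str) -> dict:
    """Build a log entry that never contains the raw prompt or response."""
    return {
        "request_id": request_id,
        "prompt": redact(prompt),
        "response": redact(response),
    }
```

Redacting at write time, rather than at read time, means the raw text never reaches the log store, which also makes retention policies easier to reason about.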
RAG Infrastructure That Doesn't Collapse Under Load
Most Bedrock production use cases involve retrieval-augmented generation (RAG). The infrastructure around retrieval matters as much as the model:
- Vector store: OpenSearch Serverless, Aurora pgvector, or another managed option
- Ingestion pipeline: S3 + Lambda/EventBridge, or batch jobs on ECS
- Chunking/versioning: Track document versions so answers don't come from stale content
- Cache layer: ElastiCache for repeated queries with short TTLs
I've seen teams spend all their time tuning prompts while their retrieval pipeline returns wrong chunks. Fix retrieval quality and latency first — then tune the model.
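Two of those items, version tracking and caching, are easy to sketch without committing to a particular vector store. The `Chunk` type, the `current_versions` mapping, and the `index_version` label below are hypothetical. The idea is only that every chunk carries its document version and that cache keys change whenever the index does.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    doc_version: int
    text: str


def drop_stale_chunks(chunks: list[Chunk], current_versions: dict[str, int]) -> list[Chunk]:
    """Keep only chunks that belong to the current version of their document."""
    return [c for c in chunks if current_versions.get(c.doc_id) == c.doc_version]


def cache_key(query: str, index_version: str) -> str:
    """Build a cache key that changes when the query or the index changes."""
    # Lowercase and collapse whitespace so trivially different queries share a key.
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(f"{index_version}:{normalized}".encode("utf-8")).hexdigest()
    return f"rag:{digest}"
```

Including the index version in the key means a re-ingestion naturally invalidates old cached answers, so the short TTL is a backstop and not the only protection against stale content.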
Observability: Measure Quality, Not Just Uptime
Traditional monitoring isn't enough for LLM workloads. Track:
- Latency (p50/p95/p99) per model and per endpoint
- Token usage and cost per request
- Error rates by model and by guardrail block reason
- Retrieval hit rate and chunk relevance signals
- User feedback (thumbs up/down) tied to request IDs
Pipe these metrics into your existing stack — CloudWatch, Prometheus/Grafana, or Datadog. If you're already investing in monitoring and observability, extend it for AI-specific signals instead of building a silo.
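To show what the latency and cost signals look like before they reach a dashboard, here is a small sketch that computes nearest-rank percentiles and a per-request cost. The prices in the usage note are made-up numbers, not actual Bedrock pricing, and the function names are my own.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Return the nearest-rank percentile (pct in 0-100) of a non-empty list."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one request, given per-1,000-token prices for input and output."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
```

With hypothetical prices of 0.003 per 1,000 input tokens and 0.015 per 1,000 output tokens, a request with 1,000 input and 500 output tokens costs about 0.0105. Emit that number alongside the request ID so it can be joined with latency and user feedback later.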
Cost Controls That Actually Work
On-demand Bedrock text inference bills per input and output token, so small mistakes add up fast. I implement:
- Per-service token budgets and alarms
- Model routing (cheap model for draft, expensive model for final)
- Request size limits and max output tokens
- Scheduled reviews of top-spend callers
One client cut inference spend by 42% just by enforcing max tokens and switching classification tasks to a smaller model. No quality loss on the tasks that mattered.
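A minimal sketch of how routing, output caps, and a token budget might fit together in application code follows. The task names, model IDs, and limits are all placeholders I picked for illustration. The real values should come from your own measurements.

```python
from dataclasses import dataclass

# Hypothetical routing table and output caps; tune both per workload.
ROUTES = {
    "classification": "example.small-model-v1",
    "draft": "example.small-model-v1",
    "final": "example.large-model-v1",
}
MAX_OUTPUT_TOKENS = {"classification": 16, "draft": 512, "final": 1024}


def plan_request(task: str, requested_max_tokens: int) -> tuple[str, int]:
    """Pick the model for a task and clamp the caller's max output tokens."""
    if task not in ROUTES:
        raise ValueError(f"unknown task: {task!r}")
    capped = max(1, min(requested_max_tokens, MAX_OUTPUT_TOKENS[task]))
    return ROUTES[task], capped


@dataclass
class TokenBudget:
    limit: int
    used: int = 0

    def charge(self, tokens: int) -> None:
        """Record usage, refusing any charge that would exceed the budget."""
        if self.used + tokens > self.limit:
            raise RuntimeError("token budget exceeded")
        self.used += tokens
```

Clamping in one place means a caller that asks for 4,000 output tokens on a classification task still gets the small model and a 16-token cap. The budget object is in-memory only. A shared counter would be needed across multiple workers.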
Infrastructure as Code: Make It Repeatable
Don't click-deploy your production Bedrock setup. Use Terraform (or CDK) for:
- IAM roles and policies
- VPC endpoints and security groups
- Guardrail configuration
- S3 buckets for knowledge bases
- CloudWatch alarms and dashboards
Keep model identifiers and environment-specific settings in variables. Promote changes through dev → staging → prod with the same pipeline you use for the rest of your infrastructure.
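One possible way to keep those settings in one reviewed place is to render a per-environment variables file that Terraform can read, since it accepts variable files in JSON form (`*.tfvars.json`). This is a sketch under assumptions: the variable names, regions, and model IDs below are invented and would need to match the variables your own modules declare.

```python
import json

# Hypothetical settings shared by every environment.
COMMON = {
    "enable_cloudtrail": True,
    "prompt_log_retention_days": 30,
}

# Hypothetical per-environment overrides.
ENVIRONMENTS = {
    "dev": {"region": "us-east-1", "approved_model_ids": ["example.small-model-v1"]},
    "staging": {
        "region": "us-east-1",
        "approved_model_ids": ["example.small-model-v1", "example.large-model-v1"],
    },
    "prod": {
        "region": "us-west-2",
        "approved_model_ids": ["example.small-model-v1", "example.large-model-v1"],
    },
}


def render_tfvars(environment: str) -> str:
    """Render the merged settings for one environment as tfvars JSON text."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    settings = {**COMMON, **ENVIRONMENTS[environment]}
    return json.dumps(settings, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_tfvars("dev"))
```

Plain hand-written `.tfvars` files work just as well. The point is only that model identifiers live in data that gets promoted through the pipeline, not in module code.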
Production Rollout Checklist
- Enable only required models per environment
- Lock down IAM and enable CloudTrail
- Configure VPC endpoints and private networking
- Deploy guardrails and logging with redaction
- Load-test retrieval and inference paths separately
- Set token/cost alarms before launch
- Document fallback behavior when a model or region is unavailable
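The last checklist item is easier to document once it exists as code. Here is one way the fallback order could be expressed, as a sketch under assumptions: the `invoke` callable is injected by the caller (in a real service it would wrap your Bedrock client), the region and model pairs are placeholders, and catching every exception is a simplification you would narrow to throttling and availability errors.

```python
from typing import Callable, Sequence

Target = tuple[str, str]  # (region, model_id)


class AllTargetsFailed(Exception):
    """Raised when every configured region/model pair has failed."""


def invoke_with_fallback(
    prompt: str,
    targets: Sequence[Target],
    invoke: Callable[[str, str, str], str],
) -> tuple[str, Target]:
    """Try each (region, model_id) in order; return the first success."""
    errors = []
    for region, model_id in targets:
        try:
            return invoke(region, model_id, prompt), (region, model_id)
        except Exception as exc:  # simplification: narrow this in real code
            errors.append(f"{region}/{model_id}: {exc}")
    raise AllTargetsFailed("; ".join(errors))
```

Returning the target that actually served the request lets you log how often the fallback path is taken, which is the number to watch after launch.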
Bedrock is powerful, but production success comes from the same fundamentals we apply to any critical service: secure access, reliable networking, measurable operations, and controlled costs.
Need help designing or hardening your Bedrock setup? Our AWS DevOps consulting team builds production-ready AI infrastructure — from IAM and networking to guardrails and observability. Get in touch and we'll help you ship it safely.