Azure AKS · GitHub Actions · FastAPI · Python · Terraform IaC · Production-Grade E-Commerce
SHOPWAVE
A complete, production-grade e-commerce platform built with Python FastAPI microservices, deployed on Azure Kubernetes Service via GitHub Actions, with the entire infrastructure provisioned by Terraform. From zero to live, nothing skipped.
February 2026 · 40 min read · Python 3.12 + FastAPI · Azure AKS + ACR + PostgreSQL · Terraform 1.7+ · GitHub Actions
This is not a tutorial with placeholder code. Every file, pipeline, and configuration in this walkthrough is real, production-ready, and battle-tested. By the end you will have a complete monorepo containing four Python FastAPI microservices, a full Azure infrastructure stack provisioned with Terraform, and GitHub Actions pipelines that take code from a developer's laptop to a live Kubernetes cluster, with security scanning, automated testing, canary deployments, and instant rollback capability built in.
We are building ShopWave: a modern e-commerce backend with an API Gateway, Auth Service, Product Service, and Order Service. Each service is independently deployable, containerised, and operated through a fully automated CI/CD pipeline. The entire Azure infrastructure (AKS cluster, Container Registry, PostgreSQL databases, Key Vault, private networking, and GitHub OIDC federation) is declared as Terraform code and applied through its own dedicated pipeline.
ShopWave: Full System Architecture
[Figure: ShopWave full system architecture. Client HTTPS traffic enters the API Gateway (nginx ingress → FastAPI, :8000), which rate-limits, authenticates, and routes to the Auth Service (:8001), Product Service (:8002), and Order Service (:8003). Each service owns its own database (auth_db, products_db, orders_db) on Azure PostgreSQL Flexible Server, reached over SSL through a private endpoint. Azure Key Vault holds DB passwords, JWT secrets, and API keys; ACR stores container images; a 3-node AKS cluster (Standard_D2s_v3) runs staging and production namespaces; GitHub Actions drives 7 CI/CD/Terraform pipelines; Azure Monitor provides Prometheus, Grafana, App Insights, and alerting.]

API Gateway :8000
Single entry point. Routes requests, enforces rate limits, validates JWT tokens from auth-service.
GET /health
ANY /api/v1/*
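The gateway's rate limiting (rate_limit.py in the tree below) is Redis-backed so limits hold across replicas; the underlying algorithm is a token bucket. Here is a dependency-free sketch of the idea (illustrative, not the service's actual code; the real middleware keys buckets in Redis per client):

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float                       # maximum burst size
    rate: float                           # tokens refilled per second
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token if available
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per client key (the real gateway keeps these in Redis, not in memory)
buckets: dict[str, TokenBucket] = {}

def allow_request(client_ip: str, capacity: float = 5, rate: float = 1.0) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(capacity, rate))
    return bucket.allow()
```

A request is rejected with HTTP 429 when allow_request returns False; swapping the dict for Redis (with the refill-and-spend done atomically, e.g. in a Lua script) makes the limit cluster-wide.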
Auth Service :8001
User registration, login, JWT issuance, Azure AD SSO integration, password hashing with bcrypt.
POST /auth/register
POST /auth/login
POST /auth/refresh
Product Service :8002
Product catalog, search, inventory management, category hierarchy, image metadata.
GET /products
GET /products/{id}
POST /products
Order Service :8003
Cart management, checkout flow, order lifecycle, payment intent creation, fulfilment tracking.
POST /orders
GET /orders/{id}
PATCH /orders/{id}/status
SECTION 01
PROJECT STRUCTURE & MONOREPO LAYOUT
Every file, every directory; nothing hand-wavy.
ShopWave lives in a single monorepo. GitHub Actions uses path-based filtering to detect which services changed and runs only the relevant pipelines, so deploying a hotfix to order-service does not trigger a rebuild of auth-service. A separate terraform.yml pipeline manages all Azure infrastructure changes, keeping application and infrastructure deployments completely independent.
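The terraform.yml workflow is described in the tree below as "plan on PR · apply on dispatch"; a trigger block consistent with that description would look roughly like this (a sketch, since the full infrastructure pipeline is outside this section):

```yaml
name: Terraform
on:
  pull_request:
    branches: [main]
    paths:
      - "infrastructure/terraform/**"   # only infra changes trigger a plan
  workflow_dispatch:                    # apply is always an explicit, human-initiated run
    inputs:
      action:
        description: "plan or apply"
        type: choice
        options: [plan, apply]
```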
shopwave/
├── .github/
│   └── workflows/
│       ├── ci-auth-service.yml        ← CI for auth-service (PR + merge)
│       ├── ci-product-service.yml     ← CI for product-service
│       ├── ci-order-service.yml       ← CI for order-service
│       ├── ci-api-gateway.yml         ← CI for api-gateway
│       ├── cd-staging.yml             ← Deploy ALL changed services → staging
│       ├── cd-production.yml          ← Promote staging → production (manual gate)
│       └── terraform.yml              ← Infrastructure CI/CD (plan on PR · apply on dispatch)
│
├── services/
│   ├── api-gateway/
│   │   ├── app/
│   │   │   ├── main.py                ← FastAPI app entrypoint
│   │   │   ├── routes/
│   │   │   │   └── proxy.py           ← Reverse proxy logic
│   │   │   ├── middleware/
│   │   │   │   ├── auth.py            ← JWT validation middleware
│   │   │   │   └── rate_limit.py      ← Redis-backed rate limiter
│   │   │   └── config.py              ← Pydantic settings
│   │   ├── tests/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── pyproject.toml
│   │
│   ├── auth-service/
│   │   ├── app/
│   │   │   ├── main.py
│   │   │   ├── api/v1/
│   │   │   │   ├── auth.py            ← Login / register / refresh
│   │   │   │   └── users.py           ← User CRUD
│   │   │   ├── core/
│   │   │   │   ├── security.py        ← JWT + bcrypt
│   │   │   │   └── config.py
│   │   │   ├── db/
│   │   │   │   ├── models.py          ← SQLAlchemy ORM models
│   │   │   │   └── session.py         ← Async DB session
│   │   │   └── schemas/auth.py        ← Pydantic schemas
│   │   ├── alembic/versions/
│   │   │   └── 001_init.py
│   │   ├── tests/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── pyproject.toml
│   │
│   ├── product-service/               ← same structure as auth-service
│   └── order-service/                 ← same structure as auth-service
│
├── infrastructure/
│   └── terraform/
│       ├── main.tf                    ← Root: wires all modules + OIDC federation
│       ├── variables.tf               ← All input variables with validation
│       ├── outputs.tf                 ← GitHub Secrets summary + resource IDs
│       ├── terraform.tfvars.example   ← Copy → terraform.tfvars to configure
│       ├── modules/
│       │   ├── networking/            ← VNet · subnets · NSGs · private DNS zone
│       │   ├── aks/                   ← AKS cluster · node pools · Log Analytics
│       │   ├── acr/                   ← Azure Container Registry
│       │   ├── postgres/              ← PostgreSQL Flexible Server · databases
│       │   └── keyvault/              ← Azure Key Vault · seeds JWT + origins
│       ├── environments/
│       │   ├── staging/staging.tfvars
│       │   └── production/production.tfvars
│       └── scripts/
│           └── bootstrap-tfstate.sh   ← One-time: creates remote state backend
│
├── helm/
│   └── shopwave/
│       ├── Chart.yaml
│       ├── values.yaml                ← default values
│       ├── values-staging.yaml        ← staging overrides
│       ├── values-production.yaml     ← production overrides
│       └── templates/
│           ├── deployment.yaml
│           ├── service.yaml
│           ├── ingress.yaml
│           ├── hpa.yaml               ← Horizontal Pod Autoscaler
│           ├── pdb.yaml               ← Pod Disruption Budget
│           └── externalsecret.yaml    ← Azure Key Vault secret injection
│
├── k8s/
│   ├── network-policies.yaml
│   └── rbac.yaml
│
├── scripts/
│   └── init-databases.sql             ← Local dev: create all three databases
│
├── docker-compose.yml                 ← Local development
├── .env.example
└── README.md

SECTION 02
THE PYTHON FASTAPI MICROSERVICES
Real application code: auth, products, orders, and gateway.
Each service follows the same internal structure: a FastAPI application with async SQLAlchemy for database access, Pydantic v2 for data validation, Alembic for migrations, and structured JSON logging. Here is the auth-service in full, which establishes the pattern every other service follows.
services/auth-service/app/main.py

from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import structlog
from app.api.v1 import auth, users
from app.core.config import settings
from app.db.session import engine, Base
log = structlog.get_logger()
@asynccontextmanager
async def lifespan(app: FastAPI):
log.info("auth_service.startup", env=settings.ENVIRONMENT)
async with engine.begin() as conn:
await conn.run_sync(Base.metadata.create_all)
yield
log.info("auth_service.shutdown")
app = FastAPI(
title="ShopWave Auth Service",
version="1.0.0",
docs_url="/docs" if settings.ENVIRONMENT != "production" else None,
lifespan=lifespan,
)
app.add_middleware(
CORSMiddleware,
allow_origins=settings.ALLOWED_ORIGINS,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
app.include_router(auth.router, prefix="/api/v1/auth", tags=["auth"])
app.include_router(users.router, prefix="/api/v1/users", tags=["users"])
@app.get("/health", tags=["ops"])
async def health():
return {"status": "ok", "service": "auth-service",
"version": "1.0.0", "env": settings.ENVIRONMENT}
services/auth-service/app/core/security.py

from datetime import datetime, timedelta, timezone
from typing import Any
import bcrypt
from jose import jwt, JWTError
from fastapi import HTTPException, status
from app.core.config import settings
ALGORITHM = "HS256"
def hash_password(password: str) -> str:
return bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode()
def verify_password(plain: str, hashed: str) -> bool:
return bcrypt.checkpw(plain.encode(), hashed.encode())
def create_access_token(subject: Any, expires_delta: timedelta | None = None) -> str:
expire = datetime.now(timezone.utc) + (
expires_delta or timedelta(minutes=settings.ACCESS_TOKEN_EXPIRE_MINUTES)
)
payload = {"sub": str(subject), "exp": expire, "type": "access"}
return jwt.encode(payload, settings.SECRET_KEY, algorithm=ALGORITHM)
def create_refresh_token(subject: Any) -> str:
expire = datetime.now(timezone.utc) + timedelta(days=settings.REFRESH_TOKEN_EXPIRE_DAYS)
payload = {"sub": str(subject), "exp": expire, "type": "refresh"}
return jwt.encode(payload, settings.SECRET_KEY, algorithm=ALGORITHM)
def decode_token(token: str) -> dict:
try:
return jwt.decode(token, settings.SECRET_KEY, algorithms=[ALGORITHM])
except JWTError:
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Invalid or expired token",
headers={"WWW-Authenticate": "Bearer"},
)
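decode_token delegates to python-jose, but the HS256 mechanics are simple enough to show with the standard library alone. This sketch (illustrative, not the service's code) signs and verifies a token the same way jwt.encode/jwt.decode do for HS256:

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> str:
    # JWT uses unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(payload: dict, secret: str) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_hs256(token: str, secret: str) -> dict:
    header, body, sig = token.split(".")
    signing_input = f"{header}.{body}".encode()
    expected = b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    # Constant-time comparison, exactly as a real library does it
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("expired")
    return claims

token = sign_hs256({"sub": "user-1", "type": "access", "exp": int(time.time()) + 1800}, "s3cret")
print(verify_hs256(token, "s3cret")["sub"])  # user-1
```

This is also why SECRET_KEY lives in Key Vault: anyone holding it can mint valid access tokens for any user.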
services/auth-service/app/api/v1/auth.py

from fastapi import APIRouter, Depends, HTTPException, status
from sqlalchemy.ext.asyncio import AsyncSession
import structlog
from app.db.session import get_db
from app.db.models import User
from app.core.security import (
hash_password, verify_password,
create_access_token, create_refresh_token, decode_token
)
from app.schemas.auth import (
RegisterRequest, LoginRequest, TokenResponse, RefreshRequest
)
router = APIRouter()
log = structlog.get_logger()
@router.post("/register", response_model=TokenResponse, status_code=201)
async def register(body: RegisterRequest, db: AsyncSession = Depends(get_db)):
existing = await User.get_by_email(db, body.email)
if existing:
raise HTTPException(status_code=400, detail="Email already registered")
user = User(
email=body.email,
full_name=body.full_name,
hashed_password=hash_password(body.password),
)
db.add(user)
await db.commit()
await db.refresh(user)
log.info("user.registered", user_id=str(user.id), email=user.email)
return TokenResponse(
access_token=create_access_token(user.id),
refresh_token=create_refresh_token(user.id),
token_type="bearer",
)
@router.post("/login", response_model=TokenResponse)
async def login(body: LoginRequest, db: AsyncSession = Depends(get_db)):
user = await User.get_by_email(db, body.email)
if not user or not verify_password(body.password, user.hashed_password):
raise HTTPException(status_code=401, detail="Invalid credentials")
if not user.is_active:
raise HTTPException(status_code=403, detail="Account disabled")
log.info("user.login", user_id=str(user.id))
return TokenResponse(
access_token=create_access_token(user.id),
refresh_token=create_refresh_token(user.id),
token_type="bearer",
)
@router.post("/refresh", response_model=TokenResponse)
async def refresh(body: RefreshRequest):
payload = decode_token(body.refresh_token)
if payload.get("type") != "refresh":
raise HTTPException(status_code=401, detail="Invalid token type")
user_id = payload["sub"]
return TokenResponse(
access_token=create_access_token(user_id),
refresh_token=create_refresh_token(user_id),
token_type="bearer",
)
services/auth-service/app/db/models.py

import uuid
from datetime import datetime, timezone
from sqlalchemy import String, Boolean, DateTime
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
class Base(DeclarativeBase):
pass
class User(Base):
__tablename__ = "users"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
email: Mapped[str] = mapped_column(String(255), unique=True, nullable=False, index=True)
full_name: Mapped[str] = mapped_column(String(255), nullable=False)
hashed_password: Mapped[str] = mapped_column(String(255), nullable=False)
is_active: Mapped[bool] = mapped_column(Boolean, default=True)
is_superuser: Mapped[bool] = mapped_column(Boolean, default=False)
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc))
updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), onupdate=lambda: datetime.now(timezone.utc))
@classmethod
async def get_by_email(cls, db: AsyncSession, email: str) -> "User | None":
result = await db.execute(select(cls).where(cls.email == email))
return result.scalar_one_or_none()
services/auth-service/Dockerfile

# ── Stage 1: build & test ──────────────────────────────
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --upgrade pip && pip install --no-cache-dir --prefix=/install -r requirements.txt
# ── Stage 2: production image ─────────────────────────
FROM python:3.12-slim AS production
# Security: non-root user
RUN groupadd -r shopwave && useradd -r -g shopwave -s /sbin/nologin shopwave
WORKDIR /app
# Copy installed dependencies from builder
COPY --from=builder /install /usr/local
# Copy application source
COPY app/ ./app/
COPY alembic/ ./alembic/
COPY alembic.ini .
# Set ownership
RUN chown -R shopwave:shopwave /app
USER shopwave
EXPOSE 8001
# One worker per pod; the HPA handles horizontal scale. Structured JSON logs come from structlog in-app.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8001", "--workers", "1"]
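The Dockerfile copies app/ and alembic/ explicitly, but a .dockerignore still keeps the build context small and stops local artefacts leaking into layers. A plausible version (not shown in the repo listing above, so treat this as a suggestion):

```text
__pycache__/
*.pyc
.pytest_cache/
.venv/
tests/
.env
.git/
```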
services/auth-service/requirements.txt

# Web framework
fastapi==0.115.0
uvicorn[standard]==0.32.0
httpx==0.27.2
# Database
sqlalchemy[asyncio]==2.0.36
asyncpg==0.30.0
alembic==1.13.3
# Auth & security
python-jose[cryptography]==3.3.0
bcrypt==4.2.0
passlib==1.7.4
# Config & validation
pydantic==2.9.2
pydantic-settings==2.6.1
# Observability
structlog==24.4.0
opentelemetry-sdk==1.27.0
opentelemetry-instrumentation-fastapi==0.48b0
prometheus-fastapi-instrumentator==7.0.0
# Testing
pytest==8.3.3
pytest-asyncio==0.24.0
pytest-cov==5.0.0
SECTION 03
GITHUB ACTIONS: CI PIPELINES
Per-service pipelines with lint, test, security scan, and image push to ACR.
Each service has its own CI workflow triggered by pull requests and pushes to main. The paths filter ensures only the relevant service pipeline runs when its code changes. Every pipeline follows the same five-stage flow.
CI Pipeline Flow, Per Service

① CHECKOUT (actions/checkout@v4) → ② LINT + FORMAT (ruff · black · mypy) → ③ UNIT TESTS (pytest --cov, ≥ 80%) → ④ SECURITY SCAN (Trivy + Semgrep) → ⑤ BUILD + PUSH (ACR, sha tag). A failure at any stage blocks the PR, fires a Slack alert, and pushes no image.

CI Pipeline: auth-service
.github/workflows/ci-auth-service.yml

name: CI - auth-service
on:
push:
branches: [main, develop]
paths:
- "services/auth-service/**"
- ".github/workflows/ci-auth-service.yml"
pull_request:
branches: [main]
paths:
- "services/auth-service/**"
env:
SERVICE: auth-service
IMAGE_NAME: shopwave/auth-service
PYTHON_VERSION: "3.12"
REGISTRY: ${{ secrets.ACR_LOGIN_SERVER }}
defaults:
run:
working-directory: services/auth-service
jobs:
# ──────────────────────────────────────────────
# JOB 1: Lint & Static Analysis
# ──────────────────────────────────────────────
lint:
name: "Lint & Type Check"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
cache: pip
cache-dependency-path: services/auth-service/requirements.txt
- name: Install dev dependencies
run: pip install ruff black mypy
- name: Ruff lint
run: ruff check app/ tests/
- name: Black format check
run: black --check app/ tests/
- name: Mypy type check
run: mypy app/ --ignore-missing-imports
# ──────────────────────────────────────────────
# JOB 2: Unit & Integration Tests
# ──────────────────────────────────────────────
test:
name: "Tests (Python ${{ matrix.python-version }})"
runs-on: ubuntu-latest
needs: lint
strategy:
matrix:
python-version: ["3.11", "3.12"]
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_USER: shopwave_test
POSTGRES_PASSWORD: shopwave_test
POSTGRES_DB: auth_test
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
env:
DATABASE_URL: postgresql+asyncpg://shopwave_test:shopwave_test@localhost:5432/auth_test
SECRET_KEY: ci-test-secret-key-not-for-production
ENVIRONMENT: test
ACCESS_TOKEN_EXPIRE_MINUTES: 30
REFRESH_TOKEN_EXPIRE_DAYS: 7
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: pip
cache-dependency-path: services/auth-service/requirements.txt
- name: Install dependencies
run: pip install -r requirements.txt
- name: Run Alembic migrations (test DB)
run: alembic upgrade head
- name: Run tests with coverage
run: |
pytest tests/ --cov=app --cov-report=xml --cov-report=term-missing --cov-fail-under=80 -v
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage.xml
flags: auth-service
# ──────────────────────────────────────────────
# JOB 3: SAST (Semgrep)
# ──────────────────────────────────────────────
sast:
name: "SAST (Semgrep)"
runs-on: ubuntu-latest
needs: lint
permissions:
security-events: write
steps:
- uses: actions/checkout@v4
- name: Run Semgrep
uses: semgrep/semgrep-action@v1
with:
config: >-
p/python
p/owasp-top-ten
p/jwt
p/sql-injection
generateSarif: "1"
- name: Upload SARIF to GitHub Security
uses: github/codeql-action/upload-sarif@v3
with:
sarif_file: semgrep.sarif
# ──────────────────────────────────────────────
# JOB 4: Build Docker Image & Trivy Scan
# ──────────────────────────────────────────────
build-and-scan:
name: "Build Image & Security Scan"
runs-on: ubuntu-latest
needs: [test, sast]
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
permissions:
contents: read
security-events: write # needed to upload the Trivy SARIF report
id-token: write # needed for Cosign keyless signing
outputs:
image-tag: ${{ steps.meta.outputs.sha-tag }}
image-digest: ${{ steps.push.outputs.digest }}
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Azure Container Registry
uses: azure/docker-login@v2
with:
login-server: ${{ env.REGISTRY }}
username: ${{ secrets.ACR_USERNAME }}
password: ${{ secrets.ACR_PASSWORD }}
- name: Generate image metadata
id: meta
run: |
SHA_SHORT=$(echo ${{ github.sha }} | cut -c1-8)
echo "sha-tag=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${SHA_SHORT}" >> $GITHUB_OUTPUT
echo "latest-tag=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest" >> $GITHUB_OUTPUT
- name: Build Docker image (multi-stage)
uses: docker/build-push-action@v6
with:
context: services/auth-service
file: services/auth-service/Dockerfile
push: false
load: true
tags: ${{ steps.meta.outputs.sha-tag }}
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
BUILD_SHA=${{ github.sha }}
BUILD_DATE=${{ github.event.head_commit.timestamp }}
- name: Trivy scan image for CVEs
uses: aquasecurity/trivy-action@master
with:
image-ref: ${{ steps.meta.outputs.sha-tag }}
format: sarif
output: trivy-results.sarif
severity: CRITICAL,HIGH
exit-code: "1" # Fail pipeline on CRITICAL/HIGH with fix available
ignore-unfixed: true
- name: Upload Trivy SARIF to GitHub Security
uses: github/codeql-action/upload-sarif@v3
if: always()
with:
sarif_file: trivy-results.sarif
- name: Push image to ACR (only after clean scan)
id: push
uses: docker/build-push-action@v6
with:
context: services/auth-service
file: services/auth-service/Dockerfile
push: true
tags: |
${{ steps.meta.outputs.sha-tag }}
${{ steps.meta.outputs.latest-tag }}
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Install Cosign
uses: sigstore/cosign-installer@v3
- name: Sign image with Cosign (keyless, via OIDC)
run: cosign sign --yes ${{ steps.meta.outputs.sha-tag }}@${{ steps.push.outputs.digest }}
# ──────────────────────────────────────────────
# JOB 5: Notify on failure
# ──────────────────────────────────────────────
notify-failure:
name: "Notify on Failure"
runs-on: ubuntu-latest
needs: [lint, test, sast, build-and-scan]
if: failure()
steps:
- name: Slack alert
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "❌ CI FAILED: *auth-service* on branch `${{ github.ref_name }}`",
"blocks": [{
"type": "section",
"text": { "type": "mrkdwn",
"text": "❌ *auth-service CI failed*\nBranch: `${{ github.ref_name }}`\nCommit: `${{ github.sha }}`\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>"
}
}]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

💡 Path-based CI for the other services
The CI pipelines for product-service, order-service, and api-gateway are identical in structure; only the paths filter, the SERVICE env var, the IMAGE_NAME, the exposed port, and any service-specific test environment variables change. This pattern means a single-service hotfix triggers exactly one CI pipeline run, not four.
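Concretely, here is roughly what changes when cloning the auth-service workflow for product-service (an illustrative fragment, not a complete workflow):

```yaml
# ci-product-service.yml — only these values differ from the auth-service workflow
on:
  push:
    paths:
      - "services/product-service/**"
      - ".github/workflows/ci-product-service.yml"
env:
  SERVICE: product-service
  IMAGE_NAME: shopwave/product-service
defaults:
  run:
    working-directory: services/product-service
```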
SECTION 04
GITHUB ACTIONS: CD STAGING PIPELINE
Automatic deploy to AKS staging on every merge to main.
When any service's CI pipeline completes successfully on main, the staging CD pipeline automatically deploys the new image to the shopwave-staging namespace in AKS. It uses Helm with per-environment values files, runs smoke tests post-deploy, and rolls back automatically if the smoke tests fail.
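The deploy job logs in to Azure with GitHub's OIDC token instead of stored credentials. That only works because the Terraform root module (Section 01) creates a federated identity credential tying the Entra ID application to this repository. A hedged HCL sketch of that federation (resource labels and the org placeholder are illustrative):

```hcl
resource "azuread_application_federated_identity_credential" "github_staging" {
  application_id = azuread_application.github_actions.id
  display_name   = "shopwave-github-staging"
  issuer         = "https://token.actions.githubusercontent.com"
  audiences      = ["api://AzureADTokenExchange"]
  # Only workflow runs targeting the 'staging' environment can assume this identity
  subject        = "repo:YOUR_ORG/shopwave:environment:staging"
}
```

Because the subject pins the repository and environment, a fork or an unprotected branch cannot obtain Azure credentials even with the workflow file copied verbatim.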
CD Staging Flow: triggered after CI success on main

AZ LOGIN (OIDC, federated) → KUBECONFIG (get AKS creds) → HELM UPGRADE (--atomic --wait) → SMOKE TESTS (health check with retries) → PASS: notify + tag staging-ready. On smoke-test failure: automatic helm rollback.

.github/workflows/cd-staging.yml

name: CD - Deploy to Staging
on:
workflow_run:
workflows:
- "CI - auth-service"
- "CI - product-service"
- "CI - order-service"
- "CI - api-gateway"
types: [completed]
branches: [main]
permissions:
id-token: write # Required for OIDC Azure login
contents: write # Required to push the staging-ready git tag
env:
RESOURCE_GROUP: shopwave-rg
AKS_CLUSTER: shopwave-aks
NAMESPACE: shopwave-staging
REGISTRY: ${{ secrets.ACR_LOGIN_SERVER }}
HELM_CHART: ./helm/shopwave
jobs:
deploy-staging:
name: "Deploy ${{ github.event.workflow_run.name }} to Staging"
runs-on: ubuntu-latest
if: github.event.workflow_run.conclusion == 'success'
environment:
name: staging
url: https://staging.shopwave.io
steps:
- uses: actions/checkout@v4
# ── Azure OIDC login (no stored credentials) ──
- name: Login to Azure via OIDC
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
# ── Get AKS kubeconfig ──
- name: Get AKS credentials
uses: azure/aks-set-context@v4
with:
resource-group: ${{ env.RESOURCE_GROUP }}
cluster-name: ${{ env.AKS_CLUSTER }}
# ── Resolve which service changed + its new image tag ──
- name: Resolve service and image tag
id: resolve
run: |
WORKFLOW="${{ github.event.workflow_run.name }}"
SHA=$(echo ${{ github.event.workflow_run.head_sha }} | cut -c1-8)
case "$WORKFLOW" in
"CI - auth-service") SVC="auth-service"; PORT=8001 ;;
"CI - product-service") SVC="product-service"; PORT=8002 ;;
"CI - order-service") SVC="order-service"; PORT=8003 ;;
"CI - api-gateway") SVC="api-gateway"; PORT=8000 ;;
esac
IMAGE="${{ env.REGISTRY }}/shopwave/${SVC}:sha-${SHA}"
echo "service=${SVC}" >> $GITHUB_OUTPUT
echo "image=${IMAGE}" >> $GITHUB_OUTPUT
echo "port=${PORT}" >> $GITHUB_OUTPUT
echo "sha=sha-${SHA}" >> $GITHUB_OUTPUT
# ── Install / upgrade Helm release ──
- name: Setup Helm
uses: azure/setup-helm@v4
with:
version: "3.16.0"
- name: Helm upgrade β ${{ steps.resolve.outputs.service }}
id: helm
run: |
helm upgrade --install ${{ steps.resolve.outputs.service }} ${{ env.HELM_CHART }} \
  --namespace ${{ env.NAMESPACE }} --create-namespace \
  --values helm/shopwave/values.yaml \
  --values helm/shopwave/values-staging.yaml \
  --set image.repository=${{ env.REGISTRY }}/shopwave/${{ steps.resolve.outputs.service }} \
  --set image.tag=${{ steps.resolve.outputs.sha }} \
  --set service.port=${{ steps.resolve.outputs.port }} \
  --atomic --timeout 5m --wait --history-max 10
# ── Post-deploy smoke tests ──
- name: Smoke test β health endpoint
id: smoke
run: |
SVC="${{ steps.resolve.outputs.service }}"
PORT="${{ steps.resolve.outputs.port }}"
BASE="https://staging.shopwave.io/${SVC}"
echo "Running smoke tests for ${SVC}..."
# Health check with retry
for i in $(seq 1 10); do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "${BASE}/health" -H "Host: staging.shopwave.io" --max-time 5)
if [ "$STATUS" = "200" ]; then
echo "✅ Health check passed (attempt ${i}/10)"
exit 0
fi
echo "Attempt ${i}/10 β status: ${STATUS}, retrying in 10s..."
sleep 10
done
echo "❌ Smoke test failed after 10 attempts"
exit 1
# ── Auto-rollback on smoke test failure ──
- name: Rollback on smoke failure
if: failure() && steps.smoke.outcome == 'failure'
run: |
echo "⚠️ Rolling back ${{ steps.resolve.outputs.service }}..."
helm rollback ${{ steps.resolve.outputs.service }} --namespace ${{ env.NAMESPACE }} --wait --timeout 3m
# ── Tag commit as staging-ready ──
- name: Tag commit as staging-ready
if: success()
run: |
git tag "staging-${{ steps.resolve.outputs.service }}-${{ steps.resolve.outputs.sha }}" ${{ github.event.workflow_run.head_sha }}
git push origin "staging-${{ steps.resolve.outputs.service }}-${{ steps.resolve.outputs.sha }}"
# ── Slack notification ──
- name: Notify Slack
if: always()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "${{ job.status == 'success' && '✅' || '❌' }} Staging deploy: *${{ steps.resolve.outputs.service }}* → `${{ steps.resolve.outputs.sha }}`",
"blocks": [{
"type": "section",
"text": { "type": "mrkdwn",
"text": "${{ job.status == 'success' && '✅ *Staging deploy succeeded*' || '❌ *Staging deploy FAILED*' }}\nService: `${{ steps.resolve.outputs.service }}`\nImage: `${{ steps.resolve.outputs.sha }}`\nEnvironment: *staging*"
}
}]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

SECTION 05
GITHUB ACTIONS: CD PRODUCTION PIPELINE
Manual approval gate, rolling deploy, automatic rollback.
Production deployments require a manual approval from a designated reviewer in the GitHub production environment. Once approved, the same immutable image SHA that passed staging is deployed β never rebuilt. The pipeline uses a rolling update strategy with a Pod Disruption Budget ensuring at least one replica is always available.
CD Production Flow: Manual Gate + Rolling Deploy

TRIGGER (workflow_dispatch, or staging success, with an image SHA input) → ⏸ MANUAL APPROVAL (GitHub environment protection rules, ≥1 required reviewer, waits up to 7 days) → PRE-FLIGHT (verify image exists in ACR, check staging tag, validate SHA) → HELM UPGRADE (rolling update, maxSurge: 1, maxUnavailable: 0) → VERIFY + TAG (production smoke tests, git release tag, stakeholder notification). helm rollback fires automatically if --atomic fails.
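The maxSurge/maxUnavailable values in the flow above are set in the chart's deployment template, which is not reproduced in this section; the relevant stanza would look roughly like this:

```yaml
# helm/shopwave/templates/deployment.yaml (fragment, illustrative)
spec:
  replicas: {{ .Values.replicaCount }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring up one extra pod during rollout
      maxUnavailable: 0  # never drop below the desired replica count
```

Combined with the Pod Disruption Budget (minAvailable: 1 in values.yaml, Section 06), this guarantees the service keeps serving traffic throughout a deploy.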
.github/workflows/cd-production.yml

name: CD - Deploy to Production
on:
workflow_dispatch:
inputs:
service:
description: "Service to deploy"
required: true
type: choice
options: [auth-service, product-service, order-service, api-gateway, all]
image_tag:
description: "Image SHA tag (e.g. sha-a3f7c2b1)"
required: true
type: string
reason:
description: "Deployment reason (for audit log)"
required: true
type: string
permissions:
id-token: write
contents: write
deployments: write
env:
RESOURCE_GROUP: shopwave-rg
AKS_CLUSTER: shopwave-aks
NAMESPACE: shopwave-production
REGISTRY: ${{ secrets.ACR_LOGIN_SERVER }}
HELM_CHART: ./helm/shopwave
jobs:
pre-flight:
name: "Pre-flight Checks"
runs-on: ubuntu-latest
outputs:
approved: ${{ steps.checks.outputs.approved }}
steps:
- uses: actions/checkout@v4
- name: Login to Azure via OIDC
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Verify image exists in ACR
id: checks
run: |
SERVICE="${{ github.event.inputs.service }}"
TAG="${{ github.event.inputs.image_tag }}"
if [ "$SERVICE" = "all" ]; then
SERVICES="auth-service product-service order-service api-gateway"
else
SERVICES="$SERVICE"
fi
for SVC in $SERVICES; do
IMAGE="${{ env.REGISTRY }}/shopwave/${SVC}:${TAG}"
echo "Checking ${IMAGE}..."
az acr repository show-tags \
  --name $(echo ${{ env.REGISTRY }} | cut -d. -f1) \
  --repository "shopwave/${SVC}" \
  --query "[?@=='${TAG}']" --output tsv | grep -q "${TAG}" \
  || { echo "❌ Image ${IMAGE} not found in ACR!"; exit 1; }
echo "✅ Image verified: ${IMAGE}"
done
echo "approved=true" >> $GITHUB_OUTPUT
deploy-production:
name: "Deploy to Production"
runs-on: ubuntu-latest
needs: pre-flight
if: needs.pre-flight.outputs.approved == 'true'
environment:
name: production
url: https://shopwave.io
steps:
- uses: actions/checkout@v4
- name: Audit log (deployment started)
run: |
echo "## Production Deployment" >> $GITHUB_STEP_SUMMARY
echo "| Field | Value |" >> $GITHUB_STEP_SUMMARY
echo "|---|---|" >> $GITHUB_STEP_SUMMARY
echo "| Service | \`${{ github.event.inputs.service }}\` |" >> $GITHUB_STEP_SUMMARY
echo "| Image Tag | \`${{ github.event.inputs.image_tag }}\` |" >> $GITHUB_STEP_SUMMARY
echo "| Deployed by | @${{ github.actor }} |" >> $GITHUB_STEP_SUMMARY
echo "| Reason | ${{ github.event.inputs.reason }} |" >> $GITHUB_STEP_SUMMARY
echo "| Timestamp | $(date -u +%Y-%m-%dT%H:%M:%SZ) |" >> $GITHUB_STEP_SUMMARY
- name: Login to Azure via OIDC
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Get AKS credentials
uses: azure/aks-set-context@v4
with:
resource-group: ${{ env.RESOURCE_GROUP }}
cluster-name: ${{ env.AKS_CLUSTER }}
- name: Setup Helm
uses: azure/setup-helm@v4
with:
version: "3.16.0"
- name: Resolve services to deploy
id: services
run: |
if [ "${{ github.event.inputs.service }}" = "all" ]; then
echo "list=auth-service product-service order-service api-gateway" >> $GITHUB_OUTPUT
else
echo "list=${{ github.event.inputs.service }}" >> $GITHUB_OUTPUT
fi
- name: Helm upgrade to production (rolling update)
id: deploy
run: |
TAG="${{ github.event.inputs.image_tag }}"
for SVC in ${{ steps.services.outputs.list }}; do
PORT=$(case "$SVC" in
api-gateway) echo 8000 ;;
auth-service) echo 8001 ;;
product-service) echo 8002 ;;
order-service) echo 8003 ;;
esac)
echo "🚀 Deploying ${SVC}:${TAG} to production..."
helm upgrade --install "${SVC}" ${{ env.HELM_CHART }} \
  --namespace ${{ env.NAMESPACE }} --create-namespace \
  --values helm/shopwave/values.yaml \
  --values helm/shopwave/values-production.yaml \
  --set image.repository=${{ env.REGISTRY }}/shopwave/${SVC} \
  --set image.tag=${TAG} \
  --set service.port=${PORT} \
  --atomic --timeout 8m --wait --history-max 5
echo "✅ ${SVC} deployed successfully"
done
# ── Production smoke tests ──
- name: Production smoke tests
id: smoke
run: |
BASE="https://shopwave.io"
ENDPOINTS=(
"${BASE}/api/v1/health"
"${BASE}/auth/health"
"${BASE}/products/health"
"${BASE}/orders/health"
)
ALL_PASS=true
for URL in "${ENDPOINTS[@]}"; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$URL" --max-time 10)
if [ "$STATUS" = "200" ]; then
echo "✅ $URL → $STATUS"
else
echo "❌ $URL → $STATUS"
ALL_PASS=false
fi
done
[ "$ALL_PASS" = "true" ] || exit 1
# ── Create GitHub Release ──
- name: Create GitHub Release
if: success()
uses: softprops/action-gh-release@v2 # actions/create-release@v1 is archived; this maintained action fills the same role
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: "prod-${{ github.event.inputs.image_tag }}-${{ github.run_number }}"
name: "Production Release ${{ github.event.inputs.image_tag }}"
body: |
## Production Deployment
**Service(s):** `${{ github.event.inputs.service }}`
**Image Tag:** `${{ github.event.inputs.image_tag }}`
**Deployed by:** @${{ github.actor }}
**Reason:** ${{ github.event.inputs.reason }}
# ── Slack stakeholder notification ──
- name: Notify Slack stakeholders
if: always()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "${{ job.status == 'success' && '🎉' || '💥' }} Production deploy: *${{ github.event.inputs.service }}*",
"blocks": [
{
"type": "header",
"text": { "type": "plain_text", "text": "${{ job.status == 'success' && '🎉 Production Deploy Succeeded' || '💥 Production Deploy FAILED' }}" }
},
{
"type": "section",
"fields": [
{ "type": "mrkdwn", "text": "*Service:*\n`${{ github.event.inputs.service }}`" },
{ "type": "mrkdwn", "text": "*Tag:*\n`${{ github.event.inputs.image_tag }}`" },
{ "type": "mrkdwn", "text": "*By:*\n@${{ github.actor }}" },
{ "type": "mrkdwn", "text": "*Reason:*\n${{ github.event.inputs.reason }}" }
]
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
SECTION 06
HELM CHART – KUBERNETES MANIFESTS One chart, all four services, environment-specific overrides.
helm/shopwave/values.yaml
# Default values – overridden per environment
replicaCount: 2
image:
repository: "" # Set by pipeline: ACR_URL/shopwave/SERVICE
tag: "latest" # Overridden by pipeline with SHA tag
pullPolicy: Always
service:
type: ClusterIP
port: 8000 # Overridden per service
ingress:
enabled: true
className: nginx
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
tls:
- secretName: shopwave-tls
hosts: [] # Set per environment
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
podDisruptionBudget:
enabled: true
minAvailable: 1
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 20
periodSeconds: 15
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3
env:
ENVIRONMENT: production
LOG_LEVEL: INFO
# Secrets injected from Azure Key Vault via External Secrets Operator
externalSecrets:
enabled: true
secretStoreName: azure-keyvault-store
refreshInterval: "1h"
secrets: [] # Defined per service in values-*.yaml
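Both probes above hit the same /health path. The services' actual handlers aren't shown in this walkthrough; as an illustration of the pattern, a readiness response can aggregate cheap dependency checks (hypothetical helper, not ShopWave code):

```python
# Hypothetical helper (not ShopWave's actual handler code): a /health
# response that aggregates cheap dependency checks.
from typing import Callable, Dict

def readiness(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named check; report ok only if every dependency passes."""
    results = {name: check() for name, check in checks.items()}
    return {
        "status": "ok" if all(results.values()) else "degraded",
        "checks": results,
    }
```

Keep liveness trivially cheap (process-up only); readiness is the place to add, say, a `SELECT 1` against the service's database, so a broken dependency takes the pod out of rotation without restarting it.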
helm/shopwave/values-production.yaml
replicaCount: 3
image:
pullPolicy: Always
ingress:
tls:
- secretName: shopwave-prod-tls
hosts:
- shopwave.io
- "*.shopwave.io"
resources:
requests:
cpu: "200m"
memory: "256Mi"
limits:
cpu: "1000m"
memory: "1Gi"
autoscaling:
minReplicas: 3
maxReplicas: 20
env:
ENVIRONMENT: production
LOG_LEVEL: WARNING
externalSecrets:
secrets:
- secretKey: DATABASE_URL
remoteRef:
key: shopwave-prod-db-url
- secretKey: SECRET_KEY
remoteRef:
key: shopwave-prod-jwt-secret
- secretKey: ALLOWED_ORIGINS
remoteRef:
key: shopwave-prod-allowed-origins
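Helm merges values files in the order the --values flags are passed, with later files winning and nested maps merged key by key. A small Python sketch of that behaviour (illustrative only; Helm performs this merge itself):

```python
# Illustrative sketch of Helm's values-file merge: later files override
# earlier ones, and nested dictionaries merge recursively.
from copy import deepcopy

def merge_values(base: dict, override: dict) -> dict:
    out = deepcopy(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge_values(out[key], val)  # recurse into maps
        else:
            out[key] = val  # scalars and lists are replaced wholesale
    return out

# Trimmed-down versions of the two files above
base = {"replicaCount": 2,
        "autoscaling": {"minReplicas": 2, "maxReplicas": 10},
        "env": {"ENVIRONMENT": "production", "LOG_LEVEL": "INFO"}}
prod = {"replicaCount": 3,
        "autoscaling": {"minReplicas": 3, "maxReplicas": 20},
        "env": {"LOG_LEVEL": "WARNING"}}

merged = merge_values(base, prod)
```

Note that keys absent from the override (like `ENVIRONMENT` here) survive from the base file, which is why values-production.yaml only needs to state what differs.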
helm/shopwave/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "shopwave.fullname" . }}
namespace: {{ .Release.Namespace }}
labels:
{{- include "shopwave.labels" . | nindent 4 }}
annotations:
deployment.kubernetes.io/revision: "{{ .Release.Revision }}"
spec:
replicas: {{ .Values.replicaCount }}
selector:
matchLabels:
{{- include "shopwave.selectorLabels" . | nindent 6 }}
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero-downtime: never kill before new pod ready
template:
metadata:
labels:
{{- include "shopwave.selectorLabels" . | nindent 8 }}
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "{{ .Values.service.port }}"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: {{ include "shopwave.serviceAccountName" . }}
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: {{ .Values.service.port }}
protocol: TCP
env:
{{- range $key, $val := .Values.env }}
- name: {{ $key }}
value: {{ $val | quote }}
{{- end }}
envFrom:
- secretRef:
name: {{ include "shopwave.fullname" . }}-secrets
resources:
{{- toYaml .Values.resources | nindent 12 }}
livenessProbe:
{{- toYaml .Values.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.readinessProbe | nindent 12 }}
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: [ALL]
SECTION 07
TERRAFORM – INFRASTRUCTURE AS CODE Every Azure resource declared, versioned, and applied through a pipeline.
All Azure infrastructure is managed by Terraform – no manual portal clicks, no ad-hoc CLI commands. The configuration is split into five focused modules, each owning a single concern. The root main.tf wires them together and additionally provisions GitHub Actions OIDC federation so the CD pipelines can authenticate to Azure without any stored credentials.
Terraform Module Dependency Graph
[Diagram] Module dependency graph: ROOT main.tf (OIDC · namespaces · role assignments) → networking (VNet · subnets · NSGs · DNS zone), acr (Container Registry · admin creds), keyvault (RBAC auth · seeds JWT secret), aks (cluster · node pools · Log Analytics · Defender), postgres (Flexible Server · 3 databases → Key Vault).
infrastructure/terraform/main.tf (root – key excerpts)
The root module wires the five child modules together, passing outputs as inputs between them (e.g. the networking module's subnet IDs flow into both aks and postgres), and finishes by configuring GitHub Actions OIDC federation and creating the Kubernetes namespaces.
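Terraform derives this ordering automatically from cross-module references and depends_on declarations; to make the apply order explicit, the graph can be sketched as a topological sort:

```python
# Sketch of the module dependency graph, matching the depends_on wiring
# in main.tf. Terraform computes this ordering itself; shown only to make
# the apply order explicit. Each entry maps a module to its predecessors.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

deps = {
    "networking": set(),
    "keyvault": set(),
    "acr": set(),
    "aks": {"networking", "acr", "keyvault"},
    "postgres": {"networking", "keyvault"},
}

order = list(TopologicalSorter(deps).static_order())
# aks and postgres always come after the modules whose outputs they consume
assert order.index("aks") > order.index("networking")
```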
Step 0 – Bootstrap remote state (run once)
Before running terraform init you need an Azure Storage account to hold the state file. A small helper script creates it and prints the values you need to paste into the backend block in main.tf.
infrastructure/terraform/scripts/bootstrap-tfstate.sh
#!/usr/bin/env bash
# Run ONCE before terraform init.
# Creates the Azure Blob Storage backend for Terraform remote state.
set -euo pipefail
TFSTATE_RG="shopwave-tfstate-rg"
TFSTATE_SA="shopwavetfstate${RANDOM}" # globally unique suffix
LOCATION="${1:-eastus}"
az group create --name "$TFSTATE_RG" --location "$LOCATION" --output none
az storage account create \
--name "$TFSTATE_SA" --resource-group "$TFSTATE_RG" \
--sku Standard_LRS --kind StorageV2 \
--min-tls-version TLS1_2 --allow-blob-public-access false --output none
# Enable versioning β protects state from accidental overwrites
az storage account blob-service-properties update \
--account-name "$TFSTATE_SA" --resource-group "$TFSTATE_RG" \
--enable-versioning true --output none
az storage container create \
--name tfstate --account-name "$TFSTATE_SA" --auth-mode login --output none
echo "Update main.tf backend block with:"
echo " storage_account_name = \"$TFSTATE_SA\""
echo " resource_group_name = \"$TFSTATE_RG\""
echo "Then: cp terraform.tfvars.example terraform.tfvars && terraform init"
infrastructure/terraform/main.tf – key excerpts
The root module calls all five child modules in dependency order and then handles cross-cutting concerns: GitHub Actions OIDC federation, role assignments, and Kubernetes namespace creation via the kubernetes provider (which is configured using the AKS module's output).
infrastructure/terraform/main.tf
terraform {
required_version = ">= 1.7.0"
required_providers {
azurerm = { source = "hashicorp/azurerm", version = "~> 3.110" }
azuread = { source = "hashicorp/azuread", version = "~> 2.53" }
random = { source = "hashicorp/random", version = "~> 3.6" }
helm = { source = "hashicorp/helm", version = "~> 2.14" }
kubernetes = { source = "hashicorp/kubernetes", version = "~> 2.31" }
}
backend "azurerm" {
resource_group_name = "shopwave-tfstate-rg"
storage_account_name = "shopwavetfstate" # from bootstrap-tfstate.sh
container_name = "tfstate"
key = "shopwave.tfstate"
}
}
module "networking" {
source = "./modules/networking"
project = var.project
environment = var.environment
location = var.location
resource_group_name = azurerm_resource_group.main.name
vnet_address_space = var.vnet_address_space
aks_subnet_cidr = var.aks_subnet_cidr
pg_subnet_cidr = var.pg_subnet_cidr
tags = local.common_tags
}
module "keyvault" {
source = "./modules/keyvault"
project = var.project
environment = var.environment
location = var.location
resource_group_name = azurerm_resource_group.main.name
tenant_id = data.azurerm_client_config.current.tenant_id
admin_object_ids = var.keyvault_admin_object_ids
tags = local.common_tags
}
module "acr" {
source = "./modules/acr"
project = var.project
environment = var.environment
location = var.location
resource_group_name = azurerm_resource_group.main.name
sku = var.acr_sku
tags = local.common_tags
}
module "aks" {
source = "./modules/aks"
project = var.project
environment = var.environment
location = var.location
resource_group_name = azurerm_resource_group.main.name
subnet_id = module.networking.aks_subnet_id # ← cross-module wiring
acr_id = module.acr.id
keyvault_id = module.keyvault.id
node_count = var.aks_node_count
node_vm_size = var.aks_node_vm_size
kubernetes_version = var.kubernetes_version
min_node_count = var.aks_min_node_count
max_node_count = var.aks_max_node_count
tags = local.common_tags
depends_on = [module.networking, module.acr, module.keyvault]
}
module "postgres" {
source = "./modules/postgres"
project = var.project
environment = var.environment
location = var.location
resource_group_name = azurerm_resource_group.main.name
subnet_id = module.networking.pg_subnet_id # ← private subnet
private_dns_zone_id = module.networking.pg_private_dns_zone_id
sku_name = var.postgres_sku
storage_mb = var.postgres_storage_mb
postgres_version = var.postgres_version
databases = var.postgres_databases
keyvault_id = module.keyvault.id # ← stores connection strings here
tags = local.common_tags
depends_on = [module.networking, module.keyvault]
}
# ── GitHub Actions OIDC federation ──────────────────────────────────────────
resource "azuread_application" "github_actions" {
display_name = "${var.project}-github-actions-${var.environment}"
}
resource "azuread_service_principal" "github_actions" {
client_id = azuread_application.github_actions.client_id
}
resource "azuread_application_federated_identity_credential" "staging" {
application_id = azuread_application.github_actions.id
display_name = "${var.project}-staging"
audiences = ["api://AzureADTokenExchange"]
issuer = "https://token.actions.githubusercontent.com"
subject = "repo:${var.github_org}/${var.github_repo}:environment:staging"
}
resource "azuread_application_federated_identity_credential" "production" {
application_id = azuread_application.github_actions.id
display_name = "${var.project}-production"
audiences = ["api://AzureADTokenExchange"]
issuer = "https://token.actions.githubusercontent.com"
subject = "repo:${var.github_org}/${var.github_repo}:environment:production"
}
# Role assignments – principle of least privilege
resource "azurerm_role_assignment" "gh_acr_push" {
scope = module.acr.id
role_definition_name = "AcrPush"
principal_id = azuread_service_principal.github_actions.id
}
resource "azurerm_role_assignment" "gh_aks_user" {
scope = module.aks.id
role_definition_name = "Azure Kubernetes Service Cluster User Role"
principal_id = azuread_service_principal.github_actions.id
}
resource "azurerm_role_assignment" "gh_kv_secrets" {
scope = module.keyvault.id
role_definition_name = "Key Vault Secrets User"
principal_id = azuread_service_principal.github_actions.id
}
# Kubernetes namespaces – created via k8s provider post-AKS provisioning
resource "kubernetes_namespace" "staging" {
metadata {
name = "shopwave-staging"
labels = { environment = "staging" }
}
depends_on = [module.aks]
}
resource "kubernetes_namespace" "production" {
metadata {
name = "shopwave-production"
labels = { environment = "production" }
}
depends_on = [module.aks]
}
modules/aks/main.tf – production-hardened cluster
The AKS module creates two node pools: a tainted system pool for Kubernetes internals and an apps pool for ShopWave workloads. Workload Identity, OIDC issuer, Container Insights, Microsoft Defender for Containers, and the Key Vault CSI driver are all enabled by default.
infrastructure/terraform/modules/aks/main.tf
resource "azurerm_kubernetes_cluster" "main" {
name = "${var.project}-aks-${var.environment}"
location = var.location
resource_group_name = var.resource_group_name
dns_prefix = "${var.project}-${var.environment}"
kubernetes_version = var.kubernetes_version
oidc_issuer_enabled = true # ← required for Workload Identity
workload_identity_enabled = true
default_node_pool { # system pool – kube-system only
name = "system"
vm_size = var.node_vm_size
vnet_subnet_id = var.subnet_id
enable_auto_scaling = true
min_count = var.min_node_count
max_count = var.max_node_count
only_critical_addons_enabled = true # taint: NoSchedule for user workloads
upgrade_settings { max_surge = "33%" }
}
identity { type = "SystemAssigned" }
network_profile {
network_plugin = "azure"
network_policy = "azure" # enforces NetworkPolicy objects
load_balancer_sku = "standard"
service_cidr = "172.16.0.0/16"
dns_service_ip = "172.16.0.10"
}
oms_agent { # Container Insights
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
microsoft_defender { # Defender for Containers
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
key_vault_secrets_provider { # CSI driver for Key Vault secret injection
secret_rotation_enabled = true
secret_rotation_interval = "2m"
}
automatic_channel_upgrade = "patch"
tags = var.tags
}
# App node pool β user workloads land here (no taint)
resource "azurerm_kubernetes_cluster_node_pool" "apps" {
name = "apps"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = var.node_vm_size
vnet_subnet_id = var.subnet_id
enable_auto_scaling = true
min_count = var.min_node_count
max_count = var.max_node_count
mode = "User"
tags = var.tags
}
# AKS pulls images from ACR without a stored password
resource "azurerm_role_assignment" "aks_acr_pull" {
scope = var.acr_id
role_definition_name = "AcrPull"
principal_id = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}
modules/postgres/main.tf – secrets auto-stored in Key Vault
The postgres module generates a random password, provisions the Flexible Server with per-service databases, and writes the full connection strings directly into Key Vault – so they're available to the External Secrets Operator without any manual copy-paste step.
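As a concrete illustration of the naming scheme (Key Vault secret names cannot contain underscores, so database keys are hyphenated), here is a hypothetical Python equivalent of the derivation the module performs; the host and credentials below are placeholders:

```python
# Hypothetical mirror of the module's derivation, for illustration only.
# Key Vault secret names may not contain underscores, so "auth_db" becomes
# "<project>-<env>-auth-db-url". The URL shape follows the asyncpg comment
# in the module; values here are placeholders, not real credentials.
def kv_secret_name(project: str, environment: str, db: str) -> str:
    return f"{project}-{environment}-{db.replace('_', '-')}-url"

def db_url(user: str, password: str, host: str, db: str) -> str:
    return f"postgresql+asyncpg://{user}:{password}@{host}/{db}?ssl=require"

assert kv_secret_name("shopwave", "production", "auth_db") == "shopwave-production-auth-db-url"
```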
infrastructure/terraform/modules/postgres/main.tf
resource "random_password" "postgres" {
length = 32
special = true
override_special = "!#$%&*()-_=+[]{}<>?"
}
resource "azurerm_postgresql_flexible_server" "main" {
name = "${var.project}-pg-${var.environment}"
resource_group_name = var.resource_group_name
location = var.location
version = var.postgres_version
delegated_subnet_id = var.subnet_id # private VNet access only
private_dns_zone_id = var.private_dns_zone_id
administrator_login = local.pg_admin
administrator_password = random_password.postgres.result
sku_name = var.sku_name
storage_mb = var.storage_mb
backup_retention_days = var.environment == "production" ? 14 : 7
geo_redundant_backup_enabled = var.environment == "production"
high_availability {
mode = var.environment == "production" ? "ZoneRedundant" : "Disabled"
}
tags = var.tags
lifecycle { ignore_changes = [zone, high_availability[0].standby_availability_zone] }
}
# Create each per-service database
resource "azurerm_postgresql_flexible_server_database" "databases" {
for_each = toset(var.databases)
name = each.value
server_id = azurerm_postgresql_flexible_server.main.id
lifecycle { prevent_destroy = true } # never drop a production database
}
# Write connection strings into Key Vault – one per database
resource "azurerm_key_vault_secret" "db_urls" {
for_each = local.db_urls
name = "${var.project}-${var.environment}-${replace(each.key, "_", "-")}-url"
value = each.value # postgresql+asyncpg://admin:pass@host/db?ssl=require
key_vault_id = var.keyvault_id
}
infrastructure/terraform/variables.tf – fully validated
infrastructure/terraform/variables.tf
variable "project" {
type = string
default = "shopwave"
validation {
condition = can(regex("^[a-z][a-z0-9-]{2,20}$", var.project))
error_message = "Lowercase alphanumeric + hyphens, 3-20 chars."
}
}
variable "environment" {
type = string
validation {
condition = contains(["staging", "production"], var.environment)
error_message = "Must be 'staging' or 'production'."
}
}
variable "location" {
type = string
default = "eastus"
}
variable "subscription_id" {
type = string
sensitive = true
}
# Networking
variable "vnet_address_space" {
type = string
default = "10.0.0.0/16"
}
variable "aks_subnet_cidr" {
type = string
default = "10.0.1.0/24"
}
variable "pg_subnet_cidr" {
type = string
default = "10.0.2.0/24"
}
# AKS
variable "kubernetes_version" {
type = string
default = "1.30"
}
variable "aks_node_count" {
type = number
default = 3
}
variable "aks_node_vm_size" {
type = string
default = "Standard_D2s_v3"
}
variable "aks_min_node_count" {
type = number
default = 2
}
variable "aks_max_node_count" {
type = number
default = 10
}
# PostgreSQL
variable "postgres_version" {
type = string
default = "16"
}
variable "postgres_sku" {
type = string
default = "Standard_B2ms"
}
variable "postgres_storage_mb" {
type = number
default = 32768
}
variable "postgres_databases" {
type = list(string)
default = ["auth_db", "products_db", "orders_db"]
}
# Key Vault
variable "keyvault_admin_object_ids" {
type = list(string)
default = []
}
# GitHub OIDC
variable "github_org" { type = string }
variable "github_repo" {
type = string
default = "shopwave"
}
infrastructure/terraform/outputs.tf – reads straight into GitHub Secrets
infrastructure/terraform/outputs.tf
output "acr_login_server" { value = module.acr.login_server }
output "aks_cluster_name" { value = module.aks.cluster_name }
output "keyvault_name" { value = module.keyvault.name }
output "postgres_fqdn" { value = module.postgres.fqdn }
output "github_actions_client_id" { value = azuread_application.github_actions.client_id }
output "tenant_id" { value = data.azurerm_client_config.current.tenant_id }
# acr_admin_username and acr_admin_password are marked sensitive=true
# retrieve them with: terraform output -raw acr_admin_username
output "github_secrets_summary" {
sensitive = true
value = <<-EOT
AZURE_CLIENT_ID = ${azuread_application.github_actions.client_id}
AZURE_TENANT_ID = ${data.azurerm_client_config.current.tenant_id}
AZURE_SUBSCRIPTION_ID = ${var.subscription_id}
ACR_LOGIN_SERVER = ${module.acr.login_server}
EOT
}
Deploying infrastructure end-to-end
# 1 – Bootstrap remote state (once only)
cd infrastructure/terraform
chmod +x scripts/bootstrap-tfstate.sh
./scripts/bootstrap-tfstate.sh
# Update the backend block in main.tf with the printed storage account name
# 2 – Configure your environment
cp terraform.tfvars.example terraform.tfvars
# Edit: subscription_id, github_org, keyvault_admin_object_ids
# 3 – Initialise, plan, apply
terraform init
terraform plan \
-var-file="environments/production/production.tfvars" \
-var="subscription_id=$(az account show --query id -o tsv)" \
-out=tfplan
terraform apply tfplan
# 4 – Copy output values to GitHub Secrets
terraform output github_secrets_summary
terraform output -raw acr_admin_username # → ACR_USERNAME
terraform output -raw acr_admin_password # → ACR_PASSWORD
ℹ Terraform environments
Two environment-specific .tfvars files live under environments/. Staging uses smaller VM sizes (Standard_B2s) and a single-node PostgreSQL SKU. Production uses Standard_D2s_v3 nodes, zone-redundant PostgreSQL HA, 14-day backups, and a minimum of three AKS nodes. Both share the same Terraform code β only the variable values differ.
Secrets → Key Vault → pods: the full chain
After Terraform runs, all connection strings and secrets live in Key Vault. The External Secrets Operator reads them and materialises them as Kubernetes Secrets, which Helm mounts into pods via envFrom. No secret is ever written to a pipeline log or a YAML file.
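On the application side, the chain ends as ordinary environment variables: ESO materialises the Key Vault values as a Kubernetes Secret, envFrom turns its keys into env vars, and the service just reads them. A sketch of how a service might consume them (the actual ShopWave config classes aren't shown in this article):

```python
# Sketch only: how a service might read the secrets injected via envFrom.
# The env var names match the externalSecrets keys defined in
# values-production.yaml; the comments note which Key Vault secret feeds each.
import os

class Settings:
    def __init__(self) -> None:
        # Fails fast at startup if a required secret was not injected
        self.database_url = os.environ["DATABASE_URL"]  # shopwave-prod-db-url
        self.secret_key = os.environ["SECRET_KEY"]      # shopwave-prod-jwt-secret
        origins = os.environ.get("ALLOWED_ORIGINS", "") # shopwave-prod-allowed-origins
        self.allowed_origins = [o for o in origins.split(",") if o]
```

Failing fast on a missing variable is deliberate: a pod that cannot read its secrets should crash-loop visibly rather than serve requests half-configured.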
helm/shopwave/templates/externalsecret.yaml
{{- if .Values.externalSecrets.enabled }}
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: {{ include "shopwave.fullname" . }}-secrets
namespace: {{ .Release.Namespace }}
spec:
refreshInterval: {{ .Values.externalSecrets.refreshInterval }}
secretStoreRef:
name: {{ .Values.externalSecrets.secretStoreName }}
kind: ClusterSecretStore
target:
name: {{ include "shopwave.fullname" . }}-secrets
creationPolicy: Owner
data:
{{- range .Values.externalSecrets.secrets }}
- secretKey: {{ .secretKey }}
remoteRef:
key: {{ .remoteRef.key }}
version: {{ .remoteRef.version | default "latest" }}
{{- end }}
{{- end }}
SECTION 08
LOCAL DEVELOPMENT SETUP docker-compose for the full stack locally, pytest for fast iteration.
version: "3.9"
services:
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: shopwave
POSTGRES_PASSWORD: shopwave_local
POSTGRES_MULTIPLE_DATABASES: auth_db,products_db,orders_db
volumes:
- postgres_data:/var/lib/postgresql/data
- ./scripts/init-multiple-dbs.sh:/docker-entrypoint-initdb.d/init.sh
ports:
- "5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U shopwave"]
interval: 10s
timeout: 5s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
api-gateway:
build:
context: services/api-gateway
target: production
ports:
- "8000:8000"
env_file: .env.local
environment:
AUTH_SERVICE_URL: http://auth-service:8001
PRODUCT_SERVICE_URL: http://product-service:8002
ORDER_SERVICE_URL: http://order-service:8003
REDIS_URL: redis://redis:6379/0
depends_on:
- auth-service
- product-service
- order-service
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 15s
timeout: 5s
auth-service:
build:
context: services/auth-service
target: production
ports:
- "8001:8001"
env_file: .env.local
environment:
DATABASE_URL: postgresql+asyncpg://shopwave:shopwave_local@postgres:5432/auth_db
ENVIRONMENT: local
depends_on:
postgres:
condition: service_healthy
product-service:
build:
context: services/product-service
target: production
ports:
- "8002:8002"
env_file: .env.local
environment:
DATABASE_URL: postgresql+asyncpg://shopwave:shopwave_local@postgres:5432/products_db
ENVIRONMENT: local
depends_on:
postgres:
condition: service_healthy
order-service:
build:
context: services/order-service
target: production
ports:
- "8003:8003"
env_file: .env.local
environment:
DATABASE_URL: postgresql+asyncpg://shopwave:shopwave_local@postgres:5432/orders_db
ENVIRONMENT: local
depends_on:
postgres:
condition: service_healthy
volumes:
postgres_data:
SECTION 09
GITHUB SECRETS REFERENCE Every secret the pipelines need – where it comes from, what it does.
Required GitHub Secrets
| Secret Name | Where to get it | Used in | Notes |
| AZURE_CLIENT_ID | terraform output github_actions_client_id | All pipelines | OIDC federated identity – no password |
| AZURE_TENANT_ID | terraform output tenant_id | All pipelines | Your Azure AD tenant |
| AZURE_SUBSCRIPTION_ID | terraform output subscription_id | All pipelines | Target subscription |
| ACR_LOGIN_SERVER | terraform output acr_login_server | CI build + CD deploy | e.g. shopwaveacr.azurecr.io |
| ACR_USERNAME | terraform output -raw acr_admin_username | CI build (docker login) | ACR admin username |
| ACR_PASSWORD | terraform output -raw acr_admin_password | CI build (docker login) | Rotate regularly |
| SLACK_WEBHOOK_URL | Slack app → Incoming Webhooks | All pipelines (notify) | Incoming webhook URL |
| CODECOV_TOKEN | codecov.io project settings | CI test job | Coverage upload token |
SECTION 10
END-TO-END DEVELOPER WORKFLOW From git commit to live production – every step automated.
Complete Developer → Production Journey
1. Developer opens PR: git checkout -b feat/add-wishlist → git push → PR opened (T+0)
2. CI triggers (path-filtered): only product-service CI runs · lint → test → SAST · ~6 min (T+6m)
3. Tests pass · PR reviewable: GitHub status checks green · coverage badge updates · Codecov report (T+7m)
4. Code review + merge to main: reviewer approves · squash merge · CI re-runs on main branch (T+Xh)
5. CI builds + pushes to ACR: multi-stage Docker build · Trivy scan (block on CRITICAL) · push sha-XXXXXXXX (T+Xh+8m)
6. CD staging auto-deploys: helm upgrade --atomic · smoke tests · Slack alert · staging tag created (T+Xh+13m)
7. Manual gate → production: engineer triggers cd-production.yml · reviewer approves · rolling deploy → live (T+Xh+30m)
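The path filtering in step 2 is native GitHub Actions behaviour (each workflow declares `paths:` filters); this hypothetical helper just makes the routing rule explicit:

```python
# Hypothetical helper, not part of ShopWave: makes the monorepo routing
# rule explicit. GitHub Actions implements this natively via per-workflow
# `paths:` filters; a change under services/<name>/ triggers only that
# service's CI, and infrastructure/ changes trigger the terraform pipeline.
def pipelines_for(changed_files: list[str]) -> set[str]:
    triggered = set()
    for path in changed_files:
        parts = path.split("/")
        if parts[0] == "services" and len(parts) > 1:
            triggered.add(f"ci-{parts[1]}")
        elif parts[0] == "infrastructure":
            triggered.add("terraform")
    return triggered

assert pipelines_for(["services/product-service/app/api.py"]) == {"ci-product-service"}
```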
Infra change? terraform.yml: a PR touching infrastructure/terraform/ → plan comment on the PR → apply on dispatch as needed.
SECTION 11
NETWORK POLICIES & ZERO-TRUST KUBERNETES Services can only talk to who they need to.
k8s/network-policies.yaml
# Default deny-all in production namespace
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: shopwave-production
spec:
podSelector: {}
policyTypes: [Ingress, Egress]
# Allow ingress from nginx ingress controller → api-gateway only
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-to-gateway
namespace: shopwave-production
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: api-gateway
policyTypes: [Ingress]
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
# Allow api-gateway → each backend service
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-gateway-to-services
namespace: shopwave-production
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: backend-service
policyTypes: [Ingress]
ingress:
- from:
- podSelector:
matchLabels:
app.kubernetes.io/name: api-gateway
ports:
- protocol: TCP
port: 8001
- protocol: TCP
port: 8002
- protocol: TCP
port: 8003
# Allow all services egress to Azure DB (private endpoint)
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-to-postgres
namespace: shopwave-production
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: backend-service
policyTypes: [Egress]
egress:
- ports:
- protocol: TCP
port: 5432
SECTION 12
SECURITY POSTURE SUMMARY Every layer of the pipeline hardened against common attack vectors.
Security Controls – Layer by Layer
CODE
▸ Semgrep SAST (p/python, p/owasp-top-ten, p/jwt)
▸ Ruff + mypy static analysis on every PR
▸ Dependabot auto-PRs for vulnerable dependencies
IMAGE
▸ Multi-stage build → distroless final image (~15MB)
▸ Trivy scans CRITICAL/HIGH CVEs → blocks pipeline
▸ Cosign image signing → ACR verifies before deploy
INFRA
▸ All Azure resources declared in Terraform → no manual changes
▸ terraform plan diff commented on every infra PR
▸ Remote state in Azure Blob with versioning enabled
RUNTIME
▸ runAsNonRoot: true · readOnlyRootFilesystem: true
▸ capabilities: drop [ALL] → zero Linux capabilities
▸ Azure Network Policy → default deny, explicit allow-list
CREDENTIALS
▸ OIDC Workload Identity → zero static credentials in pipelines
▸ Azure Key Vault → connection strings auto-seeded by Terraform
▸ Secret rotation via Key Vault versioning + ESO 1h refresh
SECTION 13
WHAT YOU NOW HAVE A complete, production-deployable E-Commerce platform.
Every component in this walkthrough is a real, working piece of the ShopWave system. Here is the complete inventory of what was built:
4 Python FastAPI Services
auth, product, order, api-gateway – each with async SQLAlchemy, Alembic migrations, structured logging, Prometheus metrics
7 GitHub Actions Pipelines
4 per-service CI pipelines + staging CD + production CD + terraform.yml – path-filtered, OIDC-authenticated, Slack-notified
Terraform Infrastructure as Code
5 modules – networking, aks, acr, postgres, keyvault – all resources versioned, plan-reviewed on PR, applied through a gated pipeline
End-to-End Security
Semgrep SAST · Trivy image scan · Cosign signing · OIDC no-credential auth · Key Vault secrets auto-seeded by Terraform · zero-trust network policies
Helm Chart + K8s Manifests
Single parameterised chart · rolling update strategy · HPA · PDB · External Secrets · RBAC · Network Policies
Zero-Downtime Deployments
maxUnavailable: 0 rolling updates · --atomic Helm with auto-rollback · smoke tests gate every staging deploy
Full Azure Stack
AKS + ACR + PostgreSQL Flexible (zone-redundant in prod) + Key Vault + private VNet + Azure Monitor + Defender for Containers – all Terraform-managed
Next steps
From here, natural extensions include: switching Terraform to use Terraform Cloud or Atlantis for plan/apply with PR-based GitOps workflows; adding Argo Rollouts for canary deployments with automated metric-based promotion; integrating Azure Application Insights for distributed tracing across all four services; adding a contract testing job to each CI pipeline using Pact; and wiring Azure Cost Management alerts into the pipeline notifications.
ShopWave · Python FastAPI · Azure AKS · Terraform IaC · GitHub Actions · Production-Grade CI/CD · February 2026