Published on

ShopWave: Production-Grade E-Commerce CI/CD on Azure

Authors
Azure AKSGitHub ActionsFastAPI Β· PythonTerraform IaCProduction GradeE-Commerce

SHOPWAVE

A complete, production-grade E-Commerce platform built with Python FastAPI microservices, deployed on Azure Kubernetes Service via GitHub Actions β€” infrastructure provisioned entirely with Terraform. From zero to live, nothing skipped.

πŸ“… February 2026⏱ 40 min read🐍 Python 3.12 + FastAPI☁️ Azure AKS + ACR + PostgreSQLπŸ—οΈ Terraform 1.7+βš™οΈ GitHub Actions

This is not a tutorial with placeholder code. Every file, pipeline, and configuration in this walkthrough is real, production-ready, and battle-tested. By the end you will have a complete monorepo containing four Python FastAPI microservices, a full Azure infrastructure stack provisioned with Terraform, and GitHub Actions pipelines that take code from a developer's laptop to a live Kubernetes cluster β€” with security scanning, automated testing, canary deployments, and instant rollback capability built in.

We are building ShopWave β€” a modern e-commerce backend with an API Gateway, Auth Service, Product Service, and Order Service. Each service is independently deployable, containerised, and operated through a fully automated CI/CD pipeline. The entire Azure infrastructure β€” AKS cluster, Container Registry, PostgreSQL databases, Key Vault, private networking, and GitHub OIDC federation β€” is declared as Terraform code and applied through its own dedicated pipeline.

ShopWave β€” Full System Architecture
CLIENTHTTPSAPI GATEWAY:8000Rate limit Β· Auth Β· Routenginx ingress β†’ FastAPIAUTH SERVICE:8001 JWT Β· OAuth2Azure AD Β· bcryptFastAPI + PostgreSQLPRODUCT SERVICE:8002 Catalog Β· SearchInventory Β· PricingFastAPI + PostgreSQLORDER SERVICE:8003 Cart Β· CheckoutPayment Β· FulfilmentFastAPI + PostgreSQLAZURE DBPostgreSQL Flexibleauth_dbproducts_dborders_dbSSL Β· Private endpointKEY VAULTDB passwordsJWT secretsAPI keysAZURE ACRContainer RegistryImage storageAZURE AKS3-node clusterStandard_D2s_v3staging + prod nsGH ACTIONSCI Β· CD Β· Terraform7 pipelinesAZURE MONITORPrometheus Β· GrafanaApp Insights Β· Alerts
api-gateway:8000

Single entry point. Routes requests, enforces rate limits, validates JWT tokens from auth-service.

GET /health
ANY /api/v1/*
auth-service:8001

User registration, login, JWT issuance, Azure AD SSO integration, password hashing with bcrypt.

POST /auth/register
POST /auth/login
POST /auth/refresh
product-service:8002

Product catalog, search, inventory management, category hierarchy, image metadata.

GET /products
GET /products/{id}
POST /products
order-service:8003

Cart management, checkout flow, order lifecycle, payment intent creation, fulfilment tracking.

POST /orders
GET /orders/{id}
PATCH /orders/{id}/status
SECTION 01

PROJECT STRUCTURE & MONOREPO LAYOUT

Every file, every directory β€” nothing hand-wavy.

ShopWave lives in a single monorepo. GitHub Actions uses path-based filtering to detect which services changed and runs only the relevant pipelines β€” so deploying a hotfix to order-service does not trigger a rebuild of auth-service. A separate terraform.yml pipeline manages all Azure infrastructure changes, keeping application and infrastructure deployments completely independent.

shopwave/
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”‚       β”œβ”€β”€ ci-auth-service.yml        ← CI for auth-service (PR + merge)
β”‚       β”œβ”€β”€ ci-product-service.yml     ← CI for product-service
β”‚       β”œβ”€β”€ ci-order-service.yml       ← CI for order-service
β”‚       β”œβ”€β”€ ci-api-gateway.yml         ← CI for api-gateway
β”‚       β”œβ”€β”€ cd-staging.yml             ← Deploy ALL changed services β†’ staging
β”‚       β”œβ”€β”€ cd-production.yml          ← Promote staging β†’ production (manual gate)
β”‚       └── terraform.yml              ← Infrastructure CI/CD (plan on PR Β· apply on dispatch)
β”‚
β”œβ”€β”€ services/
β”‚   β”œβ”€β”€ api-gateway/
β”‚   β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”‚   β”œβ”€β”€ main.py                ← FastAPI app entrypoint
β”‚   β”‚   β”‚   β”œβ”€β”€ routes/
β”‚   β”‚   β”‚   β”‚   └── proxy.py           ← Reverse proxy logic
β”‚   β”‚   β”‚   β”œβ”€β”€ middleware/
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ auth.py            ← JWT validation middleware
β”‚   β”‚   β”‚   β”‚   └── rate_limit.py      ← Redis-backed rate limiter
β”‚   β”‚   β”‚   └── config.py              ← Pydantic settings
β”‚   β”‚   β”œβ”€β”€ tests/
β”‚   β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”‚   β”œβ”€β”€ requirements.txt
β”‚   β”‚   └── pyproject.toml
β”‚   β”‚
β”‚   β”œβ”€β”€ auth-service/
β”‚   β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”‚   β”œβ”€β”€ main.py
β”‚   β”‚   β”‚   β”œβ”€β”€ api/v1/
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ auth.py            ← Login / register / refresh
β”‚   β”‚   β”‚   β”‚   └── users.py           ← User CRUD
β”‚   β”‚   β”‚   β”œβ”€β”€ core/
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ security.py        ← JWT + bcrypt
β”‚   β”‚   β”‚   β”‚   └── config.py
β”‚   β”‚   β”‚   β”œβ”€β”€ db/
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ models.py          ← SQLAlchemy ORM models
β”‚   β”‚   β”‚   β”‚   └── session.py         ← Async DB session
β”‚   β”‚   β”‚   └── schemas/auth.py        ← Pydantic schemas
β”‚   β”‚   β”œβ”€β”€ alembic/versions/
β”‚   β”‚   β”‚   └── 001_init.py
β”‚   β”‚   β”œβ”€β”€ tests/
β”‚   β”‚   β”œβ”€β”€ Dockerfile
β”‚   β”‚   β”œβ”€β”€ requirements.txt
β”‚   β”‚   └── pyproject.toml
β”‚   β”‚
β”‚   β”œβ”€β”€ product-service/               ← same structure as auth-service
β”‚   └── order-service/                 ← same structure as auth-service
β”‚
β”œβ”€β”€ infrastructure/
β”‚   └── terraform/
β”‚       β”œβ”€β”€ main.tf                    ← Root: wires all modules + OIDC federation
β”‚       β”œβ”€β”€ variables.tf               ← All input variables with validation
β”‚       β”œβ”€β”€ outputs.tf                 ← GitHub Secrets summary + resource IDs
β”‚       β”œβ”€β”€ terraform.tfvars.example   ← Copy β†’ terraform.tfvars to configure
β”‚       β”œβ”€β”€ modules/
β”‚       β”‚   β”œβ”€β”€ networking/            ← VNet Β· subnets Β· NSGs Β· private DNS zone
β”‚       β”‚   β”œβ”€β”€ aks/                   ← AKS cluster Β· node pools Β· Log Analytics
β”‚       β”‚   β”œβ”€β”€ acr/                   ← Azure Container Registry
β”‚       β”‚   β”œβ”€β”€ postgres/              ← PostgreSQL Flexible Server Β· databases
β”‚       β”‚   └── keyvault/              ← Azure Key Vault Β· seeds JWT + origins
β”‚       β”œβ”€β”€ environments/
β”‚       β”‚   β”œβ”€β”€ staging/staging.tfvars
β”‚       β”‚   └── production/production.tfvars
β”‚       └── scripts/
β”‚           └── bootstrap-tfstate.sh   ← One-time: creates remote state backend
β”‚
β”œβ”€β”€ helm/
β”‚   └── shopwave/
β”‚       β”œβ”€β”€ Chart.yaml
β”‚       β”œβ”€β”€ values.yaml                ← default values
β”‚       β”œβ”€β”€ values-staging.yaml        ← staging overrides
β”‚       β”œβ”€β”€ values-production.yaml     ← production overrides
β”‚       └── templates/
β”‚           β”œβ”€β”€ deployment.yaml
β”‚           β”œβ”€β”€ service.yaml
β”‚           β”œβ”€β”€ ingress.yaml
β”‚           β”œβ”€β”€ hpa.yaml               ← Horizontal Pod Autoscaler
β”‚           β”œβ”€β”€ pdb.yaml               ← Pod Disruption Budget
β”‚           └── externalsecret.yaml    ← Azure Key Vault secret injection
β”‚
β”œβ”€β”€ k8s/
β”‚   β”œβ”€β”€ network-policies.yaml
β”‚   └── rbac.yaml
β”‚
β”œβ”€β”€ scripts/
β”‚   └── init-databases.sql             ← Local dev: create all three databases
β”‚
β”œβ”€β”€ docker-compose.yml                 ← Local development
β”œβ”€β”€ .env.example
└── README.md
SECTION 02

THE PYTHON FASTAPI MICROSERVICES

Real application code β€” auth, products, orders, and gateway.

Each service follows the same internal structure: a FastAPI application with async SQLAlchemy for database access, Pydantic v2 for data validation, Alembic for migrations, and structured JSON logging. Here is the auth-service in full, which establishes the pattern every other service follows.

auth-service / app / main.py

services/auth-service/app/main.pypython
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import structlog

from app.api.v1 import auth, users
from app.core.config import settings
from app.db.session import engine, Base

log = structlog.get_logger()

@asynccontextmanager
async def lifespan(app: FastAPI):
  log.info("auth_service.startup", env=settings.ENVIRONMENT)
  async with engine.begin() as conn:
      await conn.run_sync(Base.metadata.create_all)
  yield
  log.info("auth_service.shutdown")

app = FastAPI(
  title="ShopWave Auth Service",
  version="1.0.0",
  docs_url="/docs" if settings.ENVIRONMENT != "production" else None,
  lifespan=lifespan,
)

app.add_middleware(
  CORSMiddleware,
  allow_origins=settings.ALLOWED_ORIGINS,
  allow_credentials=True,
  allow_methods=["*"],
  allow_headers=["*"],
)

app.include_router(auth.router,  prefix="/api/v1/auth",  tags=["auth"])
app.include_router(users.router, prefix="/api/v1/users", tags=["users"])

@app.get("/health", tags=["ops"])
async def health():
  return {"status": "ok", "service": "auth-service",
          "version": "1.0.0", "env": settings.ENVIRONMENT}
services/auth-service/app/core/security.pypython
from datetime import datetime, timedelta, timezone
from typing import Any
import bcrypt
from jose import jwt, JWTError
from fastapi import HTTPException, status
from app.core.config import settings

ALGORITHM = "HS256"

def hash_password(password: str) -> str:
  return bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode()

def verify_password(plain: str, hashed: str) -> bool:
  return bcrypt.checkpw(plain.encode(), hashed.encode())

def create_access_token(subject: Any, expires_delta: timedelta | None = None) -> str:
  expire = datetime.now(timezone.utc) + (
      expires_delta or timedelta(minutes=settings.ACCESS_TOKEN_EXPIRE_MINUTES)
  )
  payload = {"sub": str(subject), "exp": expire, "type": "access"}
  return jwt.encode(payload, settings.SECRET_KEY, algorithm=ALGORITHM)

def create_refresh_token(subject: Any) -> str:
  expire = datetime.now(timezone.utc) + timedelta(days=settings.REFRESH_TOKEN_EXPIRE_DAYS)
  payload = {"sub": str(subject), "exp": expire, "type": "refresh"}
  return jwt.encode(payload, settings.SECRET_KEY, algorithm=ALGORITHM)

def decode_token(token: str) -> dict:
  try:
      return jwt.decode(token, settings.SECRET_KEY, algorithms=[ALGORITHM])
  except JWTError:
      raise HTTPException(
          status_code=status.HTTP_401_UNAUTHORIZED,
          detail="Invalid or expired token",
          headers={"WWW-Authenticate": "Bearer"},
      )
services/auth-service/app/api/v1/auth.pypython
from fastapi import APIRouter, Depends, HTTPException, status
from sqlalchemy.ext.asyncio import AsyncSession
import structlog

from app.db.session import get_db
from app.db.models import User
from app.core.security import (
  hash_password, verify_password,
  create_access_token, create_refresh_token, decode_token
)
from app.schemas.auth import (
  RegisterRequest, LoginRequest, TokenResponse, RefreshRequest
)

router = APIRouter()
log = structlog.get_logger()

@router.post("/register", response_model=TokenResponse, status_code=201)
async def register(body: RegisterRequest, db: AsyncSession = Depends(get_db)):
  existing = await User.get_by_email(db, body.email)
  if existing:
      raise HTTPException(status_code=400, detail="Email already registered")

  user = User(
      email=body.email,
      full_name=body.full_name,
      hashed_password=hash_password(body.password),
  )
  db.add(user)
  await db.commit()
  await db.refresh(user)

  log.info("user.registered", user_id=str(user.id), email=user.email)
  return TokenResponse(
      access_token=create_access_token(user.id),
      refresh_token=create_refresh_token(user.id),
      token_type="bearer",
  )

@router.post("/login", response_model=TokenResponse)
async def login(body: LoginRequest, db: AsyncSession = Depends(get_db)):
  user = await User.get_by_email(db, body.email)
  if not user or not verify_password(body.password, user.hashed_password):
      raise HTTPException(status_code=401, detail="Invalid credentials")
  if not user.is_active:
      raise HTTPException(status_code=403, detail="Account disabled")

  log.info("user.login", user_id=str(user.id))
  return TokenResponse(
      access_token=create_access_token(user.id),
      refresh_token=create_refresh_token(user.id),
      token_type="bearer",
  )

@router.post("/refresh", response_model=TokenResponse)
async def refresh(body: RefreshRequest):
  payload = decode_token(body.refresh_token)
  if payload.get("type") != "refresh":
      raise HTTPException(status_code=401, detail="Invalid token type")
  user_id = payload["sub"]
  return TokenResponse(
      access_token=create_access_token(user_id),
      refresh_token=create_refresh_token(user_id),
      token_type="bearer",
  )
services/auth-service/app/db/models.pypython
import uuid
from datetime import datetime, timezone
from sqlalchemy import String, Boolean, DateTime
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select

class Base(DeclarativeBase):
  pass

class User(Base):
  __tablename__ = "users"

  id:               Mapped[uuid.UUID]  = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
  email:            Mapped[str]        = mapped_column(String(255), unique=True, nullable=False, index=True)
  full_name:        Mapped[str]        = mapped_column(String(255), nullable=False)
  hashed_password:  Mapped[str]        = mapped_column(String(255), nullable=False)
  is_active:        Mapped[bool]       = mapped_column(Boolean, default=True)
  is_superuser:     Mapped[bool]       = mapped_column(Boolean, default=False)
  created_at:       Mapped[datetime]   = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc))
  updated_at:       Mapped[datetime]   = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), onupdate=lambda: datetime.now(timezone.utc))

  @classmethod
  async def get_by_email(cls, db: AsyncSession, email: str) -> "User | None":
      result = await db.execute(select(cls).where(cls.email == email))
      return result.scalar_one_or_none()
services/auth-service/Dockerfiledockerfile
# ── Stage 1: build & test ──────────────────────────────
FROM python:3.12-slim AS builder

WORKDIR /build
COPY requirements.txt .

RUN pip install --upgrade pip && pip install --no-cache-dir --prefix=/install -r requirements.txt

# ── Stage 2: production image ─────────────────────────
FROM python:3.12-slim AS production

# Security: non-root user
RUN groupadd -r shopwave && useradd -r -g shopwave -s /sbin/nologin shopwave

WORKDIR /app

# Copy installed dependencies from builder
COPY --from=builder /install /usr/local

# Copy application source
COPY app/ ./app/
COPY alembic/ ./alembic/
COPY alembic.ini .

# Set ownership
RUN chown -R shopwave:shopwave /app

USER shopwave

EXPOSE 8001

# Structured JSON logging, single worker per pod (HPA handles scale)
CMD ["uvicorn", "app.main:app",    "--host", "0.0.0.0",    "--port", "8001",    "--workers", "1",    "--log-config", "/dev/null",    "--access-log"]
services/auth-service/requirements.txttext
# Web framework
fastapi==0.115.0
uvicorn[standard]==0.32.0
httpx==0.27.2

# Database
sqlalchemy[asyncio]==2.0.36
asyncpg==0.30.0
alembic==1.13.3

# Auth & security
python-jose[cryptography]==3.3.0
bcrypt==4.2.0
passlib==1.7.4

# Config & validation
pydantic==2.9.2
pydantic-settings==2.6.1

# Observability
structlog==24.4.0
opentelemetry-sdk==1.27.0
opentelemetry-instrumentation-fastapi==0.48b0
prometheus-fastapi-instrumentator==7.0.0

# Testing
pytest==8.3.3
pytest-asyncio==0.24.0
pytest-cov==5.0.0
httpx==0.27.2
SECTION 03

GITHUB ACTIONS β€” CI PIPELINES

Per-service pipelines with lint, test, security scan, and image push to ACR.

Each service has its own CI workflow triggered by pull requests and pushes to main. The paths filter ensures only the relevant service pipeline runs when its code changes. Every pipeline follows the same five-stage flow.

CI Pipeline Flow β€” Per Service
β‘  CHECKOUTactions/checkout@v4β‘‘ LINT + FORMATruff Β· black Β· mypyβ‘’ UNIT TESTSpytest --cov β‰₯80%β‘£ SECURITY SCANTrivy + Semgrepβ‘€ BUILD + PUSHACR Β· sha tagFails at any stage β†’ PR blocked Β· Slack alert Β· No image pushed

CI Pipeline β€” auth-service

.github/workflows/ci-auth-service.ymlyaml
name: CI β€” auth-service

on:
push:
  branches: [main, develop]
  paths:
    - "services/auth-service/**"
    - ".github/workflows/ci-auth-service.yml"
pull_request:
  branches: [main]
  paths:
    - "services/auth-service/**"

env:
SERVICE:          auth-service
IMAGE_NAME:       shopwave/auth-service
PYTHON_VERSION:   "3.12"
REGISTRY:         ${{ secrets.ACR_LOGIN_SERVER }}

defaults:
run:
  working-directory: services/auth-service

jobs:
# ──────────────────────────────────────────────
# JOB 1: Lint & Static Analysis
# ──────────────────────────────────────────────
lint:
  name: "Lint & Type Check"
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4

    - uses: actions/setup-python@v5
      with:
        python-version: ${{ env.PYTHON_VERSION }}
        cache: pip
        cache-dependency-path: services/auth-service/requirements.txt

    - name: Install dev dependencies
      run: pip install ruff black mypy

    - name: Ruff lint
      run: ruff check app/ tests/

    - name: Black format check
      run: black --check app/ tests/

    - name: Mypy type check
      run: mypy app/ --ignore-missing-imports

# ──────────────────────────────────────────────
# JOB 2: Unit & Integration Tests
# ──────────────────────────────────────────────
test:
  name: "Tests (Python ${{ matrix.python-version }})"
  runs-on: ubuntu-latest
  needs: lint

  strategy:
    matrix:
      python-version: ["3.11", "3.12"]

  services:
    postgres:
      image: postgres:16-alpine
      env:
        POSTGRES_USER:     shopwave_test
        POSTGRES_PASSWORD: shopwave_test
        POSTGRES_DB:       auth_test
      ports:
        - 5432:5432
      options: >-
        --health-cmd pg_isready
        --health-interval 10s
        --health-timeout 5s
        --health-retries 5

  env:
    DATABASE_URL:      postgresql+asyncpg://shopwave_test:shopwave_test@localhost:5432/auth_test
    SECRET_KEY:        ci-test-secret-key-not-for-production
    ENVIRONMENT:       test
    ACCESS_TOKEN_EXPIRE_MINUTES: 30
    REFRESH_TOKEN_EXPIRE_DAYS:   7

  steps:
    - uses: actions/checkout@v4

    - uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
        cache: pip
        cache-dependency-path: services/auth-service/requirements.txt

    - name: Install dependencies
      run: pip install -r requirements.txt

    - name: Run Alembic migrations (test DB)
      run: alembic upgrade head

    - name: Run tests with coverage
      run: |
        pytest tests/           --cov=app           --cov-report=xml           --cov-report=term-missing           --cov-fail-under=80           -v

    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v4
      with:
        token: ${{ secrets.CODECOV_TOKEN }}
        file: ./coverage.xml
        flags: auth-service

# ──────────────────────────────────────────────
# JOB 3: SAST β€” Semgrep
# ──────────────────────────────────────────────
sast:
  name: "SAST β€” Semgrep"
  runs-on: ubuntu-latest
  needs: lint
  permissions:
    security-events: write

  steps:
    - uses: actions/checkout@v4

    - name: Run Semgrep
      uses: semgrep/semgrep-action@v1
      with:
        config: >-
          p/python
          p/owasp-top-ten
          p/jwt
          p/sql-injection
        generateSarif: "1"

    - name: Upload SARIF to GitHub Security
      uses: github/codeql-action/upload-sarif@v3
      with:
        sarif_file: semgrep.sarif

# ──────────────────────────────────────────────
# JOB 4: Build Docker Image & Trivy Scan
# ──────────────────────────────────────────────
build-and-scan:
  name: "Build Image & Security Scan"
  runs-on: ubuntu-latest
  needs: [test, sast]
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'

  outputs:
    image-tag:    ${{ steps.meta.outputs.sha-tag }}
    image-digest: ${{ steps.push.outputs.digest }}

  steps:
    - uses: actions/checkout@v4

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Log in to Azure Container Registry
      uses: azure/docker-login@v2
      with:
        login-server: ${{ env.REGISTRY }}
        username:     ${{ secrets.ACR_USERNAME }}
        password:     ${{ secrets.ACR_PASSWORD }}

    - name: Generate image metadata
      id: meta
      run: |
        SHA_SHORT=$(echo ${{ github.sha }} | cut -c1-8)
        echo "sha-tag=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${SHA_SHORT}" >> $GITHUB_OUTPUT
        echo "latest-tag=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest" >> $GITHUB_OUTPUT

    - name: Build Docker image (multi-stage)
      uses: docker/build-push-action@v6
      with:
        context: services/auth-service
        file:    services/auth-service/Dockerfile
        push:    false
        load:    true
        tags:    ${{ steps.meta.outputs.sha-tag }}
        cache-from: type=gha
        cache-to:   type=gha,mode=max
        build-args: |
          BUILD_SHA=${{ github.sha }}
          BUILD_DATE=${{ github.event.head_commit.timestamp }}

    - name: Trivy β€” scan image for CVEs
      uses: aquasecurity/trivy-action@master
      with:
        image-ref:    ${{ steps.meta.outputs.sha-tag }}
        format:       sarif
        output:       trivy-results.sarif
        severity:     CRITICAL,HIGH
        exit-code:    "1"          # Fail pipeline on CRITICAL/HIGH with fix available
        ignore-unfixed: true

    - name: Upload Trivy SARIF to GitHub Security
      uses: github/codeql-action/upload-sarif@v3
      if:   always()
      with:
        sarif_file: trivy-results.sarif

    - name: Push image to ACR (only after clean scan)
      id: push
      uses: docker/build-push-action@v6
      with:
        context: services/auth-service
        file:    services/auth-service/Dockerfile
        push:    true
        tags: |
          ${{ steps.meta.outputs.sha-tag }}
          ${{ steps.meta.outputs.latest-tag }}
        cache-from: type=gha
        cache-to:   type=gha,mode=max

    - name: Sign image with Cosign
      uses: sigstore/cosign-installer@v3
      then: cosign sign --yes ${{ steps.meta.outputs.sha-tag }}@${{ steps.push.outputs.digest }}

# ──────────────────────────────────────────────
# JOB 5: Notify on failure
# ──────────────────────────────────────────────
notify-failure:
  name: "Notify on Failure"
  runs-on: ubuntu-latest
  needs: [lint, test, sast, build-and-scan]
  if: failure()
  steps:
    - name: Slack alert
      uses: slackapi/slack-github-action@v1
      with:
        payload: |
          {
            "text": "❌ CI FAILED: *auth-service* on branch `${{ github.ref_name }}`",
            "blocks": [{
              "type": "section",
              "text": { "type": "mrkdwn",
                "text": "❌ *auth-service CI failed*
Branch: `${{ github.ref_name }}`
Commit: `${{ github.sha }}`
<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>"
              }
            }]
          }
      env:
        SLACK_WEBHOOK_URL:  ${{ secrets.SLACK_WEBHOOK_URL }}
        SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
πŸ’‘ Path-based CI β€” other services

The CI pipelines for product-service, order-service, and api-gateway are identical in structure β€” only the paths filter, SERVICE env var, IMAGE_NAME, exposed port, and any service-specific test environment variables change. This pattern means a single-service hotfix triggers exactly one CI pipeline run, not four.

SECTION 04

GITHUB ACTIONS β€” CD STAGING PIPELINE

Automatic deploy to AKS staging on every merge to main.

When any service's CI pipeline completes successfully on main, the staging CD pipeline automatically deploys the new image to the shopwave-staging namespace in AKS. It uses Helm with per-environment values files, runs smoke tests post-deploy, and rolls back automatically if the smoke tests fail.

CD Staging Flow β€” Triggered after CI success on main
AZ LOGINOIDC FederatedKUBECONFIGGet AKS credsHELM UPGRADE--atomic --waitSMOKE TESTShttpx health checkPASS βœ“Notify + tag readyAUTO ROLLBACKon smoke fail
.github/workflows/cd-staging.ymlyaml
name: CD β€” Deploy to Staging

on:
workflow_run:
  workflows:
    - "CI β€” auth-service"
    - "CI β€” product-service"
    - "CI β€” order-service"
    - "CI β€” api-gateway"
  types: [completed]
  branches: [main]

permissions:
id-token: write    # Required for OIDC Azure login
contents: read

env:
RESOURCE_GROUP:   shopwave-rg
AKS_CLUSTER:      shopwave-aks
NAMESPACE:        shopwave-staging
REGISTRY:         ${{ secrets.ACR_LOGIN_SERVER }}
HELM_CHART:       ./helm/shopwave

jobs:
deploy-staging:
  name: "Deploy ${{ github.event.workflow_run.name }} β†’ Staging"
  runs-on: ubuntu-latest
  if: github.event.workflow_run.conclusion == 'success'

  environment:
    name: staging
    url:  https://staging.shopwave.io

  steps:
    - uses: actions/checkout@v4

    # ── Azure OIDC login (no stored credentials) ──
    - name: Login to Azure via OIDC
      uses: azure/login@v2
      with:
        client-id:       ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id:       ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    # ── Get AKS kubeconfig ──
    - name: Get AKS credentials
      uses: azure/aks-set-context@v4
      with:
        resource-group:  ${{ env.RESOURCE_GROUP }}
        cluster-name:    ${{ env.AKS_CLUSTER }}

    # ── Resolve which service changed + its new image tag ──
    - name: Resolve service and image tag
      id: resolve
      run: |
        WORKFLOW="${{ github.event.workflow_run.name }}"
        SHA=$(echo ${{ github.event.workflow_run.head_sha }} | cut -c1-8)

        case "$WORKFLOW" in
          "CI β€” auth-service")    SVC="auth-service";    PORT=8001 ;;
          "CI β€” product-service") SVC="product-service"; PORT=8002 ;;
          "CI β€” order-service")   SVC="order-service";   PORT=8003 ;;
          "CI β€” api-gateway")     SVC="api-gateway";     PORT=8000 ;;
        esac

        IMAGE="${{ env.REGISTRY }}/shopwave/${SVC}:sha-${SHA}"
        echo "service=${SVC}"   >> $GITHUB_OUTPUT
        echo "image=${IMAGE}"   >> $GITHUB_OUTPUT
        echo "port=${PORT}"     >> $GITHUB_OUTPUT
        echo "sha=sha-${SHA}"   >> $GITHUB_OUTPUT

    # ── Install / Upgrade Helm release ──
    - name: Setup Helm
      uses: azure/setup-helm@v4
      with:
        version: "3.16.0"

    - name: Helm upgrade β€” ${{ steps.resolve.outputs.service }}
      id: helm
      run: |
        helm upgrade --install           ${{ steps.resolve.outputs.service }}           ${{ env.HELM_CHART }}           --namespace ${{ env.NAMESPACE }}           --create-namespace           --values helm/shopwave/values.yaml           --values helm/shopwave/values-staging.yaml           --set image.repository=${{ env.REGISTRY }}/shopwave/${{ steps.resolve.outputs.service }}           --set image.tag=${{ steps.resolve.outputs.sha }}           --set service.port=${{ steps.resolve.outputs.port }}           --atomic           --timeout 5m           --wait           --history-max 10

    # ── Post-deploy smoke tests ──
    - name: Smoke test β€” health endpoint
      id: smoke
      run: |
        SVC="${{ steps.resolve.outputs.service }}"
        PORT="${{ steps.resolve.outputs.port }}"
        BASE="https://staging.shopwave.io/${SVC}"

        echo "Running smoke tests for ${SVC}..."

        # Health check with retry
        for i in $(seq 1 10); do
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" "${BASE}/health"                    -H "Host: staging.shopwave.io" --max-time 5)
          if [ "$STATUS" = "200" ]; then
            echo "βœ… Health check passed (${i}/10)"
            exit 0
          fi
          echo "Attempt ${i}/10 β€” status: ${STATUS}, retrying in 10s..."
          sleep 10
        done

        echo "❌ Smoke test failed after 10 attempts"
        exit 1

    # ── Auto-rollback on smoke test failure ──
    - name: Rollback on smoke failure
      if: failure() && steps.smoke.outcome == 'failure'
      run: |
        echo "⚠️ Rolling back ${{ steps.resolve.outputs.service }}..."
        helm rollback ${{ steps.resolve.outputs.service }}           --namespace ${{ env.NAMESPACE }}           --wait --timeout 3m

    # ── Tag commit as staging-ready ──
    - name: Tag commit as staging-ready
      if: success()
      run: |
        git tag "staging-${{ steps.resolve.outputs.service }}-${{ steps.resolve.outputs.sha }}"           ${{ github.event.workflow_run.head_sha }}
        git push origin "staging-${{ steps.resolve.outputs.service }}-${{ steps.resolve.outputs.sha }}"

    # ── Slack notification ──
    - name: Notify Slack
      if: always()
      uses: slackapi/slack-github-action@v1
      with:
        payload: |
          {
            "text": "${{ job.status == 'success' && 'βœ…' || '❌' }} Staging deploy: *${{ steps.resolve.outputs.service }}* β†’ `${{ steps.resolve.outputs.sha }}`",
            "blocks": [{
              "type": "section",
              "text": { "type": "mrkdwn",
                "text": "${{ job.status == 'success' && 'βœ… *Staging deploy succeeded*' || '❌ *Staging deploy FAILED*' }}
Service: `${{ steps.resolve.outputs.service }}`
Image: `${{ steps.resolve.outputs.sha }}`
Environment: *staging*"
              }
            }]
          }
      env:
        SLACK_WEBHOOK_URL:  ${{ secrets.SLACK_WEBHOOK_URL }}
        SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
SECTION 05

GITHUB ACTIONS β€” CD PRODUCTION PIPELINE

Manual approval gate, rolling deploy, automatic rollback.

Production deployments require a manual approval from a designated reviewer in the GitHub production environment. Once approved, the same immutable image SHA that passed staging is deployed β€” never rebuilt. The pipeline uses a rolling update strategy with a Pod Disruption Budget ensuring at least one replica is always available.

CD Production Flow β€” Manual Gate + Rolling Deploy
TRIGGERworkflow_dispatchor staging success+ image SHA input⏸ MANUAL APPROVALGitHub EnvironmentProtection RulesRequired reviewers β‰₯1Wait up to 7 daysPRE-FLIGHTVerify image existsCheck staging tagValidate SHAHELM UPGRADERolling updatemaxSurge: 1maxUnavailable: 0VERIFY + TAGProduction smokeGit release tagNotify stakeholdershelm rollback (auto on --atomic fail)
.github/workflows/cd-production.ymlyaml
name: CD β€” Deploy to Production

on:
workflow_dispatch:
  inputs:
    service:
      description: "Service to deploy"
      required: true
      type: choice
      options: [auth-service, product-service, order-service, api-gateway, all]
    image_tag:
      description: "Image SHA tag (e.g. sha-a3f7c2b1)"
      required: true
      type: string
    reason:
      description: "Deployment reason (for audit log)"
      required: true
      type: string

permissions:
id-token:    write
contents:    write
deployments: write

env:
RESOURCE_GROUP:  shopwave-rg
AKS_CLUSTER:     shopwave-aks
NAMESPACE:       shopwave-production
REGISTRY:        ${{ secrets.ACR_LOGIN_SERVER }}
HELM_CHART:      ./helm/shopwave

jobs:
pre-flight:
  name: "Pre-flight Checks"
  runs-on: ubuntu-latest
  outputs:
    approved: ${{ steps.checks.outputs.approved }}

  steps:
    - uses: actions/checkout@v4

    - name: Login to Azure via OIDC
      uses: azure/login@v2
      with:
        client-id:       ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id:       ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    - name: Verify image exists in ACR
      id: checks
      run: |
        SERVICE="${{ github.event.inputs.service }}"
        TAG="${{ github.event.inputs.image_tag }}"

        if [ "$SERVICE" = "all" ]; then
          SERVICES="auth-service product-service order-service api-gateway"
        else
          SERVICES="$SERVICE"
        fi

        for SVC in $SERVICES; do
          IMAGE="${{ env.REGISTRY }}/shopwave/${SVC}:${TAG}"
          echo "Checking ${IMAGE}..."
          az acr repository show-tags             --name $(echo ${{ env.REGISTRY }} | cut -d. -f1)             --repository "shopwave/${SVC}"             --query "[?@=='${TAG}']"             --output tsv | grep -q "${TAG}"             || { echo "❌ Image ${IMAGE} not found in ACR!"; exit 1; }
          echo "βœ… Image verified: ${IMAGE}"
        done
        echo "approved=true" >> $GITHUB_OUTPUT

deploy-production:
  name: "Deploy to Production"
  runs-on: ubuntu-latest
  needs: pre-flight
  if: needs.pre-flight.outputs.approved == 'true'

  environment:
    name: production
    url:  https://shopwave.io

  steps:
    - uses: actions/checkout@v4

    - name: Audit log β€” deployment started
      run: |
        echo "## Production Deployment" >> $GITHUB_STEP_SUMMARY
        echo "| Field | Value |" >> $GITHUB_STEP_SUMMARY
        echo "|---|---|" >> $GITHUB_STEP_SUMMARY
        echo "| Service | `${{ github.event.inputs.service }}` |" >> $GITHUB_STEP_SUMMARY
        echo "| Image Tag | `${{ github.event.inputs.image_tag }}` |" >> $GITHUB_STEP_SUMMARY
        echo "| Deployed by | @${{ github.actor }} |" >> $GITHUB_STEP_SUMMARY
        echo "| Reason | ${{ github.event.inputs.reason }} |" >> $GITHUB_STEP_SUMMARY
        echo "| Timestamp | $(date -u +%Y-%m-%dT%H:%M:%SZ) |" >> $GITHUB_STEP_SUMMARY

    - name: Login to Azure via OIDC
      uses: azure/login@v2
      with:
        client-id:       ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id:       ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    - name: Get AKS credentials
      uses: azure/aks-set-context@v4
      with:
        resource-group: ${{ env.RESOURCE_GROUP }}
        cluster-name:   ${{ env.AKS_CLUSTER }}

    - name: Setup Helm
      uses: azure/setup-helm@v4
      with:
        version: "3.16.0"

    - name: Resolve services to deploy
      id: services
      run: |
        if [ "${{ github.event.inputs.service }}" = "all" ]; then
          echo "list=auth-service product-service order-service api-gateway" >> $GITHUB_OUTPUT
        else
          echo "list=${{ github.event.inputs.service }}" >> $GITHUB_OUTPUT
        fi

    - name: Helm upgrade β€” production (rolling update)
      id: deploy
      run: |
        TAG="${{ github.event.inputs.image_tag }}"
        for SVC in ${{ steps.services.outputs.list }}; do
          PORT=$(case "$SVC" in
            api-gateway)     echo 8000 ;;
            auth-service)    echo 8001 ;;
            product-service) echo 8002 ;;
            order-service)   echo 8003 ;;
          esac)

          echo "πŸš€ Deploying ${SVC}:${TAG} to production..."

          helm upgrade --install "${SVC}"             ${{ env.HELM_CHART }}             --namespace ${{ env.NAMESPACE }}             --create-namespace             --values helm/shopwave/values.yaml             --values helm/shopwave/values-production.yaml             --set image.repository=${{ env.REGISTRY }}/shopwave/${SVC}             --set image.tag=${TAG}             --set service.port=${PORT}             --atomic             --timeout 8m             --wait             --history-max 5

          echo "βœ… ${SVC} deployed successfully"
        done

    # ── Production smoke tests ──
    - name: Production smoke tests
      id: smoke
      run: |
        BASE="https://shopwave.io"
        ENDPOINTS=(
          "${BASE}/api/v1/health"
          "${BASE}/auth/health"
          "${BASE}/products/health"
          "${BASE}/orders/health"
        )

        ALL_PASS=true
        for URL in "${ENDPOINTS[@]}"; do
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$URL" --max-time 10)
          if [ "$STATUS" = "200" ]; then
            echo "βœ… $URL β†’ $STATUS"
          else
            echo "❌ $URL β†’ $STATUS"
            ALL_PASS=false
          fi
        done

        [ "$ALL_PASS" = "true" ] || exit 1

    # ── Create GitHub Release ──
    - name: Create GitHub Release
      if: success()
      uses: actions/create-release@v1
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      with:
        tag_name:     "prod-${{ github.event.inputs.image_tag }}-${{ github.run_number }}"
        release_name: "Production Release β€” ${{ github.event.inputs.image_tag }}"
        body: |
          ## Production Deployment

          **Service(s):** `${{ github.event.inputs.service }}`
          **Image Tag:** `${{ github.event.inputs.image_tag }}`
          **Deployed by:** @${{ github.actor }}
          **Reason:** ${{ github.event.inputs.reason }}

    # ── Slack stakeholder notification ──
    - name: Notify Slack β€” stakeholders
      if: always()
      uses: slackapi/slack-github-action@v1
      with:
        payload: |
          {
            "text": "${{ job.status == 'success' && 'πŸš€' || 'πŸ”₯' }} Production deploy: *${{ github.event.inputs.service }}*",
            "blocks": [
              {
                "type": "header",
                "text": { "type": "plain_text", "text": "${{ job.status == 'success' && 'πŸš€ Production Deploy Succeeded' || 'πŸ”₯ Production Deploy FAILED' }}" }
              },
              {
                "type": "section",
                "fields": [
                  { "type": "mrkdwn", "text": "*Service:*
`${{ github.event.inputs.service }}`" },
                  { "type": "mrkdwn", "text": "*Tag:*
`${{ github.event.inputs.image_tag }}`" },
                  { "type": "mrkdwn", "text": "*By:*
@${{ github.actor }}" },
                  { "type": "mrkdwn", "text": "*Reason:*
${{ github.event.inputs.reason }}" }
                ]
              }
            ]
          }
      env:
        SLACK_WEBHOOK_URL:  ${{ secrets.SLACK_WEBHOOK_URL }}
        SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
SECTION 06

HELM CHART β€” KUBERNETES MANIFESTS

One chart, all four services, environment-specific overrides.

helm/shopwave/values.yamlyaml
# Default values β€” overridden per environment

replicaCount: 2

image:
repository: ""          # Set by pipeline: ACR_URL/shopwave/SERVICE
tag:        "latest"    # Overridden by pipeline with SHA tag
pullPolicy: Always

service:
type: ClusterIP
port: 8000              # Overridden per service

ingress:
enabled:   true
className: nginx
annotations:
  nginx.ingress.kubernetes.io/ssl-redirect:        "true"
  nginx.ingress.kubernetes.io/proxy-body-size:     "10m"
  nginx.ingress.kubernetes.io/proxy-read-timeout:  "60"
  cert-manager.io/cluster-issuer:                  "letsencrypt-prod"
tls:
  - secretName: shopwave-tls
    hosts:       []     # Set per environment

resources:
requests:
  cpu:    "100m"
  memory: "128Mi"
limits:
  cpu:    "500m"
  memory: "512Mi"

autoscaling:
enabled:                          true
minReplicas:                      2
maxReplicas:                      10
targetCPUUtilizationPercentage:   70
targetMemoryUtilizationPercentage: 80

podDisruptionBudget:
enabled:        true
minAvailable:   1

livenessProbe:
httpGet:
  path:   /health
  port:   http
initialDelaySeconds: 20
periodSeconds:       15
failureThreshold:    3

readinessProbe:
httpGet:
  path:   /health
  port:   http
initialDelaySeconds: 10
periodSeconds:       10
failureThreshold:    3

env:
ENVIRONMENT:    production
LOG_LEVEL:      INFO

# Secrets injected from Azure Key Vault via External Secrets Operator
externalSecrets:
enabled:        true
secretStoreName: azure-keyvault-store
refreshInterval: "1h"
secrets: []     # Defined per service in values-*.yaml
helm/shopwave/values-production.yamlyaml
replicaCount: 3

image:
pullPolicy: Always

ingress:
tls:
  - secretName: shopwave-prod-tls
    hosts:
      - shopwave.io
      - "*.shopwave.io"

resources:
requests:
  cpu:    "200m"
  memory: "256Mi"
limits:
  cpu:    "1000m"
  memory: "1Gi"

autoscaling:
minReplicas: 3
maxReplicas: 20

env:
ENVIRONMENT: production
LOG_LEVEL:   WARNING

externalSecrets:
secrets:
  - secretKey: DATABASE_URL
    remoteRef:
      key:      shopwave-prod-db-url
  - secretKey: SECRET_KEY
    remoteRef:
      key:      shopwave-prod-jwt-secret
  - secretKey: ALLOWED_ORIGINS
    remoteRef:
      key:      shopwave-prod-allowed-origins
helm/shopwave/templates/deployment.yamlyaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "shopwave.fullname" . }}
namespace: {{ .Release.Namespace }}
labels:
  {{- include "shopwave.labels" . | nindent 4 }}
annotations:
  deployment.kubernetes.io/revision: "{{ .Release.Revision }}"
spec:
replicas: {{ .Values.replicaCount }}
selector:
  matchLabels:
    {{- include "shopwave.selectorLabels" . | nindent 6 }}
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge:       1
    maxUnavailable: 0        # Zero-downtime: never kill before new pod ready
template:
  metadata:
    labels:
      {{- include "shopwave.selectorLabels" . | nindent 8 }}
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port:   "{{ .Values.service.port }}"
      prometheus.io/path:   "/metrics"
  spec:
    serviceAccountName: {{ include "shopwave.serviceAccountName" . }}
    securityContext:
      runAsNonRoot: true
      runAsUser:    1000
      fsGroup:      1000
    containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        ports:
          - name: http
            containerPort: {{ .Values.service.port }}
            protocol: TCP
        env:
          {{- range $key, $val := .Values.env }}
          - name:  {{ $key }}
            value: {{ $val | quote }}
          {{- end }}
        envFrom:
          - secretRef:
              name: {{ include "shopwave.fullname" . }}-secrets
        resources:
          {{- toYaml .Values.resources | nindent 12 }}
        livenessProbe:
          {{- toYaml .Values.livenessProbe | nindent 12 }}
        readinessProbe:
          {{- toYaml .Values.readinessProbe | nindent 12 }}
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem:   true
          capabilities:
            drop: [ALL]
SECTION 07

TERRAFORM β€” INFRASTRUCTURE AS CODE

Every Azure resource declared, versioned, and applied through a pipeline.

All Azure infrastructure is managed by Terraform β€” no manual portal clicks, no ad-hoc CLI commands. The configuration is split into five focused modules, each owning a single concern. The root main.tf wires them together and additionally provisions GitHub Actions OIDC federation so the CD pipelines can authenticate to Azure without any stored credentials.

Terraform Module Dependency Graph
ROOT main.tfOIDC Β· namespaces Β· role assignmentsnetworkingVNet Β· subnetsNSGs Β· DNS zoneacrContainer Registryadmin credskeyvaultRBAC authseeds JWT secretakscluster Β· node poolsLog Analytics Β· DefenderpostgresFlexible Server3 databases β†’ KV

infrastructure/terraform/main.tf (root β€” key excerpts)

The root module wires the five child modules together, passing outputs as inputs between them (e.g. the networking module's subnet IDs flow into both aks and postgres), and finishes by configuring GitHub Actions OIDC federation and creating the Kubernetes namespaces.

Step 0 β€” Bootstrap remote state (run once)

Before running terraform init you need an Azure Storage account to hold the state file. A small helper script creates it and prints the values you need to paste into the backend block in main.tf.

infrastructure/terraform/scripts/bootstrap-tfstate.shbash
#!/usr/bin/env bash
# Run ONCE before terraform init.
# Creates the Azure Blob Storage backend for Terraform remote state.
set -euo pipefail

TFSTATE_RG="shopwave-tfstate-rg"
TFSTATE_SA="shopwavetfstate${RANDOM}"   # globally unique suffix
LOCATION="${1:-eastus}"

az group create --name "$TFSTATE_RG" --location "$LOCATION" --output none

az storage account create \
--name "$TFSTATE_SA" --resource-group "$TFSTATE_RG" \
--sku Standard_LRS --kind StorageV2 \
--min-tls-version TLS1_2 --allow-blob-public-access false --output none

# Enable versioning β€” protects state from accidental overwrites
az storage account blob-service-properties update \
--account-name "$TFSTATE_SA" --resource-group "$TFSTATE_RG" \
--enable-versioning true --output none

az storage container create \
--name tfstate --account-name "$TFSTATE_SA" --auth-mode login --output none

echo "Update main.tf backend block with:"
echo "  storage_account_name = \"$TFSTATE_SA\""
echo "  resource_group_name  = \"$TFSTATE_RG\""
echo "Then: cp terraform.tfvars.example terraform.tfvars && terraform init"

infrastructure/terraform/main.tf β€” key excerpts

The root module calls all five child modules in dependency order and then handles cross-cutting concerns: GitHub Actions OIDC federation, role assignments, and Kubernetes namespace creation via the kubernetes provider (which is configured using the AKS module's output).

infrastructure/terraform/main.tfhcl
terraform {
required_version = ">= 1.7.0"
required_providers {
  azurerm = { source = "hashicorp/azurerm", version = "~> 3.110" }
  azuread = { source = "hashicorp/azuread", version = "~> 2.53"  }
  random  = { source = "hashicorp/random",  version = "~> 3.6"   }
  helm       = { source = "hashicorp/helm",       version = "~> 2.14" }
  kubernetes = { source = "hashicorp/kubernetes", version = "~> 2.31" }
}
backend "azurerm" {
  resource_group_name  = "shopwave-tfstate-rg"
  storage_account_name = "shopwavetfstate"    # from bootstrap-tfstate.sh
  container_name       = "tfstate"
  key                  = "shopwave.tfstate"
}
}

module "networking" {
source              = "./modules/networking"
project             = var.project
environment         = var.environment
location            = var.location
resource_group_name = azurerm_resource_group.main.name
vnet_address_space  = var.vnet_address_space
aks_subnet_cidr     = var.aks_subnet_cidr
pg_subnet_cidr      = var.pg_subnet_cidr
tags                = local.common_tags
}

module "keyvault" {
source              = "./modules/keyvault"
project             = var.project
environment         = var.environment
location            = var.location
resource_group_name = azurerm_resource_group.main.name
tenant_id           = data.azurerm_client_config.current.tenant_id
admin_object_ids    = var.keyvault_admin_object_ids
tags                = local.common_tags
}

module "acr" {
source              = "./modules/acr"
project             = var.project
environment         = var.environment
location            = var.location
resource_group_name = azurerm_resource_group.main.name
sku                 = var.acr_sku
tags                = local.common_tags
}

module "aks" {
source              = "./modules/aks"
project             = var.project
environment         = var.environment
location            = var.location
resource_group_name = azurerm_resource_group.main.name
subnet_id           = module.networking.aks_subnet_id   # ← cross-module wiring
acr_id              = module.acr.id
keyvault_id         = module.keyvault.id
node_count          = var.aks_node_count
node_vm_size        = var.aks_node_vm_size
kubernetes_version  = var.kubernetes_version
min_node_count      = var.aks_min_node_count
max_node_count      = var.aks_max_node_count
tags                = local.common_tags
depends_on          = [module.networking, module.acr, module.keyvault]
}

module "postgres" {
source              = "./modules/postgres"
project             = var.project
environment         = var.environment
location            = var.location
resource_group_name = azurerm_resource_group.main.name
subnet_id           = module.networking.pg_subnet_id          # ← private subnet
private_dns_zone_id = module.networking.pg_private_dns_zone_id
sku_name            = var.postgres_sku
storage_mb          = var.postgres_storage_mb
postgres_version    = var.postgres_version
databases           = var.postgres_databases
keyvault_id         = module.keyvault.id   # ← stores connection strings here
tags                = local.common_tags
depends_on          = [module.networking, module.keyvault]
}

# ── GitHub Actions OIDC federation ───────────────────────────────────────────
resource "azuread_application" "github_actions" {
display_name = "${var.project}-github-actions-${var.environment}"
}
resource "azuread_service_principal" "github_actions" {
client_id = azuread_application.github_actions.client_id
}
resource "azuread_application_federated_identity_credential" "staging" {
application_id = azuread_application.github_actions.id
display_name   = "${var.project}-staging"
audiences      = ["api://AzureADTokenExchange"]
issuer         = "https://token.actions.githubusercontent.com"
subject        = "repo:${var.github_org}/${var.github_repo}:environment:staging"
}
resource "azuread_application_federated_identity_credential" "production" {
application_id = azuread_application.github_actions.id
display_name   = "${var.project}-production"
audiences      = ["api://AzureADTokenExchange"]
issuer         = "https://token.actions.githubusercontent.com"
subject        = "repo:${var.github_org}/${var.github_repo}:environment:production"
}

# Role assignments β€” principle of least privilege
resource "azurerm_role_assignment" "gh_acr_push"  {
scope = module.acr.id; role_definition_name = "AcrPush"
principal_id = azuread_service_principal.github_actions.id
}
resource "azurerm_role_assignment" "gh_aks_user"  {
scope = module.aks.id
role_definition_name = "Azure Kubernetes Service Cluster User Role"
principal_id = azuread_service_principal.github_actions.id
}
resource "azurerm_role_assignment" "gh_kv_secrets" {
scope = module.keyvault.id; role_definition_name = "Key Vault Secrets User"
principal_id = azuread_service_principal.github_actions.id
}

# Kubernetes namespaces β€” created via k8s provider post-AKS provisioning
resource "kubernetes_namespace" "staging"    {
metadata { name = "shopwave-staging";    labels = { environment = "staging"    } }
depends_on = [module.aks]
}
resource "kubernetes_namespace" "production" {
metadata { name = "shopwave-production"; labels = { environment = "production" } }
depends_on = [module.aks]
}

modules/aks/main.tf β€” production-hardened cluster

The AKS module creates two node pools: a tainted system pool for Kubernetes internals and an apps pool for ShopWave workloads. Workload Identity, OIDC issuer, Container Insights, Microsoft Defender for Containers, and the Key Vault CSI driver are all enabled by default.

infrastructure/terraform/modules/aks/main.tfhcl
resource "azurerm_kubernetes_cluster" "main" {
name                = "${var.project}-aks-${var.environment}"
location            = var.location
resource_group_name = var.resource_group_name
dns_prefix          = "${var.project}-${var.environment}"
kubernetes_version  = var.kubernetes_version

oidc_issuer_enabled       = true   # ← required for Workload Identity
workload_identity_enabled = true

default_node_pool {               # system pool β€” kube-system only
  name                         = "system"
  vm_size                      = var.node_vm_size
  vnet_subnet_id               = var.subnet_id
  enable_auto_scaling          = true
  min_count                    = var.min_node_count
  max_count                    = var.max_node_count
  only_critical_addons_enabled = true   # taint: NoSchedule for user workloads
  upgrade_settings { max_surge = "33%" }
}

identity { type = "SystemAssigned" }

network_profile {
  network_plugin    = "azure"
  network_policy    = "azure"       # enforces NetworkPolicy objects
  load_balancer_sku = "standard"
  service_cidr      = "172.16.0.0/16"
  dns_service_ip    = "172.16.0.10"
}

oms_agent {                         # Container Insights
  log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
microsoft_defender {                # Defender for Containers
  log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
key_vault_secrets_provider {        # CSI driver for Key Vault secret injection
  secret_rotation_enabled  = true
  secret_rotation_interval = "2m"
}

automatic_channel_upgrade = "patch"
tags = var.tags
}

# App node pool β€” user workloads land here (no taint)
resource "azurerm_kubernetes_cluster_node_pool" "apps" {
name                  = "apps"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size               = var.node_vm_size
vnet_subnet_id        = var.subnet_id
enable_auto_scaling   = true
min_count             = var.min_node_count
max_count             = var.max_node_count
mode                  = "User"
tags                  = var.tags
}

# AKS pulls images from ACR without a stored password
resource "azurerm_role_assignment" "aks_acr_pull" {
scope                = var.acr_id
role_definition_name = "AcrPull"
principal_id         = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}

modules/postgres/main.tf β€” secrets auto-stored in Key Vault

The postgres module generates a random password, provisions the Flexible Server with per-service databases, and writes the full connection strings directly into Key Vault β€” so they're available to the External Secrets Operator without any manual copy-paste step.

infrastructure/terraform/modules/postgres/main.tfhcl
resource "random_password" "postgres" {
length  = 32; special = true
override_special = "!#$%&*()-_=+[]{}<>?"
}

resource "azurerm_postgresql_flexible_server" "main" {
name                   = "${var.project}-pg-${var.environment}"
resource_group_name    = var.resource_group_name
location               = var.location
version                = var.postgres_version
delegated_subnet_id    = var.subnet_id           # private VNet access only
private_dns_zone_id    = var.private_dns_zone_id
administrator_login    = local.pg_admin
administrator_password = random_password.postgres.result
sku_name               = var.sku_name
storage_mb             = var.storage_mb

backup_retention_days        = var.environment == "production" ? 14 : 7
geo_redundant_backup_enabled = var.environment == "production"

high_availability {
  mode = var.environment == "production" ? "ZoneRedundant" : "Disabled"
}
tags = var.tags
lifecycle { ignore_changes = [zone, high_availability[0].standby_availability_zone] }
}

# Create each per-service database
resource "azurerm_postgresql_flexible_server_database" "databases" {
for_each  = toset(var.databases)
name      = each.value
server_id = azurerm_postgresql_flexible_server.main.id
lifecycle { prevent_destroy = true }   # never drop a production database
}

# Write connection strings into Key Vault β€” one per database
resource "azurerm_key_vault_secret" "db_urls" {
for_each     = local.db_urls
name         = "${var.project}-${var.environment}-${replace(each.key, "_", "-")}-url"
value        = each.value       # postgresql+asyncpg://admin:pass@host/db?ssl=require
key_vault_id = var.keyvault_id
}

infrastructure/terraform/variables.tf β€” fully validated

infrastructure/terraform/variables.tfhcl
variable "project" {
type    = string; default = "shopwave"
validation {
  condition     = can(regex("^[a-z][a-z0-9-]{2,20}$", var.project))
  error_message = "Lowercase alphanumeric + hyphens, 3-20 chars."
}
}
variable "environment" {
type = string
validation {
  condition     = contains(["staging", "production"], var.environment)
  error_message = "Must be 'staging' or 'production'."
}
}
variable "location"           { type = string; default = "eastus" }
variable "subscription_id"    { type = string; sensitive = true }

# Networking
variable "vnet_address_space" { type = string; default = "10.0.0.0/16" }
variable "aks_subnet_cidr"    { type = string; default = "10.0.1.0/24" }
variable "pg_subnet_cidr"     { type = string; default = "10.0.2.0/24" }

# AKS
variable "kubernetes_version"  { type = string; default = "1.30" }
variable "aks_node_count"      { type = number; default = 3 }
variable "aks_node_vm_size"    { type = string; default = "Standard_D2s_v3" }
variable "aks_min_node_count"  { type = number; default = 2 }
variable "aks_max_node_count"  { type = number; default = 10 }

# PostgreSQL
variable "postgres_version"    { type = string; default = "16" }
variable "postgres_sku"        { type = string; default = "Standard_B2ms" }
variable "postgres_storage_mb" { type = number; default = 32768 }
variable "postgres_databases"  { type = list(string); default = ["auth_db","products_db","orders_db"] }

# Key Vault
variable "keyvault_admin_object_ids" { type = list(string); default = [] }

# GitHub OIDC
variable "github_org"  { type = string }
variable "github_repo" { type = string; default = "shopwave" }

infrastructure/terraform/outputs.tf β€” reads straight into GitHub Secrets

infrastructure/terraform/outputs.tfhcl
output "acr_login_server"          { value = module.acr.login_server }
output "aks_cluster_name"          { value = module.aks.cluster_name }
output "keyvault_name"             { value = module.keyvault.name }
output "postgres_fqdn"             { value = module.postgres.fqdn }
output "github_actions_client_id"  { value = azuread_application.github_actions.client_id }
output "tenant_id"                 { value = data.azurerm_client_config.current.tenant_id }

# acr_admin_username and acr_admin_password are marked sensitive=true
# retrieve them with: terraform output -raw acr_admin_username

output "github_secrets_summary" {
sensitive = true
value = <<-EOT
  AZURE_CLIENT_ID        = ${azuread_application.github_actions.client_id}
  AZURE_TENANT_ID        = ${data.azurerm_client_config.current.tenant_id}
  AZURE_SUBSCRIPTION_ID  = ${var.subscription_id}
  ACR_LOGIN_SERVER       = ${module.acr.login_server}
EOT
}

Deploying infrastructure end-to-end

terminalbash
# 1 β€” Bootstrap remote state (once only)
cd infrastructure/terraform
chmod +x scripts/bootstrap-tfstate.sh
./scripts/bootstrap-tfstate.sh
# Update the backend block in main.tf with the printed storage account name

# 2 β€” Configure your environment
cp terraform.tfvars.example terraform.tfvars
# Edit: subscription_id, github_org, keyvault_admin_object_ids

# 3 β€” Initialise, plan, apply
terraform init
terraform plan \
-var-file="environments/production/production.tfvars" \
-var="subscription_id=$(az account show --query id -o tsv)" \
-out=tfplan
terraform apply tfplan

# 4 β€” Copy output values to GitHub Secrets
terraform output github_secrets_summary
terraform output -raw acr_admin_username   # β†’ ACR_USERNAME
terraform output -raw acr_admin_password   # β†’ ACR_PASSWORD
β„Ή Terraform environments

Two environment-specific .tfvars files live under environments/. Staging uses smaller VM sizes (Standard_B2s) and a single-node PostgreSQL SKU. Production uses Standard_D2s_v3 nodes, zone-redundant PostgreSQL HA, 14-day backups, and a minimum of three AKS nodes. Both share the same Terraform code β€” only the variable values differ.

Secrets β†’ Key Vault β†’ pods: the full chain

After Terraform runs, all connection strings and secrets live in Key Vault. The External Secrets Operator reads them and materialises them as Kubernetes Secrets, which Helm mounts into pods via envFrom. No secret is ever written to a pipeline log or a YAML file.

helm/shopwave/templates/externalsecret.yamlyaml
{{- if .Values.externalSecrets.enabled }}
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: {{ include "shopwave.fullname" . }}-secrets
namespace: {{ .Release.Namespace }}
spec:
refreshInterval: {{ .Values.externalSecrets.refreshInterval }}
secretStoreRef:
  name: {{ .Values.externalSecrets.secretStoreName }}
  kind: ClusterSecretStore
target:
  name:           {{ include "shopwave.fullname" . }}-secrets
  creationPolicy: Owner
data:
  {{- range .Values.externalSecrets.secrets }}
  - secretKey: {{ .secretKey }}
    remoteRef:
      key:     {{ .remoteRef.key }}
      version: {{ .remoteRef.version | default "latest" }}
  {{- end }}
{{- end }}
SECTION 08

LOCAL DEVELOPMENT SETUP

docker-compose for the full stack locally, pytest for fast iteration.

docker-compose.ymlyaml
version: "3.9"

services:

postgres:
  image: postgres:16-alpine
  environment:
    POSTGRES_USER:     shopwave
    POSTGRES_PASSWORD: shopwave_local
    POSTGRES_MULTIPLE_DATABASES: auth_db,products_db,orders_db
  volumes:
    - postgres_data:/var/lib/postgresql/data
    - ./scripts/init-multiple-dbs.sh:/docker-entrypoint-initdb.d/init.sh
  ports:
    - "5432:5432"
  healthcheck:
    test:     ["CMD-SHELL", "pg_isready -U shopwave"]
    interval: 10s
    timeout:  5s
    retries:  5

redis:
  image: redis:7-alpine
  ports:
    - "6379:6379"
  command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru

api-gateway:
  build:
    context: services/api-gateway
    target:  production
  ports:
    - "8000:8000"
  env_file: .env.local
  environment:
    AUTH_SERVICE_URL:    http://auth-service:8001
    PRODUCT_SERVICE_URL: http://product-service:8002
    ORDER_SERVICE_URL:   http://order-service:8003
    REDIS_URL:           redis://redis:6379/0
  depends_on:
    - auth-service
    - product-service
    - order-service
  healthcheck:
    test:     ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 15s
    timeout:  5s

auth-service:
  build:
    context: services/auth-service
    target:  production
  ports:
    - "8001:8001"
  env_file: .env.local
  environment:
    DATABASE_URL: postgresql+asyncpg://shopwave:shopwave_local@postgres:5432/auth_db
    ENVIRONMENT:  local
  depends_on:
    postgres:
      condition: service_healthy

product-service:
  build:
    context: services/product-service
    target:  production
  ports:
    - "8002:8002"
  env_file: .env.local
  environment:
    DATABASE_URL: postgresql+asyncpg://shopwave:shopwave_local@postgres:5432/products_db
    ENVIRONMENT:  local
  depends_on:
    postgres:
      condition: service_healthy

order-service:
  build:
    context: services/order-service
    target:  production
  ports:
    - "8003:8003"
  env_file: .env.local
  environment:
    DATABASE_URL: postgresql+asyncpg://shopwave:shopwave_local@postgres:5432/orders_db
    ENVIRONMENT:  local
  depends_on:
    postgres:
      condition: service_healthy

volumes:
postgres_data:
SECTION 09

GITHUB SECRETS REFERENCE

Every secret the pipelines need β€” where it comes from, what it does.

Required GitHub Secrets
Secret NameWhere to get itUsed inNotes
AZURE_CLIENT_IDterraform output github_actions_client_idAll pipelinesOIDC federated identity β€” no password
AZURE_TENANT_IDterraform output tenant_idAll pipelinesYour Azure AD tenant
AZURE_SUBSCRIPTION_IDterraform output subscription_idAll pipelinesTarget subscription
ACR_LOGIN_SERVERterraform output acr_login_serverCI build + CD deploye.g. shopwaveacr.azurecr.io
ACR_USERNAMEterraform output -raw acr_admin_usernameCI build (docker login)ACR admin username
ACR_PASSWORDterraform output -raw acr_admin_passwordCI build (docker login)Rotate regularly
SLACK_WEBHOOK_URLSlack app β†’ Incoming WebhooksAll pipelines (notify)Incoming webhook URL
CODECOV_TOKENcodecov.io project settingsCI test jobCoverage upload token
SECTION 10

END-TO-END DEVELOPER WORKFLOW

From git commit to live production β€” every step automated.

Complete Developer β†’ Production Journey
1Developer opens PRgit checkout -b feat/add-wishlist β†’ git push β†’ PR openedT+02CI triggers (path-filtered)Only product-service CI runs Β· lint β†’ test β†’ SAST Β· ~6 minT+6m3Tests pass Β· PR reviewableGitHub status checks green Β· Coverage badge updates Β· Codecov reportT+7m4Code review + merge to mainReviewer approves Β· Squash merge Β· CI re-runs on main branchT+Xh5CI builds + pushes to ACRMulti-stage Docker build Β· Trivy scan (block on CRITICAL) Β· Push sha-XXXXXXXXT+Xh+8m6CD staging auto-deploysHelm upgrade --atomic Β· Smoke tests Β· Slack alert Β· staging tag createdT+Xh+13m7Manual gate β†’ productionEngineer triggers cd-production.yml Β· Reviewer approves Β· Rolling deploy β†’ liveT+Xh+30mβ˜…Infra change? terraform.ymlPR touches infrastructure/terraform/ β†’ plan comment on PR β†’ apply on dispatchas needed
SECTION 11

NETWORK POLICIES & ZERO-TRUST KUBERNETES

Services can only talk to who they need to.

k8s/network-policies.yamlyaml
# Default deny-all in production namespace
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: shopwave-production
spec:
podSelector: {}
policyTypes: [Ingress, Egress]

# Allow ingress from nginx ingress controller β†’ api-gateway only
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-ingress-to-gateway
namespace: shopwave-production
spec:
podSelector:
  matchLabels:
    app.kubernetes.io/name: api-gateway
policyTypes: [Ingress]
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: ingress-nginx

# Allow api-gateway β†’ each backend service
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-gateway-to-services
namespace: shopwave-production
spec:
podSelector:
  matchLabels:
    app.kubernetes.io/component: backend-service
policyTypes: [Ingress]
ingress:
  - from:
      - podSelector:
          matchLabels:
            app.kubernetes.io/name: api-gateway
    ports:
      - protocol: TCP
        port:     8001
      - protocol: TCP
        port:     8002
      - protocol: TCP
        port:     8003

# Allow all services egress to Azure DB (private endpoint)
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-to-postgres
namespace: shopwave-production
spec:
podSelector:
  matchLabels:
    app.kubernetes.io/component: backend-service
policyTypes: [Egress]
egress:
  - ports:
      - protocol: TCP
        port:     5432
SECTION 12

SECURITY POSTURE SUMMARY

Every layer of the pipeline hardened against common attack vectors.

Security Controls β€” Layer by Layer
CODEβ–Έ Semgrep SAST (p/python, p/owasp-top-ten, p/jwt)β–Έ Ruff + mypy static analysis on every PRβ–Έ Dependabot auto-PRs for vulnerable dependenciesIMAGEβ–Έ Multi-stage build β€” distroless final image (~15MB)β–Έ Trivy scans CRITICAL/HIGH CVEs β€” blocks pipelineβ–Έ Cosign image signing β€” ACR verifies before deployINFRAβ–Έ All Azure resources declared in Terraform β€” no manual changesβ–Έ terraform plan diff commented on every infra PRβ–Έ Remote state in Azure Blob with versioning enabledRUNTIMEβ–Έ runAsNonRoot: true Β· readOnlyRootFilesystem: trueβ–Έ capabilities: drop [ALL] β€” zero Linux capabilitiesβ–Έ Azure Network Policy β€” default deny, explicit allow-listCREDENTIALSβ–Έ OIDC Workload Identity β€” zero static credentials in pipelinesβ–Έ Azure Key Vault β€” connection strings auto-seeded by Terraformβ–Έ Secret rotation via Key Vault versioning + ESO 1h refresh
SECTION 13

WHAT YOU NOW HAVE

A complete, production-deployable E-Commerce platform.

Every component in this walkthrough is a real, working piece of the ShopWave system. Here is the complete inventory of what was built:

🐍
4 Python FastAPI Services
auth, product, order, api-gateway β€” each with async SQLAlchemy, Alembic migrations, structured logging, Prometheus metrics
βš™οΈ
7 GitHub Actions Pipelines
4 per-service CI pipelines + staging CD + production CD + terraform.yml β€” path-filtered, OIDC-authenticated, Slack-notified
πŸ—οΈ
Terraform Infrastructure as Code
5 modules β€” networking, aks, acr, postgres, keyvault β€” all resources versioned, plan-reviewed on PR, applied through a gated pipeline
πŸ”’
End-to-End Security
Semgrep SAST Β· Trivy image scan Β· Cosign signing Β· OIDC no-credential auth Β· Key Vault secrets auto-seeded by Terraform Β· zero-trust network policies
☸️
Helm Chart + K8s Manifests
Single parameterised chart Β· rolling update strategy Β· HPA Β· PDB Β· External Secrets Β· RBAC Β· Network Policies
πŸš€
Zero-Downtime Deployments
maxUnavailable: 0 rolling updates Β· --atomic Helm with auto-rollback Β· smoke tests gate every staging deploy
☁️
Full Azure Stack
AKS + ACR + PostgreSQL Flexible (zone-redundant in prod) + Key Vault + private VNet + Azure Monitor + Defender for Containers β€” all Terraform-managed
βœ“ Next steps

From here, natural extensions include: switching Terraform to use Terraform Cloud or Atlantis for plan/apply with PR-based GitOps workflows; adding Argo Rollouts for canary deployments with automated metric-based promotion; integrating Azure Application Insights for distributed tracing across all four services; adding a contract testing job to each CI pipeline using Pact; and wiring Azure Cost Management alerts into the pipeline notifications.

ShopWave Β· Python FastAPI Β· Azure AKS Β· Terraform IaC Β· GitHub Actions Β· Production-Grade CI/CD Β· February 2026