Building a Production-Grade Multi-Environment GitOps Pipeline on Azure AKS

Author: Ali Haidry (alihaidry.dev)

Published May 2026 · 20 min read · Tags: GitOps, Argo CD, GitHub Actions, Azure AKS

This post documents the PakTech Multi-Environment GitOps Pipeline, a personal project I built to practise production-grade CI/CD patterns on Azure Kubernetes Service. It is not a toy: it has OIDC federation, per-environment Helm overrides, a human approval gate before production, targeted auto-rollback, Trivy + Checkov security scanning, Slack notifications, and a Terraform pipeline with drift detection. This post walks through every design decision and the concrete problems I ran into along the way.


What the project does

One-line summary: Every push to main automatically deploys to dev via Argo CD. A validated image SHA can then be manually promoted through staging → human approval → production, with an automatic rollback of the prod values file if the production deploy job fails.

The application itself is deliberately simple: a Flask JSON API that returns environment name, build SHA, and timestamp. That's not the point. The point is everything around it: the pipeline that gets it to a cluster safely, the mechanism that promotes it between environments without rebuilding, and the safety net that rolls it back if production health checks fail.

The project spans three GitHub repositories:

  • app-paktech-prod: Flask source, Dockerfile, Helm chart, GitHub Actions workflows
  • gitops-paktech-app: Per-environment Helm value overrides, Argo CD ApplicationSet + AppProject bootstrap manifests
  • (infrastructure): Terraform for AKS, ACR, OIDC federation

Architecture: the dual-repo GitOps pattern

System Architecture
Developer push to main
        │
        ▼
  GitHub Actions CI                      ACR
  ┌─────────────────────────────┐        paktechacr.azurecr.io
  │  test → build → scan        │──────► image:$SHA
  │  → patch gitops dev/values  │
  └─────────────────────────────┘        gitops-paktech-app
                                         ┌──────────────────┐
  Manual promote (workflow_dispatch)     │  dev/values.yaml │◄─── CI writes
  ──────────────────────────────────►    │  staging/values  │◄─── promote writes
  staging → smoke → approval → prod      │  prod/values     │◄─── promote + approval
                                         └────────┬─────────┘
                                                  │
                                         Argo CD watches
                                                  │
                                           AKS cluster
                                  ┌───────────────┴───────────────┐
                              paktech-dev  paktech-staging  paktech-prod

The key principle is separation of concerns. The app repo owns what to run: source code, Dockerfile, Helm chart, pipelines. The GitOps repo owns where and how: the per-environment configuration that Argo CD reconciles against the cluster. Neither repo needs to know the details of the other. CI never touches the cluster directly. Argo CD never rebuilds images. The pipeline that promotes a SHA is completely decoupled from the pipeline that built it.

This matters operationally. If Argo CD is down, CI still runs and still updates the GitOps repo. When Argo CD recovers, it reconciles. If CI is broken, the cluster keeps running whatever was last synced. The two concerns fail independently.


The Flask application

The app is minimal by design:

@app.route("/")
def index():
    return jsonify({
        "app": "paktech-app",
        "environment": APP_ENV,
        "version": APP_VERSION,
        "build_sha": BUILD_SHA,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

@app.route("/health")
def health():
    return jsonify({"status": "healthy", "environment": APP_ENV}), 200

@app.route("/info")
def info():
    return jsonify({
        "environment": APP_ENV,
        "version": APP_VERSION,
        "build_sha": BUILD_SHA,
    })

APP_ENV, APP_VERSION, and BUILD_SHA are injected as Docker build args by CI and baked into the image at build time. This means /info always tells you exactly what image is running, which is useful for verifying that a promotion actually landed the right SHA on the cluster.
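One way the build-time values could reach the Python process is through environment variables set in the image. The sketch below assumes the Dockerfile copies each build arg into an ENV of the same name, which this post does not show; the load_build_info helper and the fallback values in DEFAULTS are hypothetical.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallbacks for when a variable is missing, e.g. in local test runs.
DEFAULTS = {"APP_ENV": "local", "APP_VERSION": "0.0.0", "BUILD_SHA": "unknown"}


def load_build_info(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the three build-time values, falling back to DEFAULTS.

    Assumes the image exposes each Docker build arg as an environment
    variable with the same name (an assumption, not taken from the post).
    """
    source = os.environ if env is None else env
    return {name: source.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    # Prints whatever the current process environment provides.
    print(load_build_info())
```

Passing an explicit mapping instead of reading os.environ directly keeps the function easy to exercise in tests without touching the real environment.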

The Dockerfile is multi-stage: a builder stage installs dependencies in isolation, and the runtime stage copies only what's needed into a python:3.12-slim image running as a non-root user with a read-only root filesystem. Gunicorn is the production server; the Flask dev server never runs outside tests.


The CI Pipeline (ci.yml)

CI Pipeline Flow (push to main)

  1. test: pytest -n auto (parallel) · JUnit XML → dorny/test-reporter for GitHub Checks · permissions: checks:write
  2. build: docker/build-push-action → paktechacr.azurecr.io/paktech-app:${{ github.sha }} + :latest · ACR registry cache mode=max · build args: APP_VERSION, BUILD_SHA, APP_ENV=prod
  3a. scan-image: Trivy: HIGH + CRITICAL CVEs, exit-code 0 (report without blocking) · OIDC login to pull from ACR
  3b. scan-iac: Checkov: Terraform + Kubernetes + Dockerfile · soft_fail: true · runs in parallel with scan-image
  4. deploy-dev: yq patches gitops/apps/paktech-app/dev/values.yaml · git -C gitops commit + push · Argo CD auto-syncs paktech-dev namespace · Slack notification

The pipeline only runs the full build/scan/deploy sequence on pushes to main. PRs run the test job only, since there is no point building images for work-in-progress code. This is enforced with if: github.event_name == 'push' && github.ref == 'refs/heads/main' on the build job; every downstream job declares needs: build, so when build is skipped the downstream jobs are skipped too.

OIDC authentication

All Azure operations (ACR login, AKS credentials, Terraform) authenticate via OIDC federation. The service principal has a federated credential configured for this exact repository and branch:

Issuer:   https://token.actions.githubusercontent.com
Subject:  repo:PakTechLimited/app-paktech-prod:ref:refs/heads/main
Audience: api://AzureADTokenExchange

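No client secrets. No PATs for Azure. The only stored credential is GITOPS_APP_TOKEN, a classic PAT with the repo scope that is used to push to the GitOps repo. It is needed because the OIDC token is only exchanged for Azure access and the workflow's built-in GITHUB_TOKEN is scoped to the repository the workflow runs in, so neither can push to a different repository.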

Image tagging

Every image is tagged with the full 40-character Git SHA. Never a short SHA, never a semver that could collide, never latest as the primary tag. The latest tag is pushed as a convenience for humans, but the GitOps repo always pins the exact SHA. This means every running deployment is traceable to a specific commit.
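A rule like "full 40-character SHA only" is easy to check mechanically before a tag is written anywhere. The following is a minimal sketch of such a check; the is_full_git_sha helper is hypothetical and not part of the project's pipelines, and it assumes lowercase hexadecimal SHA-1 commit IDs as GitHub Actions provides in github.sha.

```python
import re

# A full Git SHA-1 commit ID: exactly 40 lowercase hexadecimal characters.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_full_git_sha(tag: str) -> bool:
    """Return True only for a complete 40-character commit SHA."""
    return FULL_SHA.fullmatch(tag) is not None


if __name__ == "__main__":
    for candidate in ["latest", "1.2.3", "abc1234", "0123456789abcdef0123456789abcdef01234567"]:
        print(candidate, is_full_git_sha(candidate))
```

Using fullmatch rather than search matters here: a longer string that merely contains 40 hex characters would otherwise pass.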

The deploy-dev step: how it patches the GitOps repo

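# Point the dev values file at the image that was just built.
yq e '.image.tag = "${{ github.sha }}"' \
  -i gitops/apps/paktech-app/dev/values.yaml

# Commit as the Actions bot, always operating on the gitops checkout.
git -C gitops config user.name  "github-actions[bot]"
git -C gitops config user.email "github-actions[bot]@users.noreply.github.com"
git -C gitops add apps/paktech-app/dev/values.yaml
# Commit only when something is staged, so a re-run with the same SHA is a no-op.
# The diff must also use -C gitops; otherwise it inspects the app repo checkout.
git -C gitops diff --staged --quiet || git -C gitops commit -m "chore(dev): bump paktech-app to ${{ github.sha }}"
git -C gitops push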

The git diff --staged --quiet || git commit guard is important. Without it, re-running the pipeline with the same SHA (e.g. after a transient failure in a later step) would fail with "nothing to commit, working tree clean". The guard makes the step idempotent.
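The same idempotency idea can be modelled outside of git: patch the tag, and report whether anything actually changed so the caller knows if a commit is needed. This is only an illustrative sketch; bump_image_tag is a hypothetical helper, and it treats the values file as an already-parsed dictionary rather than YAML on disk.

```python
import copy
from typing import Tuple


def bump_image_tag(values: dict, new_tag: str) -> Tuple[dict, bool]:
    """Return (values, changed).

    changed is False when image.tag already equals new_tag, which is the
    case the `git diff --staged --quiet ||` guard turns into a no-op.
    """
    current = values.get("image", {}).get("tag")
    if current == new_tag:
        return values, False
    updated = copy.deepcopy(values)  # leave the caller's dict untouched
    updated.setdefault("image", {})["tag"] = new_tag
    return updated, True


if __name__ == "__main__":
    values = {"image": {"repository": "paktech-app", "tag": "aaa"}}
    values, changed = bump_image_tag(values, "bbb")
    print(changed)  # first run changes the tag
    values, changed = bump_image_tag(values, "bbb")
    print(changed)  # re-run with the same tag changes nothing
```

A second call with the same tag returns changed as False, which corresponds to the re-run case where the shell guard skips the commit.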


The Promotion Pipeline (promote.yml)

Design principle: The promote pipeline never rebuilds. It takes an image SHA that was already built, scanned, and deployed to dev, and walks it through staging and prod. This ensures what hits production is byte-for-byte identical to what was tested.

Triggered manually via workflow_dispatch with two inputs: image_tag (the full Git SHA) and target_environment (staging or prod).

Promotion Pipeline Flow
workflow_dispatch
Input: image_tag=<sha>, target_environment=staging|prod
          │
          ▼
┌────────────────────┐
│  validate          │  az acr repository show-tags
│                    │  Aborts if SHA not found in paktechacr
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│  deploy-staging    │  Patches staging/values.yaml
│                    │  Waits 90s for Argo CD sync
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│  smoke-test-staging│  Placeholder (ClusterIP, no public URL)
└─────────┬──────────┘
          │  (only if target_environment == 'prod')
          ▼
┌────────────────────┐
│  deploy-prod       │  ⏸ Pauses for human approval
│  (env: prod)       │  Patches prod/values.yaml
│                    │  5-min monitoring window
│                    │  Slack notification
└─────────┬──────────┘
          │  (on failure only)
          ▼
┌────────────────────┐
│  rollback-prod     │  Targeted file revert + push
└────────────────────┘

The validate step

Before touching any environment, the pipeline confirms the SHA actually exists in ACR:

az acr repository show-tags \
  --name ${{ vars.ACR_NAME }} \
  --repository paktech-app \
  --query "[?@=='${{ inputs.image_tag }}']" \
  --output tsv | grep -q "${{ inputs.image_tag }}" || {
    echo "Image tag not found in ACR. Aborting."
    exit 1
  }

This prevents the pipeline from patching the GitOps repo with a SHA that can't be pulled, which would leave the new pods stuck in ImagePullBackOff.

Manual approval gate

The deploy-prod job is assigned to the prod GitHub Environment, which has required reviewers configured. When the job reaches this point, GitHub pauses the workflow and sends a notification to the designated reviewer. Only after explicit approval does the job execute.

This is the right level to put the gate: not in a shell script, not in a Slack bot, but in GitHub's native environment protection, which is audited and visible in the deployment history.


Auto-Rollback: Why git checkout HEAD~1 -- file and not git revert

The auto-rollback job fires when deploy-prod fails:

rollback-prod:
  if: failure() && inputs.target_environment == 'prod' && needs.deploy-prod.result == 'failure'

The rollback command:

git checkout HEAD~1 -- apps/paktech-app/prod/values.yaml
git commit -m "revert(prod): roll back paktech-app image tag"
git push

Why not git revert HEAD?

git revert HEAD undoes every file changed in the last commit. In a busy GitOps repo, the last commit might have touched staging or bootstrap manifests in addition to prod. You'd be reverting work that had nothing to do with the failed deployment. git checkout HEAD~1 -- prod/values.yaml is surgical: it restores exactly one file to the version it held one commit earlier, which is the pre-promotion version as long as the promotion commit is still the tip of the branch. Nothing else is touched.

This was one of the more interesting design decisions in the project. The naive approach (git revert) is unsafe in a shared repo. The targeted approach is safe and composable: even if multiple files were changed in a single commit, you can roll back each independently.
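The difference in scope can be illustrated with a toy model in which a commit is just a mapping from file path to content. This is a sketch of the idea only, not of how git stores or rewrites history; the Snapshot type and both helper functions are hypothetical.

```python
from typing import Dict

Snapshot = Dict[str, str]  # hypothetical model: file path -> file content


def revert_head(parent: Snapshot) -> Snapshot:
    """Model of `git revert HEAD` run right after the bad commit:
    every file that commit changed returns to its parent content."""
    return dict(parent)


def restore_path(parent: Snapshot, head: Snapshot, path: str) -> Snapshot:
    """Model of `git checkout HEAD~1 -- <path>`: only one file is restored."""
    restored = dict(head)
    # Raises KeyError if the path is absent one commit earlier;
    # git would likewise refuse an unknown pathspec.
    restored[path] = parent[path]
    return restored


PROD = "apps/paktech-app/prod/values.yaml"
STAGING = "apps/paktech-app/staging/values.yaml"

parent = {PROD: "tag: aaa", STAGING: "tag: aaa"}
head = {PROD: "tag: bbb", STAGING: "tag: bbb"}  # one commit touched both files

if __name__ == "__main__":
    print(revert_head(parent))               # staging is rolled back as well
    print(restore_path(parent, head, PROD))  # only prod is rolled back
```

In the model, the targeted restore leaves the staging entry at its newer value while the whole-commit revert resets both files.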


Argo CD Configuration: ApplicationSet with Dual-Source

The ApplicationSet generates three Argo CD Applications from a single YAML template:

generators:
  - list:
      elements:
        - env: dev
          namespace: paktech-dev
        - env: staging
          namespace: paktech-staging
        - env: prod
          namespace: paktech-prod

template:
  spec:
    sources:
      - repoURL: https://github.com/PakTechLimited/gitops-paktech-app
        targetRevision: main
        ref: gitopsValues

      - repoURL: https://github.com/PakTechLimited/app-paktech-prod
        targetRevision: main
        path: helm
        helm:
          valueFiles:
            - $gitopsValues/apps/paktech-app/{{env}}/values.yaml

The dual-source pattern lets the Helm chart live in the app repo (where it belongs, alongside the code that uses it) while the per-environment values live in the GitOps repo (where they belong, versioned separately from the application). This requires Argo CD v2.6+.

ignoreDifferences for HPA

ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas

Without this, Argo CD fights the Horizontal Pod Autoscaler. HPA scales replicas up under load; Argo CD sees drift from the Helm-rendered value and resets them. The ignoreDifferences block tells Argo CD: "don't touch /spec/replicas, HPA owns that." The replicaCount in values.yaml still sets the initial baseline when HPA is off or when the Deployment is first created.
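The effect of ignoring a field during comparison can be sketched as "remove that field from both sides, then diff". The code below is a simplified model of that idea and not Argo CD's actual diffing logic; drop_pointer and is_out_of_sync are hypothetical helpers that handle plain object keys only.

```python
import copy
from typing import Iterable


def drop_pointer(doc: dict, pointer: str) -> dict:
    """Return a copy of doc without the field named by a JSON pointer.

    Handles plain object keys only (no array indexes or escape sequences).
    """
    parts = [p for p in pointer.split("/") if p]
    out = copy.deepcopy(doc)
    if not parts:
        return out
    node = out
    for key in parts[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return out  # pointer does not resolve; nothing to drop
    node.pop(parts[-1], None)
    return out


def is_out_of_sync(desired: dict, live: dict, ignored: Iterable[str] = ()) -> bool:
    """Compare desired and live state after removing the ignored fields."""
    for pointer in ignored:
        desired = drop_pointer(desired, pointer)
        live = drop_pointer(live, pointer)
    return desired != live


desired = {"spec": {"replicas": 3, "template": {"image": "paktech-app:abc"}}}
live = {"spec": {"replicas": 7, "template": {"image": "paktech-app:abc"}}}

if __name__ == "__main__":
    print(is_out_of_sync(desired, live))                      # replicas differ
    print(is_out_of_sync(desired, live, ["/spec/replicas"]))  # replicas ignored
```

With the pointer ignored, an HPA-driven replica count no longer registers as drift, while any other difference (such as the image) still would.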


Per-Environment Configuration

Setting   | dev            | staging         | prod
Replicas  | 1              | 2               | 3 → HPA max 10
Sync      | Auto (Argo CD) | Auto (Argo CD)  | Auto (Argo CD)
Gate      | None           | Manual promote  | Manual + approval
PDB       | Disabled       | minAvailable: 1 | minAvailable: 2
AZ Spread | None           | ScheduleAnyway  | DoNotSchedule

The key difference between staging and prod topology spread: staging uses whenUnsatisfiable: ScheduleAnyway (a soft constraint: it will schedule pods even if it can't spread them across zones, which matters when staging may not have node capacity across all AZs). Prod uses whenUnsatisfiable: DoNotSchedule, a hard constraint that refuses to schedule a pod if it would violate the spread. This is intentional: better to have prod reject a rollout than to silently deploy all pods to a single AZ.

The AppProject also enforces a deny sync window for prod: no Argo CD syncs, including manual triggers, between 22:00 and 06:00 UTC. A promotion committed too late to sync before the window starts stays pending and syncs once the window ends.
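A window that crosses midnight is easy to get wrong, so here is a small sketch of the time check on its own. It is not how Argo CD evaluates sync windows (those are configured on the AppProject); in_deny_window is a hypothetical helper, and treating 22:00 as inclusive and 06:00 as exclusive is an assumption.

```python
from datetime import datetime, time, timezone

DENY_START = time(22, 0)  # window start, UTC (assumed inclusive)
DENY_END = time(6, 0)     # window end, UTC (assumed exclusive)


def in_deny_window(moment: datetime) -> bool:
    """True if moment falls inside the 22:00-06:00 UTC deny window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    t = moment.astimezone(timezone.utc).time()
    # The window wraps past midnight, so it is the union of two ranges.
    return t >= DENY_START or t < DENY_END


if __name__ == "__main__":
    print(in_deny_window(datetime(2026, 5, 1, 23, 30, tzinfo=timezone.utc)))
    print(in_deny_window(datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)))
```

The "or" in the return statement is the important part: a naive start <= t < end comparison never matches when the start is later in the day than the end.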


Problems I Ran Into (And How I Solved Them)

1. GitOps checkout silently failing

The CI deploy-dev job was failing with stat apps/paktech-app/dev/values.yaml: no such file or directory. After several iterations of cd gitops + working-directory: gitops approaches, I added a debug step that dumped $GITHUB_WORKSPACE:

echo "Workspace: $GITHUB_WORKSPACE"
ls -la $GITHUB_WORKSPACE/
ls -la $GITHUB_WORKSPACE/gitops/ 2>/dev/null || echo "gitops dir not found"
find $GITHUB_WORKSPACE -name "values.yaml" 2>/dev/null

The output showed the gitops/ folder existed but contained only .git: the checkout had failed silently (the GITOPS_APP_TOKEN wasn't propagating credentials properly). The fix was adding persist-credentials: true to the checkout step. Once credentials persisted, the full clone succeeded.

2. Working directory ambiguity

Several iterations of the yq command failed because of conflicting cd gitops and working-directory: gitops approaches. The cleanest fix was to drop both and use explicit paths everywhere:

yq e '.image.tag = "..."' -i gitops/apps/paktech-app/dev/values.yaml
git -C gitops config user.name "github-actions[bot]"
git -C gitops add apps/paktech-app/dev/values.yaml
git -C gitops push

git -C <dir> tells git to operate from that directory without changing the shell's working directory. No cd, no working-directory:, no ambiguity.

3. Rollback firing on staging-only runs

The original rollback condition was:

if: failure() && inputs.target_environment == 'prod'

When a promotion failed at the staging stage (so deploy-prod was skipped), failure() was still true: it reports that some ancestor job failed, not that deploy-prod itself failed. The rollback fired unnecessarily. The fix:

if: failure() && inputs.target_environment == 'prod' && needs.deploy-prod.result == 'failure'

The third clause ensures rollback only fires when deploy-prod actually ran and actually failed, not when it was skipped.
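The three clauses can be written out as a plain predicate to make the cases explicit. This is a model of the condition's logic rather than of GitHub Actions itself; should_rollback and its parameter names are hypothetical, and the result strings follow the values GitHub documents for needs.<job>.result (success, failure, cancelled, skipped).

```python
def should_rollback(any_ancestor_failed: bool,
                    target_environment: str,
                    deploy_prod_result: str) -> bool:
    """Mirror of: failure() && target == 'prod' && deploy-prod result == 'failure'."""
    return (
        any_ancestor_failed
        and target_environment == "prod"
        and deploy_prod_result == "failure"
    )


if __name__ == "__main__":
    # Staging failed, so deploy-prod never ran: no rollback.
    print(should_rollback(True, "prod", "skipped"))
    # deploy-prod ran and failed: rollback.
    print(should_rollback(True, "prod", "failure"))
```

Without the third argument, the first case would be indistinguishable from the second, which is the situation the original two-clause condition ran into.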

4. git revert scoping

The original rollback used git revert HEAD --no-edit. During testing I noticed this reverted staging changes that happened to be in the same commit as the prod promotion. Switching to git checkout HEAD~1 -- prod/values.yaml + explicit commit fixed the scope.


Tech Stack

Layer | Technology | Role in this project
Application | Python 3.12, Flask 3.0.3 | App factory pattern, structured JSON logging, 4 endpoints
Server | Gunicorn 22.0.0 | 2 workers × 4 threads in every environment, never Flask dev server
Container | Docker multi-stage | python:3.12-slim, non-root user, read-only FS, emptyDir /tmp
Registry | Azure Container Registry | paktechacr.azurecr.io: ACR cache, OIDC push auth
Orchestration | AKS (East US) | aks-paktech-prod: rolling update, PDB, AZ topology spread
GitOps | Argo CD v3.4.2 | ApplicationSet dual-source, AppProject security boundary, sync window
Package manager | Helm 3 | Chart in app repo; per-env values in GitOps repo
CI/CD | GitHub Actions | ci.yml · promote.yml · terraform.yml (3 separate pipelines)
Infrastructure | Terraform 1.7.5 | AKS, ACR, OIDC federation; OIDC auth, Azure Storage backend
CVE scanning | Trivy | Image scan on every build (HIGH + CRITICAL)
IaC scanning | Checkov | Terraform + Kubernetes + Dockerfile on every push
Auth | Azure OIDC Federation | No stored Azure credentials anywhere in any pipeline
Notifications | Slack Webhooks | Deploy, failure, rollback, Terraform drift

Key Learnings

1. Dual-repo GitOps is the right call, even for a personal project. Mixing app code and cluster state in one repo feels simpler until you need to diff what changed on the cluster versus what changed in the application. Keeping them separate makes the history of each clean and auditable.

2. git -C <dir> is better than cd <dir> in CI. Shell cd in a multi-line run block is fine normally, but combining it with GitHub Actions' working-directory: step property creates ambiguity. git -C is explicit, operates on the repo in the named directory, and doesn't touch the shell's CWD at all. Use it.

3. git diff --staged --quiet || git commit is essential for idempotency. Any CI step that commits to a Git repo should be re-runnable without failing. The || guard makes it a no-op when there's nothing to commit, so re-promoting the same SHA is safe.

4. Rollback scope matters. git revert HEAD is the wrong tool for targeted environment rollback. Use git checkout HEAD~1 -- path/to/file to restore exactly the file you mean to restore. In a shared GitOps repo where multiple environments are managed, a broad revert can silently undo unrelated work.

5. needs.deploy-prod.result == 'failure' is not the same as failure(). GitHub Actions' failure() function returns true if any ancestor job failed, even when the job you actually care about was skipped. Always be explicit about which job failed and what result you're checking.

6. OIDC federation eliminates a whole class of secret rotation toil. Once set up, it's invisible. No expiry reminders, no accidental commits with credentials, no "my pipeline broke because someone rotated the service principal and forgot to update the secret." The only credential that still needs a PAT is cross-repo Git push, because neither the OIDC token nor the workflow's built-in GITHUB_TOKEN can push to a different GitHub repo.


What I'd Add Next

If this were a real production system (rather than a personal project whose infrastructure was torn down post-demo to avoid Azure costs), the next things I'd wire up:

  • Real smoke tests: the current staging smoke test is a placeholder. In a real cluster with internal ingress, this would run curl against the internal ingress URL or use a self-hosted runner with VNet access.
  • Argo CD Notifications controller: the ApplicationSet has notification annotations ready; the controller just needs to be deployed and wired to the Slack webhook.
  • HPA in staging: currently only prod has HPA. Staging should mirror prod's autoscaling config to catch scaling-related issues before they reach production.
  • Argo CD Image Updater: in the current pattern, CI writes each new tag to the GitOps repo. Image Updater can watch the registry directly and commit the tag update to the GitOps repo automatically, removing the need for the CI pipeline to write to the GitOps repo at all.
  • OPA/Kyverno policy gates: the Terraform pipeline has a stub for OPA policy checks. Wiring real policies (e.g. "all prod images must have a Trivy scan result in the last 24 hours") would add a meaningful security gate before prod syncs.

Source Code

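The full source is on GitHub:

  • app-paktech-prod: https://github.com/PakTechLimited/app-paktech-prod
  • gitops-paktech-app: https://github.com/PakTechLimited/gitops-paktech-app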

Both repos have comprehensive READMEs covering setup, configuration, and operational runbooks.