
Introduction: The Infrastructure That Runs Itself (Safely) #

You’ve written the perfect Bicep template. It deploys flawlessly. Your hub-and-spoke architecture is beautiful. Then someone runs a cleanup script in the wrong subscription and deletes production.

Or a junior engineer commits Bicep syntax errors that break the pipeline. Or a long-running cleanup operation fails after 55 minutes because the auth token expired. Or cross-platform line endings cause mysterious deployment failures.

Good infrastructure as code isn’t just about deployment templates—it’s about the operational patterns that prevent disasters.

After managing 1,258 files of Azure infrastructure across 4 environments and building operational automation for platform teams, I’ve learned that the difference between hobbyist projects and production-grade infrastructure is the guardrails you build around it.

This post covers 7 operational excellence patterns embedded in PowerShell scripts, configuration files, and automation setup—the operational layer that makes infrastructure safe to operate at scale.


1. Guardrails Everywhere: The PowerShell Module Pattern #

The Problem #

When you have 9 cleanup scripts (firewall deallocation, VPN gateway deletion, resource group cleanup, spoke network teardown), the naïve approach is to copy validation logic across all of them:

# ❌ Copied in every script
$allowedSubscriptionIds = @('sub-1', 'sub-2', 'sub-3')
if ($subscriptionId -notin $allowedSubscriptionIds) {
    throw "Wrong subscription!"
}

$context = Get-AzContext
if ($context.Subscription.Name -ne $expectedName) {
    throw "Wrong context!"
}

Problems:

  • Duplication across 9 scripts
  • Inconsistent validation logic
  • Hard to update when requirements change
  • Copy-paste errors introduce security holes

The Solution: Shared Validation Module #

Created Common-AzureCleanupFunctions.psm1 with reusable validation functions:

<#
.SYNOPSIS
Common validation and utility functions for Azure cleanup scripts.

.DESCRIPTION
This module provides shared functionality including:
- Subscription and environment validation
- Azure context verification
- Guardrail error handling
#>

function Throw-GuardrailError {
    param([string]$Message)
    throw $Message
}

function Assert-RequiredParameters {
    param(
        [string]$SubscriptionId,
        [string]$IacEnv,
        [string]$LocationSuffix
    )

    if ([string]::IsNullOrWhiteSpace($SubscriptionId)) {
        Throw-GuardrailError "Subscription ID is not set."
    }

    if ([string]::IsNullOrWhiteSpace($IacEnv)) {
        Throw-GuardrailError "Environment name (iac_env) is not set."
    }

    if ([string]::IsNullOrWhiteSpace($LocationSuffix)) {
        Throw-GuardrailError "Location suffix is not set."
    }
}

function Assert-SubscriptionAllowed {
    param(
        [string]$SubscriptionId,
        [string]$AllowedSubscriptionIds
    )

    $allowedList = $AllowedSubscriptionIds.Split(',') |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_ }

    if (-not $allowedList) {
        Throw-GuardrailError "No allowed subscription IDs provided."
    }

    if ($allowedList -notcontains $SubscriptionId) {
        Throw-GuardrailError "Refusing to run: subscriptionId '$SubscriptionId' is not in the allowed list."
    }

    return $allowedList
}

function Assert-EnvironmentAllowed {
    param(
        [string]$IacEnv,
        [string]$AllowedEnvs
    )

    $allowedEnvList = $AllowedEnvs.Split(',') |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_ }

    if ($allowedEnvList -notcontains $IacEnv) {
        Throw-GuardrailError "Refusing to run: environment '$IacEnv' is not in the allowed list."
    }
}

function Assert-AzureContextValid {
    param(
        [string]$SubscriptionId,
        [string]$ExpectedSubscriptionName
    )

    Select-AzSubscription -SubscriptionId $SubscriptionId -ErrorAction Stop | Out-Null

    $context = Get-AzContext
    if ($context.Subscription.Id -ne $SubscriptionId) {
        Throw-GuardrailError "Refusing to run: Az context is not set to expected subscription."
    }

    if (-not [string]::IsNullOrWhiteSpace($ExpectedSubscriptionName)) {
        $sub = Get-AzSubscription -SubscriptionId $SubscriptionId -ErrorAction Stop
        if ($sub.Name -ne $ExpectedSubscriptionName) {
            Throw-GuardrailError "Refusing to run: subscription name '$($sub.Name)' does not match expected '$ExpectedSubscriptionName'."
        }
    }
}

Usage in Every Script #

Now every cleanup script imports the module and validates before executing:

# Import shared validation module
Import-Module "$PSScriptRoot/Common-AzureCleanupFunctions.psm1" -Force

# Validate all parameters
Assert-RequiredParameters -SubscriptionId $subscriptionId -IacEnv $iac_env -LocationSuffix $location_suffix

# Validate subscription is allowed
Assert-SubscriptionAllowed -SubscriptionId $subscriptionId -AllowedSubscriptionIds $allowedSubscriptionIds

# Validate environment is allowed
Assert-EnvironmentAllowed -IacEnv $iac_env -AllowedEnvs "dev,drdev"

# Validate Azure context
Assert-AzureContextValid -SubscriptionId $subscriptionId -ExpectedSubscriptionName $expectedSubscriptionName

Why This Matters #

Benefits:

  • Single source of truth for validation logic
  • Consistent error messages across all scripts
  • Update once, fix everywhere when requirements change
  • Testable (validation functions can be unit tested with Pester)
  • Self-documenting (module docstrings explain what each function does)

Result: Zero production incidents from cleanup scripts running in wrong subscriptions.
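Because the guardrail functions are pure string logic, they can be exercised without any Az modules or a live subscription. A minimal standalone check, with the two functions re-declared inline for illustration:

```powershell
function Throw-GuardrailError {
    param([string]$Message)
    throw $Message
}

function Assert-SubscriptionAllowed {
    param(
        [string]$SubscriptionId,
        [string]$AllowedSubscriptionIds
    )
    # Same parse-then-check logic as the module
    $allowedList = $AllowedSubscriptionIds.Split(',') |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_ }

    if (-not $allowedList) {
        Throw-GuardrailError "No allowed subscription IDs provided."
    }
    if ($allowedList -notcontains $SubscriptionId) {
        Throw-GuardrailError "Refusing to run: subscriptionId '$SubscriptionId' is not in the allowed list."
    }
    return $allowedList
}

# An allowed subscription passes (whitespace in the CSV is tolerated)
Assert-SubscriptionAllowed -SubscriptionId 'sub-2' -AllowedSubscriptionIds 'sub-1, sub-2 ,sub-3'

# An unknown subscription is blocked before any Az cmdlet would run
try {
    Assert-SubscriptionAllowed -SubscriptionId 'sub-x' -AllowedSubscriptionIds 'sub-1,sub-2'
} catch {
    Write-Output "Blocked: $($_.Exception.Message)"
}
```

This is exactly the kind of logic a Pester suite can pin down: one test per rejection path, no cloud dependencies.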

The Pattern #

┌─────────────────────────────────────┐
│  Cleanup Scripts (9 scripts)       │
│  - DeallocateAzureFirewall.ps1     │
│  - DeleteAzureFirewall.ps1         │
│  - DeleteVPNGateways.ps1           │
│  - etc.                            │
└────────────┬────────────────────────┘
             │ Import-Module
             ▼
┌─────────────────────────────────────┐
│  Common-AzureCleanupFunctions.psm1 │
│  - Assert-SubscriptionAllowed      │
│  - Assert-EnvironmentAllowed       │
│  - Assert-AzureContextValid        │
│  - Throw-GuardrailError            │
└─────────────────────────────────────┘

2. The Dry-Run Default Pattern #

The Disaster Scenario #

You have a script that deletes spoke network resource groups. It works perfectly in dev. Then someone runs it in production. With no confirmation. All spoke networks: gone.

The mistake: Making deletion the default behavior.

The Solution: Safe by Default #

DeleteSpokeNetworkResources.ps1 requires explicit opt-in for destructive operations:

param(
    [Parameter(Mandatory)]
    [string]$subscriptionId,

    [string]$expectedSubscriptionId = 'yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy',
    [string]$expectedSubscriptionName = 'Azure-Dev-Spokes',
    [string]$spokePrefixesCsv = 'azf,gmr,iag,gnt,gmr-vdi,iag-vdi,gnt-vdi',
    [string]$locationSuffixesCsv = 'wus,eus',

    [switch]$Apply  # ← The safety switch
)

# Validate subscription (guardrails from module)
if ($subscriptionId -ne $expectedSubscriptionId) {
    Throw-GuardrailError "Refusing to run: subscriptionId does not match expectedSubscriptionId."
}

# Split the CSV parameters into trimmed arrays
$spokePrefixes = $spokePrefixesCsv.Split(',') | ForEach-Object { $_.Trim() } | Where-Object { $_ }
$locationSuffixes = $locationSuffixesCsv.Split(',') | ForEach-Object { $_.Trim() } | Where-Object { $_ }

# Build regex to match spoke resource groups
$prefixPattern = ($spokePrefixes | ForEach-Object { [regex]::Escape($_) }) -join '|'
$locationPattern = ($locationSuffixes | ForEach-Object { [regex]::Escape($_) }) -join '|'
$rgRegex = "^(?:$prefixPattern)-network-(?:rg|auxiliary)-.+-(?:$locationPattern)$"

# Find matching resource groups
$rgCandidates = Get-AzResourceGroup | Where-Object {
    $_.ResourceGroupName -match $rgRegex
}

if (-not $rgCandidates) {
    Write-Output "No matching resource groups found."
    return
}

Write-Output "Matched resource groups in subscription ${expectedSubscriptionId}:"
$rgCandidates.ResourceGroupName | ForEach-Object { Write-Output "  - $_" }

# DRY RUN by default
if (-not $Apply) {
    Write-Output "`nDRY RUN: Pass -Apply to actually delete these resources."
    return
}

# Only reach here if -Apply flag was provided
foreach ($rg in $rgCandidates) {
    Remove-AzResourceGroup -Name $rg.ResourceGroupName -Force
    Write-Output "Deleted resource group: $($rg.ResourceGroupName)"
}

Usage Examples #

Safe exploration (default):

./DeleteSpokeNetworkResources.ps1 -subscriptionId "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"

# Output:
# Matched resource groups in subscription yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy:
#   - azf-network-rg-dev-wus
#   - gmr-network-rg-dev-wus
#   - iag-network-rg-dev-wus
#   - gnt-network-rg-dev-wus
#
# DRY RUN: Pass -Apply to actually delete these resources.

Actual deletion (explicit):

./DeleteSpokeNetworkResources.ps1 -subscriptionId "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy" -Apply

# Output:
# Matched resource groups...
# Deleted resource group: azf-network-rg-dev-wus
# Deleted resource group: gmr-network-rg-dev-wus
# ...

Why This Pattern Works #

Philosophy: Make the safe operation easy, make the dangerous operation explicit.

Benefits:

  • Zero accidental deletions from running scripts without reading them
  • Preview before destruction - see exactly what would be deleted
  • Pipeline integration - CI/CD can run in dry-run mode for validation
  • Training-friendly - new team members can explore safely
  • Documentation - dry-run output serves as script documentation

Cost: One extra parameter (minimal)

The Switch Pattern #

PowerShell [switch] parameters are perfect for this:

  • Default value: $false (safe)
  • Explicit to enable: -Apply (dangerous)
  • Clear intent: “Apply these changes”

Alternative names: -Execute or -Force, depending on context. Avoid -Confirm: it collides with PowerShell’s built-in common parameter of the same name.
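The default-off behavior of [switch] is easy to see with a toy function (Remove-Widget is a made-up name for illustration, not part of the repo):

```powershell
function Remove-Widget {
    param(
        [string]$Name,
        [switch]$Apply   # defaults to $false: dry run unless explicitly requested
    )
    if (-not $Apply) {
        return "DRY RUN: would delete $Name"
    }
    "Deleted $Name"
}

Remove-Widget -Name 'demo'          # dry run by default
Remove-Widget -Name 'demo' -Apply   # destruction requires explicit opt-in
```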


3. Retry Logic with Exponential Backoff #

The Azure Reality #

Azure resource deletion isn’t instant. Resources have dependencies. Azure’s eventual consistency model means:

  1. You delete a VNet
  2. Azure returns “accepted” (202)
  3. Behind the scenes, Azure is still cleaning up NICs, subnets, peerings
  4. If you try to delete the resource group immediately: 409 Conflict

Naïve approach: Fail and give up.

Production approach: Retry intelligently.

The Implementation #

From DeleteSpokeNetworkResources.ps1:

param(
    [int]$maxRetries = 3,
    [int]$retryDelaySeconds = 15
)

$failed = @()

foreach ($rg in $rgCandidates) {
    $rgName = $rg.ResourceGroupName

    # Get VNet info for logging
    $vnets = Get-AzVirtualNetwork -ResourceGroupName $rgName -ErrorAction SilentlyContinue
    $vnetNames = if ($vnets) { ($vnets.Name -join ', ') } else { "<none>" }

    if ($Apply) {
        $deleted = $false

        # Retry loop with exponential backoff
        for ($attempt = 1; $attempt -le $maxRetries -and -not $deleted; $attempt++) {
            try {
                Remove-AzResourceGroup -Name $rgName -Force -ErrorAction Stop | Out-Null
                Write-Output "✅ Deleted resource group $rgName (VNets: $vnetNames)"
                $deleted = $true

            } catch {
                $msg = $_.Exception.Message

                # Check if it's a conflict error (resource still deleting)
                if ($msg -match 'Conflict' -or $msg -match '409') {
                    $sleep = $retryDelaySeconds * $attempt  # Backoff grows with each attempt
                    Write-Output "⏳ Conflict deleting $rgName (attempt $attempt/$maxRetries). Retrying in ${sleep}s..."
                    Start-Sleep -Seconds $sleep
                    continue  # Try again
                }

                # Non-conflict errors: fail immediately
                Write-Output "❌ Error deleting resource group ${rgName}: $msg"
                break
            }
        }

        if (-not $deleted) {
            $failed += $rgName
        }

    } else {
        Write-Output "DRY RUN: would delete $rgName (VNets: $vnetNames)"
    }
}

# Report failures
if ($failed.Count -gt 0) {
    throw "Failed to delete resource groups: $($failed -join ', ')"
}

The Smart Retry Logic #

Key decisions:

  1. Only retry on conflicts (409 errors)

    • Non-retryable errors (permissions, not found) fail immediately
    • Saves time on errors that won’t resolve with waiting
  2. Increasing backoff

    • Attempt 1: wait 15 seconds
    • Attempt 2: wait 30 seconds
    • Attempt 3: wait 45 seconds
    • Strictly a linear multiplier (base delay × attempt) rather than true exponential doubling, but it serves the same goal: give Azure progressively more time to complete operations
  3. Track deletion state

    • $deleted boolean prevents unnecessary retries after success
    • Continue to next resource group once deletion succeeds
  4. Collect failures

    • Don’t exit on first failure
    • Delete everything possible, then report all failures
    • Allows partial progress in batch operations

Real-World Example #

⏳ Conflict deleting gmr-network-rg-dev-wus (attempt 1/3). Retrying in 15s...
⏳ Conflict deleting gmr-network-rg-dev-wus (attempt 2/3). Retrying in 30s...
✅ Deleted resource group gmr-network-rg-dev-wus (VNets: gmr-vnet-dev-wus)

Without retry: Fails on attempt 1, resource group left in partially deleted state.

With retry: Succeeds on attempt 3, resource group fully cleaned up.

The Pattern #

$maxRetries = 3
$deleted = $false

for ($attempt = 1; $attempt -le $maxRetries -and -not $deleted; $attempt++) {
    try {
        # Attempt operation
        Invoke-AzureOperation
        $deleted = $true  # Success

    } catch {
        if (Is-RetryableError $_) {
            $backoff = Calculate-Backoff $attempt
            Start-Sleep -Seconds $backoff
            continue  # Retry
        }
        throw  # Fail fast on non-retryable errors
    }
}

if (-not $deleted) {
    # Handle permanent failure
}
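Filling in the placeholders, a self-contained version of that skeleton might look like the sketch below. Invoke-WithRetry and the simulated operation are illustrative names, not code from the repo; the backoff uses the same base-delay × attempt scheme as the cleanup script:

```powershell
function Invoke-WithRetry {
    param(
        [scriptblock]$Operation,
        [int]$MaxRetries = 3,
        [int]$BaseDelaySeconds = 15
    )
    for ($attempt = 1; $attempt -le $MaxRetries; $attempt++) {
        try {
            return & $Operation
        } catch {
            # Retry only transient conflicts; everything else fails fast
            $retryable = $_.Exception.Message -match 'Conflict|409'
            if (-not $retryable -or $attempt -eq $MaxRetries) { throw }
            $backoff = $BaseDelaySeconds * $attempt
            Write-Output "Attempt $attempt failed; retrying in ${backoff}s..."
            Start-Sleep -Seconds $backoff
        }
    }
}

# Simulated operation: throws a conflict twice, then succeeds on attempt 3
$script:calls = 0
Invoke-WithRetry -BaseDelaySeconds 1 -Operation {
    $script:calls++
    if ($script:calls -lt 3) { throw '409 Conflict' }
    "Succeeded on attempt $script:calls"
}
```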

4. Git Hooks for Quality Gates #

The Problem #

Your CI/CD pipeline catches Bicep syntax errors… after someone commits broken code, pushes to the repository, triggers a pipeline run, waits 5 minutes, and sees the failure.

Feedback loop: 5–10 minutes
Developer experience: Frustrating
Repository history: Polluted with “fix syntax error” commits

The Solution: Pre-Commit Validation #

Setup-GitHooks.ps1 automates Git hook installation:

# Setup Git Hooks for Test Enforcement
# Run this script once to enable pre-commit testing

Write-Host "`n========================================" -ForegroundColor Cyan
Write-Host "Git Hooks Setup" -ForegroundColor Cyan
Write-Host "========================================`n" -ForegroundColor Cyan

try {
    # Check if Pester is installed
    Write-Host "Checking Pester installation..." -ForegroundColor Yellow
    $pester = Get-Module -ListAvailable -Name Pester |
        Where-Object { $_.Version -ge '5.0.0' } |
        Sort-Object Version -Descending |
        Select-Object -First 1

    if (-not $pester) {
        Write-Host "Pester 5.0+ not found. Installing..." -ForegroundColor Yellow
        Install-Module -Name Pester -MinimumVersion 5.0.0 -Force -SkipPublisherCheck
        Write-Host "Pester installed successfully.`n" -ForegroundColor Green
    } else {
        Write-Host "Pester $($pester.Version) is already installed.`n" -ForegroundColor Green
    }

    # Setup pre-commit hook
    $hookDest = Join-Path $PSScriptRoot ".git" "hooks" "pre-commit"

    if (Test-Path $hookDest) {
        Write-Host "Pre-commit hook already exists." -ForegroundColor Yellow
        $response = Read-Host "Do you want to overwrite it? (y/n)"
        if ($response -ne 'y') {
            Write-Host "Skipping hook setup.`n" -ForegroundColor Yellow
            exit 0
        }
    }

    # Create the pre-commit hook (shell script that calls PowerShell)
    $hookContent = @'
#!/bin/sh
# Pre-commit hook - runs Pester tests before allowing commit

# Change to the .git/hooks directory to ensure we're in the right location
cd "$(dirname "$0")" || exit 1

# Run PowerShell script using relative path
if [ -f "pre-commit.ps1" ]; then
    pwsh -NoProfile -ExecutionPolicy Bypass -File "./pre-commit.ps1"
    exit $?
else
    echo "pre-commit.ps1 not found. Allowing commit to proceed."
    exit 0
fi
'@

    # Write with LF endings so Git's shell can execute the hook on every platform (see pattern 5)
    Set-Content -Path $hookDest -Value ($hookContent -replace "`r`n", "`n") -NoNewline -Force
    Write-Host "Pre-commit hook created at: $hookDest" -ForegroundColor Green

    # Make executable on Unix-like systems
    if ($IsLinux -or $IsMacOS) {
        chmod +x $hookDest
        Write-Host "Hook made executable.`n" -ForegroundColor Green
    } else {
        Write-Host "`nNote: On Windows, Git runs hooks through its bundled shell, which invokes pwsh via the script above.`n" -ForegroundColor Cyan
    }

    # Run a test to verify everything works
    Write-Host "Running test suite to verify setup..." -ForegroundColor Yellow
    $testScript = Join-Path $PSScriptRoot "tests" "Run-Tests.ps1"

    if (Test-Path $testScript) {
        & $testScript

        if ($LASTEXITCODE -eq 0) {
            Write-Host "`n========================================" -ForegroundColor Green
            Write-Host "Setup Complete!" -ForegroundColor Green
            Write-Host "========================================" -ForegroundColor Green
            Write-Host "Pre-commit hook is now active." -ForegroundColor Green
            Write-Host "Tests will run automatically before each commit.`n" -ForegroundColor Green
        } else {
            Write-Host "`n========================================" -ForegroundColor Red
            Write-Host "Setup Complete (with warnings)" -ForegroundColor Red
            Write-Host "========================================" -ForegroundColor Red
            Write-Host "Hook is installed, but some tests failed." -ForegroundColor Yellow
            Write-Host "Please fix the failing tests before committing.`n" -ForegroundColor Yellow
        }
    }

    Write-Host "To disable the hook temporarily, use: git commit --no-verify`n" -ForegroundColor Cyan

} catch {
    Write-Host "`nError during setup: $($_.Exception.Message)" -ForegroundColor Red
    Write-Host "Please check the error and try again.`n" -ForegroundColor Yellow
    exit 1
}

What Gets Validated #

The pre-commit hook runs:

  1. Pester tests - PowerShell unit tests for cleanup scripts
  2. PSRule validation - Azure best practices and security checks for Bicep
  3. Bicep syntax checks - az bicep build on all changed .bicep files
  4. Linting - Custom rules from bicepconfig.json
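The orchestration shape of pre-commit.ps1 reduces to a loop over named checks that blocks the commit on any failure. The actual file isn’t shown in this post, so treat this as an illustrative skeleton; the Pester and Bicep scriptblocks are stand-ins:

```powershell
function Invoke-PreCommitChecks {
    param($Checks)  # ordered dictionary: check name -> scriptblock returning $true/$false

    $failures = @()
    foreach ($name in $Checks.Keys) {
        Write-Host "Running $name..."
        if (-not (& $Checks[$name])) { $failures += $name }
    }

    if ($failures.Count -gt 0) {
        Write-Host "Pre-commit validation failed: $($failures -join ', '). Commit blocked."
        return 1   # pre-commit.ps1 would end with: exit 1 (Git blocks the commit)
    }
    Write-Host "All checks passed."
    return 0
}

$exitCode = Invoke-PreCommitChecks -Checks ([ordered]@{
    'Pester tests' = { $true }  # stand-in: invoke tests/Run-Tests.ps1 and test $LASTEXITCODE
    'Bicep build'  = { $true }  # stand-in: az bicep build on staged .bicep files
})
```

Returning a code instead of calling exit directly keeps the orchestrator itself unit-testable.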

The Developer Experience #

Before (no hooks):

$ git commit -m "Add firewall rule"
[main abc123] Add firewall rule
 1 file changed

# Push to remote
$ git push
# Wait 5 minutes for pipeline
# Pipeline fails: Bicep syntax error
# Fix locally, commit again, push again, wait again...

After (with hooks):

$ git commit -m "Add firewall rule"
Running pre-commit tests...

Running Pester tests...
  ✅ Common-AzureCleanupFunctions.Tests.ps1 (8 tests passed)
  ✅ DeleteResourceGroups.Tests.ps1 (5 tests passed)

Validating Bicep files...
  ✅ hub.bicep
  ❌ definitions/gmr.bicep
     Line 42: Syntax error: Expected '}'

Pre-commit validation failed. Commit blocked.
Fix the errors above and try again.

# Fix syntax error
$ git commit -m "Add firewall rule"
Running pre-commit tests...
  ✅ All tests passed

[main abc123] Add firewall rule
 1 file changed

Feedback loop: 10 seconds
Developer experience: Fast, immediate feedback
Repository history: Clean, only working code

Bypassing Hooks (When Needed) #

For work-in-progress commits:

git commit --no-verify -m "WIP: incomplete implementation"

Use sparingly - most commits should pass validation.

The Pattern #

┌──────────────────────────────────┐
│     Developer runs git commit    │
└────────────┬─────────────────────┘
             │
             ▼
┌──────────────────────────────────┐
│   .git/hooks/pre-commit          │
│   (Shell script)                 │
└────────────┬─────────────────────┘
             │
             ▼
┌──────────────────────────────────┐
│   pre-commit.ps1                 │
│   (PowerShell orchestrator)      │
└────────────┬─────────────────────┘
             │
             ├─────> Run Pester tests
             ├─────> Run PSRule validation
             └─────> Validate Bicep syntax
                     │
                     ▼
                  Pass/Fail
                     │
           ┌─────────┴─────────┐
           │                   │
         Pass                Fail
           │                   │
    Allow commit        Block commit

5. The Line Ending Enforcement Pattern #

The Cross-Platform Catastrophe #

Scenario: You develop on Windows. Your CI/CD runs on Linux. You commit a Bicep file with CRLF line endings. The Linux pipeline fails with cryptic errors:

Error: Invalid character at line 42

Line 42 looks fine. But invisible \r\n vs \n line endings break parsing.

Or vice versa: PowerShell script committed with LF line endings. Windows execution fails with:

The term './script.ps1' is not recognized

The Solution: .gitattributes #

.gitattributes enforces line ending consistency:

# Normalize line endings across the repo

# Default: treat files as text and use LF in working tree
* text=auto eol=lf

# Explicit LF for common infra and scripts
*.sh           text eol=lf
*.yml          text eol=lf
*.yaml         text eol=lf
*.bicep        text eol=lf
*.bicepparam   text eol=lf
*.json         text eol=lf
*.md           text eol=lf

# Windows-specific scripts keep CRLF
*.ps1          text eol=crlf
*.bat          text eol=crlf
*.cmd          text eol=crlf

How .gitattributes Works #

Repository (always LF):

  • Files are stored with LF in Git
  • History is consistent regardless of contributor platform

Working tree (depends on file type):

  • Bicep/YAML: Checked out with LF (Unix standard)
  • PowerShell: Checked out with CRLF (Windows requirement)

On commit:

  • Git automatically normalizes line endings based on .gitattributes
  • Prevents accidental CRLF commits from Windows users

Why Specific File Types Matter #

LF files (.bicep, .yml, .json):

  • Infrastructure files should use Unix conventions
  • Bicep compiler expects LF
  • YAML parsers expect LF
  • Linux pipelines expect LF

CRLF files (.ps1, .bat):

  • PowerShell on Windows requires CRLF
  • Some Windows tools expect CRLF
  • Batch files must have CRLF

The Impact #

Without .gitattributes:

  • Windows devs commit CRLF inadvertently
  • Linux pipelines fail randomly
  • Mac developers see different line endings than Windows
  • Impossible to review diffs (entire files show as changed)

With .gitattributes:

  • Consistent line endings in repository
  • Cross-platform compatibility
  • Clean diffs showing actual changes
  • No more “works on my machine” line ending issues
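When you suspect a file has the wrong endings, a few lines of PowerShell settle it. Get-LineEnding is a helper name invented for this example:

```powershell
function Get-LineEnding {
    param([string]$Path)
    # -Raw reads the file as a single string so line endings survive intact
    $content = Get-Content -Path $Path -Raw
    if ($content -match "`r`n") { return 'CRLF' }
    if ($content -match "`n")   { return 'LF' }
    'NONE'
}

# Demonstrate with two temp files written with explicit endings
$lfFile   = New-TemporaryFile
$crlfFile = New-TemporaryFile
[IO.File]::WriteAllText($lfFile,   "param location string`nparam env string`n")
[IO.File]::WriteAllText($crlfFile, "Write-Host 'hello'`r`n")

Get-LineEnding -Path $lfFile     # LF
Get-LineEnding -Path $crlfFile   # CRLF
```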

Setting Up a New Clone #

When a developer clones the repo:

git clone https://github.com/org/azure-iac.git
cd azure-iac

# Git automatically reads .gitattributes
# Files are checked out with correct line endings
# No manual configuration needed

Fixing Existing Files #

If line endings are already mixed:

# Remove all files from Git's index
git rm --cached -r .

# Reset the index to match .gitattributes
git reset --hard

# Add all files back (normalizes line endings)
git add .

# Commit the normalized files
git commit -m "Normalize line endings per .gitattributes"

6. Bicep Analyzer Configuration: Custom Linting Rules #

The Linter Conflict #

Bicep’s built-in analyzer has a rule: explicit-values-for-loc-params

It wants:

param location string = 'westus'  // ✅ Explicit default

But our multi-environment pattern resolves the value at deploy time in the shared .bicepparam file:

param location = readEnvironmentVariable('location', 'WestUS')  // ❌ Analyzer doesn't like this

Problem: The analyzer flags this as a warning, but it’s intentional design for our multi-environment deployment model.

The Solution: Custom Configuration #

bicepconfig.json customizes linting rules:

{
  // See https://aka.ms/bicep/config for more information on Bicep configuration options
  // Press CTRL+SPACE/CMD+SPACE at any location to see Intellisense suggestions
  "analyzers": {
    "core": {
      "rules": {
        "explicit-values-for-loc-params": {
          "level": "off"
        }
      }
    }
  }
}

Decision Rationale #

Why disable this rule?

  1. Environment variables provide explicit values - Just not at template compile time
  2. Our pattern is intentional - Not a mistake or oversight
  3. One parameter file serves all environments - Better than duplicated files with explicit locations
  4. Runtime injection is required - For CI/CD pipeline integration

Alternative considered: Set rule to warning instead of off

Decision: off because warnings clutter output and hide real issues

Other Customizations #

You could customize additional rules:

{
  "analyzers": {
    "core": {
      "rules": {
        "no-unused-params": {
          "level": "warning"  // Default is error
        },
        "prefer-interpolation": {
          "level": "error"  // Enforce string interpolation over concat()
        },
        "use-stable-vm-image": {
          "level": "off"  // Allow 'latest' VM images in dev
        }
      }
    }
  }
}

The Impact #

Before customization:

$ az bicep build --file hub.bicep

Warning BCP037: The parameter "location" uses environment variable which may not provide an explicit value.
Warning BCP037: The parameter "source_branch" uses environment variable which may not provide an explicit value.
...
[50 more warnings from multi-environment pattern]

After customization:

$ az bicep build --file hub.bicep

Build succeeded. 0 warning(s), 0 error(s)

Clean output focuses attention on actual issues, not expected patterns.


7. Orchestration with Token Refresh #

The Long-Running Operation Problem #

You have a cleanup orchestrator that:

  1. Deallocates firewall (10 min)
  2. Deletes VPN gateways (15 min each = 30 min)
  3. Deletes Virtual Hub connections (20 min)
  4. Deletes resource groups (10 min)

Total: 70 minutes

Azure AD token validity: 60 minutes

Result: Auth expires at minute 60, operation fails at minute 61.

The Solution: Proactive Token Refresh #

From Invoke-CleanupOrchestrator.ps1:

#region Azure Authentication Functions

function Connect-AzureWithServicePrincipal {
    <#
    .SYNOPSIS
    Establishes Azure connection using service principal credentials from environment variables.

    .DESCRIPTION
    This function is used to refresh the Azure authentication token during long-running operations.
    It uses the service principal credentials exposed by the AzurePowerShell@5 task when
    addSpnToEnvironment is set to true.

    Based on the token refresh strategy documented in pipelines/TOKEN-REFRESH-STRATEGY.md
    #>
    param(
        [Parameter(Mandatory=$true)]
        [string]$subscriptionId
    )

    Write-Host "Refreshing Azure authentication token..."

    # Check if service principal credentials are available from AzurePowerShell task
    if (-not $env:servicePrincipalId -or -not $env:servicePrincipalKey -or -not $env:tenantId) {
        Write-Host "Service principal environment variables not found. Token refresh may not work."
        Write-Host "This is expected if running locally. Attempting to use existing context..."

        # Try to use existing context
        try {
            Select-AzSubscription -SubscriptionId $subscriptionId -ErrorAction Stop | Out-Null
            Write-Host "Using existing Azure context."
            return $true
        } catch {
            Write-Host "Failed to establish Azure context: $($_.Exception.Message)"
            return $false
        }
    }

    try {
        # Convert service principal key to secure string
        $securePassword = ConvertTo-SecureString $env:servicePrincipalKey -AsPlainText -Force
        $credential = New-Object System.Management.Automation.PSCredential($env:servicePrincipalId, $securePassword)

        # Connect using service principal
        Connect-AzAccount -ServicePrincipal `
            -Credential $credential `
            -Tenant $env:tenantId `
            -Subscription $subscriptionId `
            -ErrorAction Stop | Out-Null

        Write-Host "Azure authentication token refreshed successfully."
        return $true
    } catch {
        Write-Host "Failed to refresh Azure authentication: $($_.Exception.Message)"
        return $false
    }
}

function Test-AzureAuthentication {
    <#
    .SYNOPSIS
    Tests if the current Azure authentication is valid and can access the subscription.
    #>
    param(
        [Parameter(Mandatory=$true)]
        [string]$subscriptionId
    )

    try {
        # Try a simple operation to verify authentication is working
        $context = Get-AzContext -ErrorAction Stop

        if (-not $context) {
            return $false
        }

        # Verify we can actually query the subscription
        Get-AzSubscription -SubscriptionId $subscriptionId -ErrorAction Stop | Out-Null
        return $true
    } catch {
        return $false
    }
}

#endregion

# Main orchestration loop
$lastTokenRefresh = Get-Date
$cleanupTasks = @(
    'Deallocate-Firewall',
    'Delete-VPNGateways',
    'Delete-VirtualHubConnections',
    'Delete-ResourceGroups'
)

foreach ($task in $cleanupTasks) {
    # Check if token needs refresh (every 45 minutes)
    $timeSinceRefresh = (Get-Date) - $lastTokenRefresh
    if ($timeSinceRefresh.TotalMinutes -ge 45) {
        Write-Host "`n=== Token Refresh ===" -ForegroundColor Cyan

        $refreshed = Connect-AzureWithServicePrincipal -subscriptionId $subscriptionId

        if ($refreshed) {
            $lastTokenRefresh = Get-Date
            Write-Host "Token refresh completed. Continuing cleanup..." -ForegroundColor Green
        } else {
            Write-Host "Token refresh failed. Attempting to continue with existing context..." -ForegroundColor Yellow
        }
    }

    # Verify authentication before each task
    if (-not (Test-AzureAuthentication -subscriptionId $subscriptionId)) {
        Write-Host "Authentication validation failed before task: $task" -ForegroundColor Red

        # Attempt emergency token refresh
        $refreshed = Connect-AzureWithServicePrincipal -subscriptionId $subscriptionId

        if (-not $refreshed) {
            throw "Unable to establish valid Azure authentication. Aborting cleanup."
        }
    }

    # Execute cleanup task
    Write-Host "`n=== Running: $task ===" -ForegroundColor Cyan
    & ".\$task.ps1" -subscriptionId $subscriptionId -iac_env $iac_env -location_suffix $location_suffix
}

The Token Refresh Strategy #

Key elements:

  1. 45-minute refresh interval (not 60)

    • Azure tokens expire at 60 minutes
    • Refresh at 45 gives 15-minute safety buffer
    • Prevents mid-operation auth failures
  2. Service principal credentials from pipeline

    • Azure DevOps task exposes $env:servicePrincipalId, $env:servicePrincipalKey, $env:tenantId
    • Script uses these to re-authenticate
    • No credentials hardcoded in scripts
  3. Graceful degradation

    • If running locally (no env vars): use existing context
    • If refresh fails: attempt to continue with existing auth
    • Only fail if both refresh and existing auth are invalid
  4. Pre-task validation

    • Test authentication before each cleanup task
    • Emergency refresh if validation fails
    • Prevents task failures mid-operation
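The refresh-interval decision reduces to a small time comparison that can be lifted out and tested on its own. Test-TokenRefreshDue is an illustrative name, not a function from the orchestrator:

```powershell
function Test-TokenRefreshDue {
    param(
        [datetime]$LastRefresh,
        [int]$ThresholdMinutes = 45   # refresh well before the 60-minute token lifetime
    )
    ((Get-Date) - $LastRefresh).TotalMinutes -ge $ThresholdMinutes
}

Test-TokenRefreshDue -LastRefresh (Get-Date).AddMinutes(-50)   # past the 45-minute mark: refresh now
Test-TokenRefreshDue -LastRefresh (Get-Date).AddMinutes(-10)   # token still comfortably valid
```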

Azure DevOps Pipeline Integration #

The pipeline task must expose service principal credentials:

- task: AzurePowerShell@5
  displayName: 'Run Cleanup Orchestrator'
  inputs:
    azureSubscription: 'azure-iac-secret'
    ScriptType: 'FilePath'
    ScriptPath: '$(System.DefaultWorkingDirectory)/powerShellScripts/Invoke-CleanupOrchestrator.ps1'
    ScriptArguments: '-subscriptionId "$(subscriptionId)" -iac_env "$(iac_env)" -location_suffix "$(location_suffix)"'
    azurePowerShellVersion: 'LatestVersion'
    addSpnToEnvironment: true  # ← Critical: exposes service principal credentials

Without addSpnToEnvironment: true: Token refresh doesn’t work, operation fails at 60 minutes.

With addSpnToEnvironment: true: Service principal credentials available, token refresh succeeds.

The Same Pattern Everywhere #

This token refresh strategy is used in:

  1. Deployment pipelines - For long-running Bicep deployments
  2. Cleanup orchestrators - For sequential cleanup operations
  3. DR failover scripts - For multi-hour disaster recovery operations

Consistency: Same pattern across all long-running operations.


The Operational Maturity Spectrum #

| Level | Characteristics | Example |
| --- | --- | --- |
| L1: Scripts | One-off PowerShell scripts, manual execution, no validation | `Remove-AzResourceGroup -Name "rg-prod"` |
| L2: Reusable | Functions extracted, some parameter validation | `function Remove-RG { param($name) ... }` |
| L3: Safe | Guardrails, dry-run defaults, subscription validation | `Assert-SubscriptionAllowed`, `-Apply` switch |
| L4: Resilient | Retry logic, error handling, token refresh | Exponential backoff, `Connect-AzureWithServicePrincipal` |
| L5: Automated | Git hooks, pre-commit tests, CI/CD integration | Pre-commit Pester tests, automated PSRule validation |

This repository operates at Level 5: Automation with safety guardrails, resilience patterns, and quality gates built in.

Progression Path #

  • L1 → L2: Extract reusable functions
  • L2 → L3: Add validation and dry-run modes
  • L3 → L4: Implement retry logic and error handling
  • L4 → L5: Automate quality gates with Git hooks and CI/CD

Most organizations stop at L2 or L3. Reaching L5 requires investment in operational excellence.


The Pattern Library: Extractable to Any Environment #

These patterns aren’t Azure-specific or IaC-specific. They apply to any operational automation:

1. Shared Validation Module #

Pattern: Extract common validation logic into a reusable module
Applies to: Cleanup scripts, deployment scripts, migration scripts
Benefit: Consistent guardrails across all automation
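
For illustration, a guardrail function along these lines could live in the shared module (`Assert-SubscriptionAllowed` is named earlier in the post; the body here is an assumed sketch, and the subscription IDs are placeholders):

```powershell
# Illustrative: centralized subscription guardrail in a shared .psm1 module
function Assert-SubscriptionAllowed {
    param(
        [Parameter(Mandatory)] [string]   $SubscriptionId,
        [string[]] $AllowedSubscriptionIds = @('sub-1', 'sub-2', 'sub-3')
    )
    if ($SubscriptionId -notin $AllowedSubscriptionIds) {
        throw "Subscription '$SubscriptionId' is not in the allow-list. Aborting."
    }
    $context = Get-AzContext
    if ($context.Subscription.Id -ne $SubscriptionId) {
        throw "Current Az context does not match the target subscription."
    }
}
Export-ModuleMember -Function Assert-SubscriptionAllowed
```

Every cleanup script then does `Import-Module ./Common-AzureCleanupFunctions.psm1` and calls the same function, so a change to the allow-list happens in exactly one place.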

2. Dry-Run Default Switch #

Pattern: Make safe operations the default; require an explicit flag for destruction
Applies to: Database migrations, resource deletion, configuration changes
Benefit: Prevents accidental execution of dangerous operations
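
The `-Apply` switch mentioned in the maturity table is the key mechanism: a bare invocation only reports what would happen. A minimal sketch (function name is illustrative):

```powershell
# Illustrative: destructive action requires an explicit -Apply switch;
# running the function with no switches is always a dry run.
function Remove-SpokeResourceGroup {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [switch] $Apply   # omitted by default, so dry-run is the default
    )
    if (-not $Apply) {
        Write-Host "[DRY RUN] Would delete resource group '$Name'." -ForegroundColor Yellow
        return
    }
    Remove-AzResourceGroup -Name $Name -Force
}
```

An operator must consciously type `-Apply` to destroy anything, which turns "I ran the wrong script" into a harmless report instead of an outage.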

3. Conflict-Aware Retry Logic #

Pattern: Retry only on transient errors, with exponential backoff
Applies to: API calls, resource operations, network requests
Benefit: Handles eventual consistency gracefully
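
The "conflict-aware" part matters: permanent errors should fail immediately, and only transient conditions should be retried. A sketch of the shape (the error-matching pattern is an assumption; real code would inspect the error type or status code):

```powershell
# Illustrative: retry only on transient conflicts, with exponential backoff
function Invoke-WithRetry {
    param(
        [Parameter(Mandatory)] [scriptblock] $Action,
        [int] $MaxAttempts = 5
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            # Only retry errors that look transient; rethrow everything else
            $transient = $_.Exception.Message -match 'Conflict|TooManyRequests|RetryableError'
            if (-not $transient -or $attempt -eq $MaxAttempts) { throw }
            $delaySeconds = [math]::Pow(2, $attempt)  # 2, 4, 8, 16... seconds
            Write-Host "Transient error on attempt $attempt; retrying in $delaySeconds s..."
            Start-Sleep -Seconds $delaySeconds
        }
    }
}
```

Usage is `Invoke-WithRetry -Action { Remove-AzVirtualNetwork ... }`, so the backoff policy lives in one function rather than being re-implemented around every call.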

4. Git Hooks Setup Automation #

Pattern: Provide scripts to install quality gates as Git hooks
Applies to: Linting, testing, security scanning, formatting
Benefit: Shifts quality gates left without a CI/CD dependency
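
A setup script for this pattern can be very small: write the hook file into `.git/hooks` and make it executable. A hedged sketch (the test path and Pester invocation are assumptions, not the repository's actual hook):

```powershell
# Illustrative: install a pre-commit hook that runs Pester before each commit
$hookPath = Join-Path (git rev-parse --git-dir) 'hooks/pre-commit'
@'
#!/bin/sh
pwsh -NoProfile -Command "Invoke-Pester -Path ./tests -CI; exit $LASTEXITCODE"
'@ | Set-Content -Path $hookPath

# Hooks must be executable on Linux/macOS
if ($IsLinux -or $IsMacOS) { chmod +x $hookPath }
Write-Host "Pre-commit hook installed at $hookPath"
```

Because `.git/hooks` is not versioned, committing an installer script like this to the repo is what makes the gate reproducible for every clone.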

5. Cross-Platform Line Ending Management #

Pattern: Use .gitattributes to enforce file-type-specific line endings
Applies to: Any multi-platform repository with scripts and configs
Benefit: Prevents “works on my machine” issues caused by invisible characters
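
A typical `.gitattributes` for a mixed PowerShell/Bash repository looks something like this (contents illustrative; adjust the extensions to your repo):

```
# Normalize everything to LF in the repository by default
* text=auto

# PowerShell and Windows batch files check out with CRLF
*.ps1 text eol=crlf
*.cmd text eol=crlf

# Shell scripts must always be LF, even on Windows checkouts
*.sh  text eol=lf
```

The later, more specific patterns override the `*` default, so a Bash script cloned on Windows can never pick up CRLF endings that would break it on a Linux build agent.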

6. Token Refresh for Long Operations #

Pattern: Proactively refresh auth tokens before expiration
Applies to: Long-running operations against any API with token expiration
Benefit: Operations that take longer than 60 minutes can run reliably


Conclusion: Operational Excellence as a Competitive Advantage #

Anyone can write a PowerShell script that deletes resources. Few build operational automation that:

  • Prevents disasters through layered guardrails
  • Provides visibility with dry-run defaults
  • Handles failures gracefully with smart retry logic
  • Shifts quality gates left with pre-commit validation
  • Works cross-platform with line ending enforcement
  • Runs longer than a single token lifetime with proactive token refresh

This is the difference between scripts and systems.

Good infrastructure code deploys resources. Great infrastructure code makes operations safe, reliable, and scalable.

The patterns in this repository—shared validation modules, dry-run defaults, retry logic, Git hooks, line ending enforcement, and token refresh—represent operational maturity that separates engineers who write scripts from engineers who build platforms.

If your organization is struggling with operational incidents, inconsistent quality, or “works on my machine” issues, these patterns provide a roadmap from ad-hoc automation to production-grade operational excellence.


Technologies: PowerShell, Azure CLI, Pester, Git Hooks, PSRule, Azure DevOps

Skills Demonstrated: Operational excellence, PowerShell automation, Quality assurance, Resilience engineering, DevOps practices, Platform engineering


Related posts:

Part of a series on building enterprise-scale Azure infrastructure at scale.