The Incident #
Early in the project, I was testing a cleanup script against the dev environment. Or so I thought.
I ran the script. Within 30 seconds, I realized my terminal was connected to the production subscription.
I’d just deleted critical network security groups from production.
That happened once. Never again.
The Problem: Humans Make Mistakes #
You can be careful. You can be experienced. You can double-check. You will still make mistakes.
The question isn’t “Will I make a mistake?” It’s “When I make a mistake, what breaks?”
The Solution: Multiple Safety Layers #
Every operational script now includes multiple independent safety checks. If any check fails, the script exits before touching anything.
Layer 1: Subscription ID Allowlist #
# cleanup-dev-resources.ps1
param(
    [Parameter(Mandatory=$false)]
    [switch]$Apply  # Dry-run by default
)

# Explicitly allowed subscriptions
$allowedSubscriptionIds = @(
    'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',  # Azure-Dev-Hub
    'yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy'   # Azure-Dev-Spokes
)

# Get current subscription
$currentSubscription = az account show | ConvertFrom-Json
$subscriptionId = $currentSubscription.id
$subscriptionName = $currentSubscription.name

# Check 1: Subscription ID
if ($subscriptionId -notin $allowedSubscriptionIds) {
    Write-Error "❌ SAFETY CHECK FAILED: Subscription not in allowed list"
    Write-Error "Current subscription: $subscriptionName ($subscriptionId)"
    Write-Error "Allowed subscriptions:"
    $allowedSubscriptionIds | ForEach-Object { Write-Error " - $_" }
    exit 1
}
Write-Host "✅ Subscription ID check passed: $subscriptionName"
Layer 2: Subscription Name Validation #
# Check 2: Subscription name (redundant check)
$expectedNames = @('Azure-Dev-Hub', 'Azure-Dev-Spokes')
if ($subscriptionName -notin $expectedNames) {
    Write-Error "❌ SAFETY CHECK FAILED: Subscription name mismatch"
    Write-Error "Current: $subscriptionName"
    Write-Error "Expected one of: $($expectedNames -join ', ')"
    exit 1
}
Write-Host "✅ Subscription name check passed"
Layer 3: Environment Tag Validation #
# Check 3: Verify environment tag on resource groups
$resourceGroups = az group list | ConvertFrom-Json
foreach ($rg in $resourceGroups) {
    $envTag = $rg.tags.environment
    if ($envTag -eq 'production' -or $envTag -eq 'prod') {
        Write-Error "❌ SAFETY CHECK FAILED: Production environment detected"
        Write-Error "Resource group $($rg.name) has environment tag: $envTag"
        exit 1
    }
}
Write-Host "✅ Environment tag check passed"
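The same tag check can be sketched for shell-based cleanup scripts. This is a minimal sketch, not the script I actually run: the function name check_no_production_tags is hypothetical, and it assumes the caller feeds it one line per resource group containing the group name followed by the value of its environment tag.

```shell
# Hypothetical helper: reads "name environment-tag" lines on stdin and
# fails if any environment tag is "prod" or "production" (any letter case).
check_no_production_tags() {
  local name env env_lc status=0
  while read -r name env; do
    # Lowercase the tag so the comparison is case-insensitive,
    # matching the behavior of PowerShell's -eq above.
    env_lc=$(printf '%s' "$env" | tr '[:upper:]' '[:lower:]')
    case "$env_lc" in
      prod|production)
        echo "SAFETY CHECK FAILED: $name has environment tag: $env" >&2
        status=1
        ;;
    esac
  done
  return "$status"
}

# Possible usage (assumes az is logged in; exits before any deletion):
#   az group list --query "[].[name, tags.environment]" -o tsv \
#     | check_no_production_tags || exit 1
```

Unlike the PowerShell loop, this sketch keeps scanning after the first hit so every offending resource group is reported before the script exits.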
Layer 4: Explicit Confirmation Required #
# Check 4: Dry-run by default, explicit -Apply required
if (-not $Apply) {
    Write-Host "`n⚠️ DRY-RUN MODE (no changes will be made)" -ForegroundColor Yellow
    Write-Host "Resources that would be deleted:"
    # Show what would be deleted
    az resource list --query "[?resourceGroup=='rg-hub-dev'].{name:name, type:type}" -o table
    Write-Host "`nTo actually delete these resources, run with the -Apply flag"
    exit 0
}
Write-Host "`n⚠️ APPLY MODE - Resources will be deleted!" -ForegroundColor Red
Start-Sleep -Seconds 3  # Give time to cancel with Ctrl+C
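The dry-run-by-default pattern carries over to bash. The sketch below is illustrative only: the --apply flag and the parse_args and run_or_print helpers are hypothetical names, chosen to mirror the -Apply switch above.

```shell
# Dry-run by default: destructive commands only execute with --apply.
APPLY=false

# Hypothetical argument parser: accepts only --apply, rejects anything else.
parse_args() {
  local arg
  APPLY=false
  for arg in "$@"; do
    case "$arg" in
      --apply) APPLY=true ;;
      *)
        echo "Unknown argument: $arg" >&2
        APPLY=false   # never stay in apply mode after a parse error
        return 2
        ;;
    esac
  done
}

# Runs the given command in apply mode; otherwise only prints it.
run_or_print() {
  if [ "$APPLY" = true ]; then
    "$@"
  else
    echo "[dry-run] $*"
  fi
}

# Possible usage inside a cleanup script:
#   parse_args "$@" || exit 2
#   run_or_print az resource delete --ids "$resource_id"
```

Routing every destructive call through one wrapper means a forgotten flag results in printed commands rather than deleted resources.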
In CI/CD Pipelines #
For Azure DevOps pipelines, run the job against a dedicated environment and validate the subscription before the cleanup step:
jobs:
- deployment: CleanupDevEnvironment
  displayName: 'Cleanup Dev Resources'
  environment: 'dev-cleanup'  # Requires approval if misconfigured
  variables:
    allowedSubscription: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
  strategy:
    runOnce:
      deploy:
        steps:
        - task: AzureCLI@2
          displayName: 'Validate Subscription'
          inputs:
            azureSubscription: 'Azure-Dev-ServiceConnection'
            scriptType: 'bash'
            scriptLocation: 'inlineScript'
            inlineScript: |
              CURRENT_SUB=$(az account show --query id -o tsv)
              if [ "$CURRENT_SUB" != "$(allowedSubscription)" ]; then
                echo "❌ Subscription mismatch!"
                echo "Current: $CURRENT_SUB"
                echo "Expected: $(allowedSubscription)"
                exit 1
              fi
              echo "✅ Subscription validated"
        - task: AzureCLI@2
          displayName: 'Cleanup Resources'
          inputs:
            azureSubscription: 'Azure-Dev-ServiceConnection'
            scriptType: 'bash'
            scriptLocation: 'scriptPath'
            scriptPath: 'scripts/cleanup-dev.sh'
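The pipeline validates the subscription in a separate step, but the cleanup script can repeat the check itself so it stays safe when run outside the pipeline. The contents of scripts/cleanup-dev.sh are not shown here, so the following is only a sketch of what such a guard might look like; the function names are hypothetical and the IDs are the same placeholders used earlier.

```shell
# Hypothetical guard for the top of a bash cleanup script.
# Returns 0 only when the given ID exactly matches an allowlisted subscription.
is_allowed_subscription() {
  case "$1" in
    'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx') return 0 ;;  # Azure-Dev-Hub
    'yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy') return 0 ;;  # Azure-Dev-Spokes
    *) return 1 ;;  # anything else, including an empty ID, is rejected
  esac
}

# Exits the script unless the active az subscription is allowlisted.
require_allowed_subscription() {
  local current
  # If az fails (for example, not logged in), treat the ID as empty.
  current=$(az account show --query id -o tsv 2>/dev/null) || current=""
  if ! is_allowed_subscription "$current"; then
    echo "SAFETY CHECK FAILED: subscription '${current:-unknown}' is not in the allowed list" >&2
    exit 1
  fi
  echo "Subscription ID check passed: $current"
}

# Possible usage: call this before any command that changes resources.
#   require_allowed_subscription
```

Keeping the allowlist inside the script gives two independent checks: one in the pipeline definition and one in the code that actually deletes things.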
Resource Locks for Production #
On production resources, add locks:
# Lock production resource groups to prevent deletion
az lock create \
  --name 'prevent-deletion' \
  --lock-type CanNotDelete \
  --resource-group rg-hub-prod

# Lock critical resources individually
az lock create \
  --name 'prevent-deletion' \
  --lock-type CanNotDelete \
  --resource-group rg-hub-prod \
  --resource-name fw-hub-prod \
  --resource-type Microsoft.Network/azureFirewalls
With locks in place, Azure rejects the deletion even if a delete script is run against production by mistake.
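Locks only help if they are actually there, so it can be worth checking for them periodically. The sketch below is an assumption-laden illustration: audit_locks is a hypothetical helper, and it assumes that az lock list returns each lock's level (CanNotDelete or ReadOnly) for the given resource group.

```shell
# Hypothetical audit: reports resource groups with no delete-blocking lock.
# Arguments: one or more resource group names.
audit_locks() {
  local rg levels missing=0
  for rg in "$@"; do
    # A failed az call is treated as "no locks found" so the audit fails safe.
    levels=$(az lock list --resource-group "$rg" --query "[].level" -o tsv 2>/dev/null) || levels=""
    case "$levels" in
      *CanNotDelete*|*ReadOnly*) echo "locked:   $rg" ;;
      *) echo "UNLOCKED: $rg" >&2; missing=1 ;;
    esac
  done
  return "$missing"
}

# Possible usage:
#   audit_locks rg-hub-prod || echo "At least one production group has no lock" >&2
```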
The Results #
Since implementing these guardrails:
- ✅ 0 accidental production changes
- ✅ 0 wrong-subscription incidents
- ✅ Scripts fail safely and loudly before touching anything
- ✅ Dry-run mode catches mistakes before they happen
The overhead:
- 5-10 lines of validation code per script
- 3 seconds of validation time
- Zero cognitive load (automated checks)
The Lesson #
Production systems require production-grade safety. Assume mistakes will happen, and build systems that fail safely.
Good guardrails:
- Are automated (not reliant on human vigilance)
- Have multiple independent checks (defense in depth)
- Fail loudly and early (before any changes)
- Require explicit intent for destructive actions (dry-run by default)
Don’t rely on being careful. Rely on systems that make it hard to break things.
Part of a series on lessons learned managing enterprise Azure infrastructure at scale.