Over the past year, I built and maintained an enterprise Azure infrastructure platform comprising 1,258 Bicep files (templates, parameters, and modules) across 6 spoke networks, 4 environments, and multiple Azure subscriptions.

This series shares 7 critical lessons I learned the hard way—lessons that transformed how I think about Infrastructure as Code, cloud architecture, and operational excellence at scale.

The Challenge #

Managing enterprise-grade Azure infrastructure that includes:

  • Hub-and-spoke architecture with Azure Virtual WAN
  • Azure Firewall Premium with threat intelligence and IDPS
  • ExpressRoute circuits and dual VPN gateways
  • 6 business unit networks across dev, QA, production, and DR environments
  • Multi-subscription governance and cost management
  • Automated deployment pipelines with safety guardrails

Each lesson below addresses a real production problem I faced and solved.

The 7-Part Series #

1. When Your Azure Deployment Fails at Minute 61 #

The Problem: Long-running Azure deployments consistently failed with token expiration errors after 60 minutes.

The Solution: Asynchronous deployment with proactive token refresh every 45 minutes.

Key Takeaway: Enterprise automation requires resilience patterns beyond the happy path. Authentication, retry logic, and graceful degradation aren’t optional—they’re essential.
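A minimal sketch of the proactive-refresh idea, assuming a 60-minute token lifetime and a 45-minute refresh interval as described above. The names `RefreshingToken`, `wait_for_deployment`, `fetch`, and `is_done` are hypothetical; in a real pipeline `fetch` would wrap whatever credential call you use, and `is_done` would query the deployment status.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

REFRESH_INTERVAL_SECONDS = 45 * 60  # refresh well before the 60-minute expiry


@dataclass
class RefreshingToken:
    """Caches a token and re-fetches it once it is older than the interval."""

    fetch: Callable[[], str]  # hypothetical: wraps your credential call
    clock: Callable[[], float] = time.monotonic
    interval: float = REFRESH_INTERVAL_SECONDS
    _value: str = field(default="", init=False)
    _fetched_at: float = field(default=float("-inf"), init=False)

    def get(self) -> str:
        now = self.clock()
        if now - self._fetched_at >= self.interval:
            self._value = self.fetch()
            self._fetched_at = now
        return self._value


def wait_for_deployment(
    is_done: Callable[[str], bool],
    token: RefreshingToken,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the deployment finishes, passing a fresh-enough token each time."""
    while not is_done(token.get()):
        sleep(poll_seconds)
```

The deployment itself is started asynchronously and only polled here, so no single call has to outlive a token. The clock and sleep functions are injectable purely so the behavior can be checked without waiting an hour.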


2. Why “Deployment In Progress” Isn’t Good Enough #

The Problem: When deployments take 2+ hours, “Deployment In Progress” provides zero visibility into what’s actually happening.

The Solution: Real-time tracking of individual resource operations showing what’s deploying, what’s completed, and what’s failed.

Key Takeaway: In production systems, visibility is as important as functionality. Instrumentation should be built in from day one.
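As a rough illustration of per-resource tracking, the sketch below groups deployment operations by state and renders a short progress report. The input shape (a list of dicts with `resource` and `state` keys, with states `Succeeded`, `Running`, and `Failed`) is a simplifying assumption, not the exact payload Azure returns; you would map the real operation records into this shape first.

```python
from collections import defaultdict


def summarize_operations(operations):
    """Group resource names by their state (assumed keys: 'resource', 'state')."""
    groups = defaultdict(list)
    for op in operations:
        groups[op["state"]].append(op["resource"])
    return {state: sorted(names) for state, names in groups.items()}


def format_progress(operations):
    """Render a one-line total plus one line per running or failed resource."""
    summary = summarize_operations(operations)
    total = len(operations)
    done = len(summary.get("Succeeded", []))
    lines = [f"{done}/{total} resources succeeded"]
    for state in ("Running", "Failed"):
        for name in summary.get(state, []):
            lines.append(f"  {state}: {name}")
    return "\n".join(lines)
```

Printing this report on every poll turns "Deployment In Progress" into a list of what is still running and what has already failed.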


3. Hub-and-Spoke at Scale: Why Deployment Order Matters #

The Problem: Spokes can’t deploy before the hub exists, but deploying the spokes one at a time stretched the full deployment to about 6 hours.

The Solution: Explicit dependency orchestration with hub-first deployment, then parallel spoke deployment.

Key Takeaway: Infrastructure dependencies are as critical as code dependencies. In this case, parallelizing the spokes cut deployment time by about 63% (from 6 hours to 2.25 hours).
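The hub-first, spokes-in-parallel ordering can be sketched with a thread pool. This is an illustration in Python rather than pipeline YAML; `deploy_hub_then_spokes` and the `deploy` callable are hypothetical names, and `deploy` stands in for whatever actually starts and waits on a deployment.

```python
from concurrent.futures import ThreadPoolExecutor


def deploy_hub_then_spokes(deploy, hub, spokes, max_workers=6):
    """Deploy the hub first, then all spokes concurrently.

    `deploy` is a hypothetical callable that blocks until one deployment
    finishes and raises on failure.
    """
    # If the hub fails, the exception propagates and no spoke is started.
    deploy(hub)

    # Spokes depend only on the hub, so they can run side by side.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces evaluation so any spoke failure is raised here.
        return list(pool.map(deploy, spokes))
```

The explicit ordering is the point: the dependency is enforced by the orchestration code instead of being left to chance or to a serial loop.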


4. Parameterization is an Art #

The Problem: Managing 9,000+ lines of configuration across 6 parameter files without losing your mind.

The Solution: Scoped parameter objects that balance flexibility with discoverability.

Key Takeaway: Good abstraction makes complex systems manageable. Bad abstraction just hides complexity until it explodes.
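To make "scoped parameter objects" concrete, here is a loose analogy in Python rather than Bicep: related settings are grouped into small typed objects instead of one flat list of parameters. Every class and field name below (`NetworkParams`, `FirewallParams`, `SpokeParams`, `load_spoke`, and their fields and defaults) is a hypothetical example, not the series' actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NetworkParams:
    address_space: str
    enable_bastion: bool = False


@dataclass(frozen=True)
class FirewallParams:
    sku: str = "Premium"
    idps_mode: str = "Alert"


@dataclass(frozen=True)
class SpokeParams:
    name: str
    environment: str
    network: NetworkParams
    firewall: FirewallParams = field(default_factory=FirewallParams)


def load_spoke(raw: dict) -> SpokeParams:
    """Build a SpokeParams from a nested dict; unknown keys raise TypeError."""
    return SpokeParams(
        name=raw["name"],
        environment=raw["environment"],
        network=NetworkParams(**raw["network"]),
        firewall=FirewallParams(**raw.get("firewall", {})),
    )
```

Each scope stays small enough to read at a glance, defaults live next to the setting they describe, and a misspelled key fails loudly instead of being silently ignored.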


5. Stop Reminding Developers to Save Money #

The Problem: Dev environments ran 24/7 at a cost of $8,400/month, and manual reminders to shut down resources did not work.

The Solution: Automated nightly cleanup of expensive resources (firewalls, gateways) via scheduled pipelines.

Key Takeaway: Developer experience and cost optimization aren’t mutually exclusive. Good automation achieves both (70% cost reduction with zero complaints).
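The selection step of such a nightly job might look like the sketch below: pick only expensive resource types, and only in environments that are allowed to be torn down. The `environment` tag name, the default of `("dev",)`, and the exact set of types are assumptions for illustration.

```python
# Resource types treated as "expensive" in this sketch (compared case-insensitively).
EXPENSIVE_TYPES = {
    "microsoft.network/azurefirewalls",
    "microsoft.network/vpngateways",
    "microsoft.network/virtualnetworkgateways",
}


def select_for_cleanup(resources, environments=("dev",)):
    """Return names of expensive resources tagged with a cleanup-eligible environment.

    Each resource is assumed to be a dict with 'name', 'type', and optional 'tags'.
    Untagged resources are never selected.
    """
    selected = []
    for res in resources:
        tags = res.get("tags") or {}
        if (
            res["type"].lower() in EXPENSIVE_TYPES
            and tags.get("environment") in environments
        ):
            selected.append(res["name"])
    return selected
```

At the figures quoted above, a 70% reduction on $8,400/month works out to roughly $5,880/month saved.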


6. I Ran a Cleanup Script Against Production Once #

The Problem: Accidentally ran a delete script against the wrong subscription. Once.

The Solution: Multiple safety layers—subscription ID allowlists, name validation, environment tags, explicit apply flags, dry-run by default.

Key Takeaway: Production systems require production-grade safety. Assume mistakes will happen and build systems that fail safely.
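The layers listed above can be stacked in a few lines. The following is a sketch under stated assumptions: the subscription ID is a placeholder, the `rg-<name>-dev` naming pattern and the `environment` tag are hypothetical conventions, and `delete` stands in for the real deletion call.

```python
import re

# Placeholder allowlist; real IDs would come from reviewed configuration.
ALLOWED_SUBSCRIPTIONS = {"00000000-0000-0000-0000-000000000001"}
# Hypothetical naming convention for disposable dev resource groups.
NAME_PATTERN = re.compile(r"rg-[a-z0-9]+-dev")


class UnsafeTargetError(Exception):
    """Raised when a target fails any safety check."""


def check_target(subscription_id, resource_group, tags):
    """Refuse anything not allowlisted, not dev-named, or not dev-tagged."""
    if subscription_id not in ALLOWED_SUBSCRIPTIONS:
        raise UnsafeTargetError(f"subscription {subscription_id} is not allowlisted")
    if not NAME_PATTERN.fullmatch(resource_group):
        raise UnsafeTargetError(f"{resource_group} does not match the dev naming pattern")
    if tags.get("environment") != "dev":
        raise UnsafeTargetError(f"{resource_group} is not tagged environment=dev")


def cleanup(subscription_id, resource_group, tags, delete, apply=False):
    """Dry-run by default; only delete when apply=True and every check passes."""
    check_target(subscription_id, resource_group, tags)
    if not apply:
        return f"DRY RUN: would delete {resource_group}"
    delete(resource_group)
    return f"deleted {resource_group}"
```

No single check is trusted on its own. A wrong subscription, a production-style name, or a missing tag each stop the run independently, and forgetting the apply flag costs nothing.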


7. Why Every Production Change Should Require What-If Analysis #

The Problem: “Why did the deployment recreate our entire VNet?” By the time you ask, it’s too late.

The Solution: Mandatory what-if analysis as a pipeline stage before all production deployments.

Key Takeaway: Visibility before action prevents mistakes. Make change impact visible before making changes irreversible.
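One way a what-if stage could gate a pipeline is to parse the what-if result and stop when it contains destructive changes. The sketch below assumes the result is available as JSON with a `changes` list whose entries carry `resourceId` and `changeType`; treat that shape, and the choice to block only on `Delete`, as assumptions to verify against your own what-if output.

```python
import json

# Change types that should stop the pipeline pending manual approval (assumed policy).
BLOCKING_CHANGE_TYPES = frozenset({"Delete"})


def blocking_changes(what_if_json, blocking=BLOCKING_CHANGE_TYPES):
    """Return resource IDs whose predicted change type is in the blocking set."""
    result = json.loads(what_if_json)
    return [
        change["resourceId"]
        for change in result.get("changes", [])
        if change.get("changeType") in blocking
    ]


def gate(what_if_json):
    """Return (ok, message); ok is False when any blocking change is predicted."""
    flagged = blocking_changes(what_if_json)
    if flagged:
        return False, "Blocked, review required for: " + ", ".join(flagged)
    return True, "No blocking changes predicted"
```

A pipeline step would fail the stage whenever `gate` returns `False`, which forces someone to look at the predicted change before it happens.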


The Architecture #

The final architecture implements:

Centralized Hub (West US Primary) #

  • Azure Virtual WAN with dual hubs (production + SD-WAN)
  • Azure Firewall Premium with threat intelligence, TLS inspection, IDPS
  • Point-to-Site VPN (Azure AD auth), Site-to-Site VPN with BGP, ExpressRoute
  • Active Directory domain controllers, Azure Bastion, Private DNS Resolver
  • Azure Front Door with WAF, Application Gateway

Distributed Spokes (Per Business Unit) #

  • Dedicated VNets per workload (GMR, IAG, GNT, AZF, VDI, Management)
  • Dev, QA, and Production environments in separate subscriptions
  • Databricks integration, API Management, AVD host pools
  • Cross-region DR environments in East US

Automation & Operations #

  • Azure DevOps pipelines with validation, what-if, and deployment stages
  • Automated resource deallocation for non-production environments
  • Real-time deployment progress tracking
  • Resource locks, policy enforcement, standardized tagging

Measurable Impact #

  • Deployment Reliability: 0 token expiration failures since implementing the refresh strategy
  • Cost Reduction: ~70% reduction in dev environment costs through automated cleanup
  • Deployment Speed: Hub-and-spoke deployment time reduced from 6 hours to 2.25 hours (about 63% less time)
  • Change Safety: 100% of production changes require what-if approval
  • Team Velocity: 6 spoke networks deployed and maintained by 1 engineer using automation

Technologies & Skills Demonstrated #

Cloud & Infrastructure:

  • Azure Virtual WAN, Virtual Hub, hub-spoke networking
  • Azure Firewall Premium, ExpressRoute, VPN with BGP
  • Azure Bastion, Private DNS Resolver, Front Door, Application Gateway
  • Multi-region disaster recovery architecture

Infrastructure as Code:

  • Bicep (1,258 files of templates, parameters, modules)
  • ARM template deployment at subscription scope
  • Azure Verified Modules (AVM) integration
  • Parameterization strategies for multi-environment deployments

CI/CD & Automation:

  • Azure DevOps YAML pipelines (multi-stage, parallel execution)
  • Dependency orchestration (hub-first, spoke parallelization)
  • Scheduled pipeline execution (cost optimization)
  • Azure CLI automation with PowerShell/Bash

Security & Governance:

  • Service principal authentication with token refresh
  • Multi-subscription isolation strategies
  • Resource locking, Azure Policy, RBAC
  • Secrets management with Azure Key Vault

The Bigger Picture: Infrastructure as a Product #

This journey taught me that enterprise infrastructure isn’t just about deploying resources—it’s about building a platform that enables teams.

Good infrastructure is:

  • Reliable: Handles failure gracefully with retry logic and observability
  • Secure: Built-in guardrails prevent mistakes and enforce governance
  • Cost-Effective: Automates optimization without sacrificing developer experience
  • Maintainable: Clear parameterization and documentation reduce cognitive load
  • Auditable: What-if analysis and change tracking provide accountability

Start Reading #

Begin with Part 1: Token Expiration in Azure Deployments, or jump to whichever topic interests you most. Each post stands alone while building on common themes.

This series reflects real production experience managing enterprise Azure infrastructure. All metrics and examples are based on actual implementations.