azure-avd-terraform-accelerator

Azure Virtual Desktop - Best Practices & Recommendations

🎯 Executive Summary

Your AVD deployment is already implementing many best practices! This document outlines additional enhancements to achieve enterprise-grade production readiness.


βœ… What You’re Already Doing Well

Feature Status Implementation
Modular Architecture βœ… Complete 6 specialized modules for maintainability
Network Flexibility βœ… Complete 3 deployment scenarios (new/existing VNet/subnet)
Availability Zones βœ… Complete Session hosts distributed across zones
Monitoring βœ… Complete Log Analytics with VM Insights agents
Auto-Shutdown βœ… Complete Cost optimization for dev/test
Private Endpoints βœ… Complete Secure storage connectivity
Key Vault βœ… Complete Secure credential management
Scaling Plans βœ… Complete Dynamic capacity management
Domain Join Options βœ… Complete Both Azure AD and AD DS support
Simplified Deployment βœ… Complete Deploy to Azure button + wizard

1. Azure Backup for Session Hosts ⭐ HIGH PRIORITY

Why: Protect against data loss, accidental deletion, or ransomware.

Implementation: New backup module created

Usage in main.tf:

module "backup" {
  source = "./modules/backup"

  enable_backup           = var.enable_backup
  recovery_vault_name     = "${local.resource_prefix}-rsv"
  resource_group_name     = azurerm_resource_group.main.name
  location                = var.location
  
  # Backup schedule
  backup_frequency        = "Daily"
  backup_time             = "02:00"
  backup_timezone         = var.timezone
  
  # Retention
  daily_retention_count   = 30
  weekly_retention_count  = 12
  monthly_retention_count = 12
  yearly_retention_count  = 5
  
  # VMs to protect
  vm_ids = module.compute.vm_ids
  
  tags = local.common_tags
}

Benefits:


Why: Standardize session host images, accelerate deployments, ensure consistency.

Implementation: New image_gallery module created

Usage in main.tf:

module "image_gallery" {
  source = "./modules/image_gallery"

  enable_image_gallery       = var.enable_image_gallery
  gallery_name               = replace("${local.resource_prefix}_sig", "-", "_")
  resource_group_name        = azurerm_resource_group.main.name
  location                   = var.location
  
  create_win11_definition    = true
  create_win10_definition    = false
  
  # Grant CI/CD pipeline access
  image_builder_principal_id = var.image_builder_principal_id
  
  tags = local.common_tags
}

Update compute module to use custom images:

# In modules/compute/variables.tf, add:
variable "custom_image_id" {
  description = "Custom image ID from Shared Image Gallery"
  type        = string
  default     = ""
}

# In modules/compute/main.tf, modify source_image_reference:
dynamic "source_image_reference" {
  for_each = var.custom_image_id == "" ? [1] : []
  content {
    publisher = "MicrosoftWindowsDesktop"
    offer     = "Windows-11"
    sku       = "win11-22h2-avd"
    version   = "latest"
  }
}

source_image_id = var.custom_image_id != "" ? var.custom_image_id : null

Benefits:


3. Azure Policy for Governance ⭐ MEDIUM PRIORITY

Why: Enforce compliance, security standards, and operational best practices.

Implementation: New policy module created

Usage in main.tf:

module "policy" {
  source = "./modules/policy"
  
  enable_policies            = var.enable_policies
  resource_group_id          = azurerm_resource_group.main.id
  location                   = var.location
  
  # Governance
  require_environment_tag    = true
  allowed_vm_sizes           = var.allowed_vm_sizes
  
  # Security
  deploy_antimalware         = true
  audit_disk_encryption      = true
  
  # Monitoring
  enable_vm_diagnostics      = true
  log_analytics_workspace_id = module.monitoring.workspace_id
}

Policies Enforced:

Benefits:


4. FSLogix Profile Container Backup ⭐ HIGH PRIORITY

Why: User profiles contain critical data and settings.

Implementation: Add to storage module

Add to modules/storage/main.tf:

# Backup vault for storage account
resource "azurerm_data_protection_backup_vault" "main" {
  count               = var.enable_fslogix_backup ? 1 : 0
  name                = "${var.storage_account_name}-bvault"
  resource_group_name = var.resource_group_name
  location            = var.location
  datastore_type      = "VaultStore"
  redundancy          = "LocallyRedundant"
  
  tags = var.tags
}

# Backup policy for Azure Files
resource "azurerm_data_protection_backup_policy_blob_storage" "main" {
  count               = var.enable_fslogix_backup ? 1 : 0
  name                = "fslogix-backup-policy"
  vault_id            = azurerm_data_protection_backup_vault.main[0].id
  retention_duration  = "P30D"
}

# Backup instance for storage account
resource "azurerm_data_protection_backup_instance_blob_storage" "main" {
  count              = var.enable_fslogix_backup ? 1 : 0
  name               = "${var.storage_account_name}-backup"
  vault_id           = azurerm_data_protection_backup_vault.main[0].id
  location           = var.location
  storage_account_id = azurerm_storage_account.main.id
  backup_policy_id   = azurerm_data_protection_backup_policy_blob_storage.main[0].id
}

Benefits:


5. Update Management with Update Rings ⭐ MEDIUM PRIORITY

Strategy: Deploy updates in waves to minimize risk.

Implementation:

# In modules/compute/variables.tf
variable "update_ring" {
  description = "Update ring for patching strategy"
  type        = string
  default     = "production"
  validation {
    condition     = contains(["pilot", "testing", "production"], var.update_ring)
    error_message = "Update ring must be pilot, testing, or production."
  }
}

# In modules/compute/main.tf
resource "azurerm_virtual_machine_extension" "update_management" {
  count                = var.enable_update_management ? var.vm_count : 0
  name                 = "UpdateManagement"
  virtual_machine_id   = azurerm_windows_virtual_machine.main[count.index].id
  publisher            = "Microsoft.SoftwareUpdateManagement"
  type                 = "WindowsOSUpdateExtension"
  type_handler_version = "1.0"

  settings = jsonencode({
    updateRing         = var.update_ring
    autoUpdateEnabled  = true
    maintenanceWindow  = var.update_ring == "pilot" ? "Sunday 02:00" : var.update_ring == "testing" ? "Tuesday 02:00" : "Friday 02:00"
  })
}

Update Rings:

Benefits:


6. Cost Optimization - Azure Advisor Integration ⭐ MEDIUM PRIORITY

Add to modules/monitoring/main.tf:

# Cost Management Exports
resource "azurerm_cost_management_export" "daily" {
  count                      = var.enable_cost_exports ? 1 : 0
  name                       = "avd-daily-costs"
  resource_group_id          = var.resource_group_id
  
  recurrence_type            = "Daily"
  recurrence_period_start_date = timestamp()
  recurrence_period_end_date   = timeadd(timestamp(), "8760h") # 1 year
  
  delivery_info {
    storage_account_id = var.cost_export_storage_account_id
    container_name     = "cost-exports"
    root_folder_path   = "/avd"
  }
  
  query {
    type       = "Usage"
    time_frame = "MonthToDate"
  }
}

# Workbook for Cost Analysis
resource "azurerm_application_insights_workbook" "cost_analysis" {
  count               = var.enable_cost_workbook ? 1 : 0
  name                = "AVD-Cost-Analysis"
  resource_group_name = var.resource_group_name
  location            = var.location
  display_name        = "AVD Cost Analysis"
  
  data_json = jsonencode({
    version = "Notebook/1.0"
    items = [
      {
        type = 3
        content = {
          version = "KqlItem/1.0"
          query = <<-EOT
            AzureDiagnostics
            | where ResourceType == "DESKTOPS"
            | summarize Cost=sum(todouble(Cost_USD)) by bin(TimeGenerated, 1d), ResourceId
            | render timechart
          EOT
        }
      }
    ]
  })
  
  tags = var.tags
}

Benefits:


7. Session Host Replacement Strategy ⭐ HIGH PRIORITY

Why: Fresh VMs prevent configuration drift and security issues.

Implementation: Add scheduled replacement

Add to modules/compute/main.tf:

# Random rotation for VM lifecycle
resource "time_rotating" "vm_lifecycle" {
  rotation_days = var.vm_replacement_days
}

# Add lifecycle meta-argument to VMs
resource "azurerm_windows_virtual_machine" "main" {
  # ... existing config ...
  
  lifecycle {
    create_before_destroy = true
    replace_triggered_by = [
      time_rotating.vm_lifecycle.id
    ]
  }
  
  tags = merge(
    var.tags,
    {
      "ProvisionedDate" = timestamp()
      "ReplacementDate" = timeadd(timestamp(), "${var.vm_replacement_days * 24}h")
    }
  )
}

Recommendation: Replace session hosts every 60-90 days

Benefits:


8. Multi-Region DR (Disaster Recovery) ⭐ MEDIUM PRIORITY

Strategy: Deploy secondary AVD environment in paired region.

High-Level Architecture:

Primary Region (East US)          Secondary Region (West US)
β”œβ”€β”€ Host Pool (Active)      <-->  β”œβ”€β”€ Host Pool (Standby)
β”œβ”€β”€ Session Hosts (10)            β”œβ”€β”€ Session Hosts (3)
β”œβ”€β”€ Storage (FSLogix)       --->  β”œβ”€β”€ Storage (Replica)
└── Workspace                     └── Workspace (Hidden)

Implementation Approach:

Option A: Active-Standby

Option B: Active-Active (Geo-Distribution)

Key Components:

# In main.tf
module "avd_secondary" {
  source = "./modules/avd"
  
  location            = var.secondary_region
  resource_group_name = azurerm_resource_group.secondary.name
  host_pool_name      = "${local.resource_prefix}-secondary-hp"
  
  # Minimal capacity for DR
  validation_environment = false
  preferred_app_group_type = "Desktop"
  
  tags = merge(local.common_tags, {
    Role = "DisasterRecovery"
  })
}

# Storage replication
module "storage_secondary" {
  source = "./modules/storage"
  
  location                = var.secondary_region
  resource_group_name     = azurerm_resource_group.secondary.name
  account_replication_type = "GZRS" # Geo-zone-redundant
  
  tags = local.common_tags
}

Benefits:


9. RBAC and Just-in-Time Access ⭐ MEDIUM PRIORITY

Add to modules/security/main.tf:

# AVD User Role Assignment
resource "azurerm_role_assignment" "avd_users" {
  count                = var.enable_rbac ? 1 : 0
  scope                = var.workspace_id
  role_definition_name = "Desktop Virtualization User"
  principal_id         = var.avd_users_group_id
}

# AVD Admin Role Assignment
resource "azurerm_role_assignment" "avd_admins" {
  count                = var.enable_rbac ? 1 : 0
  scope                = var.resource_group_id
  role_definition_name = "Desktop Virtualization Contributor"
  principal_id         = var.avd_admins_group_id
}

# JIT VM Access (Azure Security Center)
resource "azurerm_security_center_jit_network_access_policy" "main" {
  count               = var.enable_jit_access ? 1 : 0
  name                = "avd-jit-policy"
  resource_group_name = var.resource_group_name
  location            = var.location
  kind                = "Basic"

  virtual_machine {
    virtual_machine_id = var.vm_id
    
    port {
      number   = 3389
      protocol = "TCP"
      max_request_access_duration = "PT3H"
    }
  }
}

Benefits:


10. Advanced Monitoring & Alerting ⭐ HIGH PRIORITY

Add to modules/monitoring/main.tf:

# Alert Rule - Session Host Unavailable
resource "azurerm_monitor_metric_alert" "session_host_down" {
  count               = var.enable_advanced_alerts ? 1 : 0
  name                = "avd-session-host-unavailable"
  resource_group_name = var.resource_group_name
  scopes              = [var.host_pool_id]
  description         = "Alert when session host becomes unavailable"
  severity            = 1
  frequency           = "PT5M"
  window_size         = "PT15M"

  criteria {
    metric_namespace = "Microsoft.DesktopVirtualization/hostpools"
    metric_name      = "SessionHostHealthCheckFailurePercent"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 10
  }

  action {
    action_group_id = azurerm_monitor_action_group.main.id
  }
}

# Alert Rule - High CPU Usage
resource "azurerm_monitor_metric_alert" "high_cpu" {
  count               = var.enable_advanced_alerts ? 1 : 0
  name                = "avd-high-cpu-usage"
  resource_group_name = var.resource_group_name
  scopes              = var.vm_ids
  description         = "Alert when CPU exceeds 80%"
  severity            = 2
  frequency           = "PT5M"
  window_size         = "PT30M"

  criteria {
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 80
  }

  action {
    action_group_id = azurerm_monitor_action_group.main.id
  }
}

# Alert Rule - Storage Low Space
resource "azurerm_monitor_metric_alert" "storage_low_space" {
  count               = var.enable_advanced_alerts ? 1 : 0
  name                = "avd-storage-low-space"
  resource_group_name = var.resource_group_name
  scopes              = [var.storage_account_id]
  description         = "Alert when FSLogix storage exceeds 80%"
  severity            = 2
  frequency           = "PT15M"
  window_size         = "PT1H"

  criteria {
    metric_namespace = "Microsoft.Storage/storageAccounts"
    metric_name      = "UsedCapacity"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 858993459200 # 800GB (80% of 1TB)
  }

  action {
    action_group_id = azurerm_monitor_action_group.main.id
  }
}

# Action Group for Alerts
resource "azurerm_monitor_action_group" "main" {
  count               = var.enable_advanced_alerts ? 1 : 0
  name                = "avd-alerts"
  resource_group_name = var.resource_group_name
  short_name          = "avdalerts"

  email_receiver {
    name                    = "Admin Email"
    email_address           = var.alert_email
    use_common_alert_schema = true
  }

  webhook_receiver {
    name        = "Teams Webhook"
    service_uri = var.teams_webhook_url
  }
}

Alerts:


πŸ“Š Implementation Priority Matrix

Priority Feature Impact Complexity Timeline
πŸ”΄ HIGH Azure Backup High Low 1 week
πŸ”΄ HIGH Shared Image Gallery High Medium 2 weeks
πŸ”΄ HIGH FSLogix Backup High Low 1 week
πŸ”΄ HIGH Session Host Replacement Medium Low 1 week
πŸ”΄ HIGH Advanced Monitoring High Medium 1 week
🟑 MEDIUM Azure Policy Medium Low 1 week
🟑 MEDIUM Update Rings Medium Medium 2 weeks
🟑 MEDIUM Cost Optimization Medium Low 1 week
🟑 MEDIUM RBAC & JIT Medium Low 1 week
🟑 MEDIUM Multi-Region DR High High 4 weeks

🎯 Quick Wins (Implement This Week)

  1. Enable Azure Backup - 1 hour setup, automated protection
  2. Add Advanced Alerts - 2 hours, immediate visibility
  3. Implement RBAC - 1 hour, improved security
  4. Enable Azure Policy - 2 hours, compliance enforcement

πŸ“ Variables to Add to main.tf

# Backup
variable "enable_backup" {
  description = "Enable Azure Backup for session hosts"
  type        = bool
  default     = true
}

# Image Gallery
variable "enable_image_gallery" {
  description = "Enable Shared Image Gallery"
  type        = bool
  default     = false
}

variable "image_builder_principal_id" {
  description = "Service principal ID for image building"
  type        = string
  default     = ""
}

# Policy
variable "enable_policies" {
  description = "Enable Azure Policy governance"
  type        = bool
  default     = true
}

variable "allowed_vm_sizes" {
  description = "Allowed VM sizes for session hosts"
  type        = list(string)
  default     = ["Standard_D2s_v5", "Standard_D4s_v5", "Standard_D8s_v5"]
}

# Monitoring
variable "enable_advanced_alerts" {
  description = "Enable advanced monitoring alerts"
  type        = bool
  default     = true
}

variable "alert_email" {
  description = "Email for alert notifications"
  type        = string
  default     = ""
}

# DR
variable "enable_dr" {
  description = "Enable disaster recovery in secondary region"
  type        = bool
  default     = false
}

variable "secondary_region" {
  description = "Azure region for disaster recovery"
  type        = string
  default     = "westus2"
}

πŸš€ Next Steps

Phase 1: Security & Compliance (Week 1-2)

  1. βœ… Enable Azure Backup module
  2. βœ… Implement Azure Policy governance
  3. βœ… Configure RBAC and JIT access
  4. βœ… Add advanced monitoring alerts

Phase 2: Operational Excellence (Week 3-4)

  1. βœ… Deploy Shared Image Gallery
  2. βœ… Implement update rings
  3. βœ… Configure session host replacement
  4. βœ… Enable FSLogix backup

Phase 3: Business Continuity (Week 5-8)

  1. βœ… Design multi-region architecture
  2. βœ… Deploy secondary region (standby)
  3. βœ… Test failover procedures
  4. βœ… Document DR runbooks

πŸ“š Additional Resources


πŸŽ‰ Summary

Your implementation is production-ready with the current features! The recommendations above will take you from good to exceptional by adding:

βœ… Business Continuity: Backup, DR, session host replacement
βœ… Operational Excellence: Golden images, update rings, monitoring
βœ… Security & Compliance: Azure Policy, RBAC, JIT access
βœ… Cost Optimization: Cost tracking, auto-shutdown, right-sizing

Estimated effort: 6-8 weeks for full implementation
ROI: Reduced downtime, faster recovery, improved compliance, lower costs