Part 1 - Deploy Synapse Apache Spark Pool and install PyPI packages

Azure Synapse Analytics provides powerful Apache Spark pools for big data processing and machine learning workloads. While creating these resources manually through the Azure portal is straightforward, infrastructure as code (IaC) approaches like Bicep provide significant benefits for repeatability, version control, and automation.

🚀 In this post, I’ll walk through how to use Bicep to deploy a Spark pool to an existing Synapse workspace and manage its Python packages.

Prerequisites

Before we begin, make sure you have:

  • An Azure subscription.
  • An existing Azure Synapse Analytics workspace.
  • The latest version of the Azure CLI installed.
  • Basic familiarity with Bicep templates.

Synapse Spark Pool Package Management

Spark pools in Azure Synapse provide a serverless Spark environment for big data processing. Some key points to understand:

  • Spark pools are billed based on vCore-hours used.
  • You might need packages from a public registry that aren't included in the default packages of the Azure Synapse Runtime for Apache Spark.
  • They support various node sizes and auto-scaling configurations.
  • Custom libraries can be installed at the pool level or at the session level; session-level library changes aren't persisted between sessions (a session-scoped example follows this list).
  • Library management is crucial for maintaining consistency across environments.
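
To illustrate the session-level option: Synapse notebooks on recent Spark runtimes support session-scoped Python packages through %pip magic commands (session-level packages are enabled by default in the module shown later in this post). A minimal notebook-cell sketch; the package and version are only examples:

# Installs a package for the current Spark session only; the pool configuration is unchanged.
%pip install polars==1.23.0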

Libraries provide reusable code that you might want to include in your programs or projects for Apache Spark in Azure Synapse Analytics (Azure Synapse Spark).

You might need to update your serverless Apache Spark pool environment for various reasons. For example, you might find that:

  • One of your core dependencies released a new version.
  • You need an extra package for training your machine learning model or preparing your data.
  • A better package is available, and you no longer need the older package.
  • Your team has built a custom Python .whl package that you need available in your Apache Spark pool.

There are two primary ways to install a library on a Spark pool:

  1. Install a workspace library that’s been uploaded as a workspace package.
  2. Provide a requirements.txt or Conda environment.yml environment specification file to install Python packages from repositories such as PyPI or Conda-Forge (a minimal environment.yml sketch follows this list).
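
The rest of this post uses the requirements.txt route, but for reference the Conda alternative is a standard environment.yml specification. A minimal sketch with illustrative package versions:

name: example-env
dependencies:
  - pip
  - pip:
      - polars==1.23.0
      - pandas==2.2.3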

Next, I'll show how to manage the Python libraries that your Spark pool loads from PyPI, using a requirements.txt file and a Bicep template.

Spark Pool Bicep Template

Below is a snippet from a Bicep template that deploys a Spark pool to an existing Synapse workspace and configures which PyPI libraries it makes available to developers (you can find the complete module template at the end of the post).

// For brevity, only the parameters relevant to library requirements are included.
param workspaceName string
param name string
// library requirements - by default loads the content from requirements.txt in the same folder.
param libraryRequirements object = {
  filename: 'requirements.txt'
  content: loadTextContent('./requirements.txt')
}

// Existing parent workspace resource
resource workspace 'Microsoft.Synapse/workspaces@2021-06-01' existing = {
  name: workspaceName
}

// Create Apache Spark Pool
resource bigDataPool 'Microsoft.Synapse/workspaces/bigDataPools@2021-06-01' = {
  name: name
  parent: workspace
  location: resourceGroup().location
  properties: {
    libraryRequirements: libraryRequirements
    // Other required properties omitted for brevity
  }
}

modules/synapse/big-data-pools/main.bicep

This template creates a Spark pool (the full module enables auto-scaling by default) and specifies its Python libraries through the libraryRequirements property, which is populated from an external requirements.txt file rather than hard-coded in the template.

External Requirements File

The Bicep module modules/synapse/big-data-pools/main.bicep expects you to provide the library requirements as a requirements.txt file in the same folder as the module.

It has a libraryRequirements parameter that accepts the list of packages to load; by default it uses the Bicep function loadTextContent to load the requirements from the external file.

param libraryRequirements object = {
  filename: 'requirements.txt'
  content: loadTextContent('./requirements.txt')
}

The requirements file lists the libraries to install as package==version entries, one per line, for example:

polars==1.23.0
pandas==2.2.3
snowflake-connector-python==3.14.0

This approach keeps your Bicep template cleaner and makes it easier to manage library versions separately.
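
If you maintain separate package sets per environment, the same parameter can point at a different local file. A sketch; the file name requirements.dev.txt is hypothetical:

// Hypothetical: load a dev-specific package list; filename stays as in the default.
param libraryRequirements object = {
  filename: 'requirements.txt'
  content: loadTextContent('./requirements.dev.txt')
}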

Deploying the Template
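
The commands below pass synapseWorkspaceName and sparkPoolName parameters, so they assume a small consuming template named main.bicep that wraps the module from this post. A minimal sketch of such a wrapper; the module path follows the folder layout used here:

// main.bicep - a thin wrapper that forwards parameters to the Spark pool module.
param synapseWorkspaceName string
param sparkPoolName string

module sparkPool 'modules/synapse/big-data-pools/main.bicep' = {
  name: 'spark-pool-deployment'
  params: {
    workspaceName: synapseWorkspaceName
    name: sparkPoolName
    // All other parameters (node size, auto-scaling, libraryRequirements, ...) use the module defaults.
  }
}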

Run the following Azure CLI command in a PowerShell session from the folder that contains the main.bicep template:

az deployment group create `
  -g "<your existing resource group name>" `
  -f "main.bicep" `
  --parameters `
    "synapseWorkspaceName=<your existing workspace name>" `
    "sparkPoolName=<your spark pool name>"

or in a bash shell:

az deployment group create \
  -g "<your existing resource group name>" \
  -f "main.bicep" \
  --parameters \
    "synapseWorkspaceName=<your existing workspace name>" \
    "sparkPoolName=<your spark pool name>"

❗NOTE: When adding or removing PyPI packages, the deployment might take a long time. The deployment triggers a system job to install/uninstall and cache the specified libraries; this process helps reduce overall session startup time.

After deployment, you should see the requirements.txt file in the Packages section of your Spark pool in the Azure portal.
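
You can also confirm the configured library requirements from the command line. A quick check with the Azure CLI, shown in bash form (the --query path assumes the libraryRequirements property is returned at the top level of the output):

az synapse spark pool show \
  -g "<your existing resource group name>" \
  --workspace-name "<your existing workspace name>" \
  --name "<your spark pool name>" \
  --query "libraryRequirements"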

Azure Portal - packages section

References

modules/synapse/big-data-pools/main.bicep

This module deploys an Azure Synapse Apache Spark pool to an existing Azure Synapse workspace. It assumes that there is a requirements.txt file in the same folder that contains all the PyPI libraries that need to be installed on the pool.
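
The expected folder layout, based on the module path and the default loadTextContent call, looks like this:

modules/
  synapse/
    big-data-pools/
      main.bicep          // the module shown below
      requirements.txt    // PyPI packages loaded via loadTextContent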

metadata name = 'Synapse Analytics Apache Spark Pools'
metadata description = 'This module deploys an Apache Spark Pool for existing Azure Synapse Workspace.'
metadata owner = 'cloud-center-of-excellence/bicep-module-owners'

@description('Required. The name of the existing parent Synapse Workspace.')
param workspaceName string

// General
@minLength(3)
@maxLength(15)
@description('Required. Name of the Azure Spark Pool. Valid characters: Alphanumerics. Max length of 15 characters.')
param name string

@description('Optional. The geo-location where the resource lives.')
param location string = resourceGroup().location

@description('Optional. Tags of the resource.')
param tags object = {}

// AutoPauseProperties
@description('Optional. Number of minutes of idle time before the Big Data pool is automatically paused.')
param autoPauseDelayInMinutes int = 15
@description('Optional. Whether the Big Data pool is auto-paused.')
param autoPauseEnabled bool = true

// AutoScaleProperties
@description('Optional. Whether the Big Data pool is auto-scaled.')
param autoScaleEnabled bool = true
@minValue(3)
@maxValue(200)
@description('Optional. The maximum number of nodes the Big Data pool can have.')
param autoScaleMaxNodeCount int = 3
@minValue(3)
@maxValue(200)
@description('Optional. The minimum number of nodes the Big Data pool can have.')
param autoScaleMinNodeCount int = 3

// CustomLibraries (custom .whl or .jar files)
@description('Optional. List of custom libraries/packages associated with the spark pool.')
param customLibraries array = []

@description('Optional. The default folder where Spark logs will be written. Default value is "spark-logs".')
param defaultSparkLogFolder string = 'spark-logs'

@description('Optional. Whether autotune is enabled.')
param isAutotuneEnabled bool = true

@description('Optional. Whether compute isolation is enabled.')
param isComputeIsolationEnabled bool = false

@description('Optional. The library requirements. Default loads the content from requirements.txt file at the same folder as this module.')
param libraryRequirements object = {
  filename: 'requirements.txt'
  content: loadTextContent('./requirements.txt')
}

@minValue(3)
@maxValue(200)
@description('Optional. The number of nodes in the Big Data pool. Defaults to 3.')
param nodeCount int = 3

@allowed([
  'Large'
  'Medium'
  'None'
  'Small'
  'XLarge'
  'XXLarge'
  'XXXLarge'
])
@description('Optional. The level of compute power that each node in the Big Data pool has.')
param nodeSize string = 'Small'

@allowed([
  'HardwareAcceleratedFPGA'
  'HardwareAcceleratedGPU'
  'MemoryOptimized'
])
@description('Optional. The kind of nodes that the Big Data pool provides.')
param nodeSizeFamily string = 'MemoryOptimized'

@description('Optional. Whether session level packages enabled.')
param sessionLevelPackagesEnabled bool = true

@description('Optional. Whether dynamic executor allocation is enabled.')
param dynamicExecutorAllocationEnabled bool = true
@description('Optional. The maximum number of executors the Big Data pool can have. This value must be less than autoScaleMaxNodeCount.')
param dynamicExecutorAllocationMaxExecutors int = autoScaleMaxNodeCount - 1
@description('Optional. The minimum number of executors the Big Data pool can have.')
param dynamicExecutorAllocationMinExecutors int = 1

// SparkConfigProperties
@description('Optional. Spark configuration file to specify additional properties.')
param sparkConfigProperties object = {}

@description('Optional. The Spark events folder.')
param sparkEventsFolder string = 'spark-events'

@allowed([
  '3.4'
])
@description('Optional. The version of Spark. Default value is 3.4.')
param sparkVersion string = '3.4'

// Existing parent resource
resource workspace 'Microsoft.Synapse/workspaces@2021-06-01' existing = {
  name: workspaceName
}

// Create Apache Spark Pool
resource bigDataPool 'Microsoft.Synapse/workspaces/bigDataPools@2021-06-01' = {
  name: name
  location: location
  tags: tags
  parent: workspace
  properties: {
    autoPause: {
      delayInMinutes: autoPauseDelayInMinutes
      enabled: autoPauseEnabled
    }
    autoScale: {
      enabled: autoScaleEnabled
      maxNodeCount: autoScaleMaxNodeCount
      minNodeCount: autoScaleMinNodeCount
    }
    customLibraries: customLibraries
    defaultSparkLogFolder: defaultSparkLogFolder
    dynamicExecutorAllocation: {
      enabled: dynamicExecutorAllocationEnabled
      maxExecutors: dynamicExecutorAllocationMaxExecutors
      minExecutors: dynamicExecutorAllocationMinExecutors
    }
    isAutotuneEnabled: isAutotuneEnabled
    isComputeIsolationEnabled: isComputeIsolationEnabled
    libraryRequirements: libraryRequirements
    nodeCount: nodeCount
    nodeSize: nodeSize  
    nodeSizeFamily: nodeSizeFamily
    sessionLevelPackagesEnabled: sessionLevelPackagesEnabled
    sparkConfigProperties: sparkConfigProperties
    sparkEventsFolder: sparkEventsFolder 
    sparkVersion: sparkVersion
  }
}

output sparkPoolName string = bigDataPool.name
output sparkPoolId string = bigDataPool.id

modules/synapse/big-data-pools/main.bicep