Configuring an Azure VNET to use AZTK in mixed mode

Wed, 20 Jun 2018 13:31:32 +0200

In my last post, I showed you how to provision a low-cost Apache Spark cluster on Microsoft Azure, with the help of the Azure Batch service, Low Priority Virtual Machines, and the Azure Distributed Data Engineering Toolkit (AZTK).

But have you tried to create a cluster that mixes dedicated and low-priority virtual machines?

If you did, you probably ran into an error…

Mixing Dedicated and Low-Priority Virtual Machines

I tried to provision a Spark cluster with the following command:

aztk spark cluster create --id mycluster --size 1 --size-low-priority 2 --vm-size standard_d12_v2

But all I got was the following error message:

You must configure a VNET to use AZTK in mixed mode (dedicated and low priority nodes)
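
The rule behind this message can be sketched in a few lines of Python. This is only an illustration of the condition described in this post, not AZTK's actual implementation; the function name mixed_mode_error is hypothetical.

```python
from typing import Optional

# Message text as quoted in this post.
MIXED_MODE_MESSAGE = (
    "You must configure a VNET to use AZTK in mixed mode "
    "(dedicated and low priority nodes)"
)


def mixed_mode_error(size: int, size_low_priority: int,
                     subnet_id: Optional[str] = None) -> Optional[str]:
    """Return the error message for an invalid request, or None if it is fine.

    A request counts as "mixed mode" when it asks for at least one dedicated
    node and at least one low-priority node at the same time.
    """
    is_mixed = size > 0 and size_low_priority > 0
    if is_mixed and not subnet_id:
        return MIXED_MODE_MESSAGE
    return None


# The command from this post: 1 dedicated node, 2 low-priority nodes, no subnet.
print(mixed_mode_error(size=1, size_low_priority=2))
```

A cluster that uses only one type of VM (for example --size 0 --size-low-priority 5) does not trigger the check.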

What do I need a mixed-mode cluster for?

But let me first start with the why.

Azure offers low-priority virtual machines (VMs) to reduce the cost of your workloads. Low-priority VMs make new types of workloads possible by enabling a large amount of compute power to be used for a very low cost.

Low-priority VMs take advantage of surplus capacity in Azure. When you specify low-priority VMs in your cluster, Azure can use this surplus, when available.

The tradeoff for using low-priority VMs is that those VMs may not be available to be allocated or may be preempted at any time, depending on available capacity.

Dedicated VMs stay online all the time to process your Spark jobs, even if some of the low-priority VMs are offline. That’s why you might want to add dedicated VMs to your cluster too.

If you have a mixed-mode Spark cluster with both types of VMs, the master node will always be assigned to a dedicated VM.

AZTK Spark Cluster running in mixed mode

Creating an Azure Virtual Network

So the first thing I did was create an Azure Virtual Network in the AZTK resource group (see How to create a virtual network using the Azure portal).

After that, I opened the Properties section of the newly created VNET and copied the Resource ID to my clipboard.

Azure Virtual Network - Properties

Update cluster.yaml file

I opened a Terminal window and used vim to edit the cluster.yaml file in the .aztk folder.

In that file, there is a section called “To add your cluster to a virtual network provide the full arm resource id below”.

I removed the comment symbol from the line which contains the “subnet_id:” property and added the Resource ID from the Virtual Network.

I also added the name of the subnet with the /subnets/ prefix.

subnet_id: /subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/aztk/providers/Microsoft.Network/virtualNetworks/aztk-vnet/subnets/default

.aztk/cluster.yaml Subnet Settings
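
If you prefer to assemble the value instead of copying it from the portal, the structure of the ID can be sketched as follows. The function name subnet_resource_id is hypothetical, and the subscription ID, resource group, VNET name, and subnet name are the placeholder values from the example above.

```python
def subnet_resource_id(subscription_id: str, resource_group: str,
                       vnet_name: str, subnet_name: str) -> str:
    """Build the full ARM resource ID of a subnet inside a virtual network."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet_name}"
        f"/subnets/{subnet_name}"
    )


# Placeholder values matching the example in this post.
print("subnet_id: " + subnet_resource_id(
    "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX", "aztk", "aztk-vnet", "default"))
```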

Create a cluster

After saving the changes to the cluster.yaml file, I was able to create a mixed-mode cluster with the toolkit.

Creating a mixed-mode Spark cluster

How to create a low-cost Apache Spark cluster on Microsoft Azure

Sat, 16 Jun 2018 12:18:17 +0200

A few months ago, I found a nice little open-source tool on GitHub called AZTK, which provides a fast and easy way to provision low-cost Apache Spark clusters on Microsoft Azure.

In this blog post, I would like to show you how to install the Azure Distributed Data Engineering Toolkit (AZTK) on your Windows-, Linux- or macOS-based system, and how to provision your first Apache Spark cluster with it.

Azure Distributed Data Engineering Toolkit (AZTK)

The Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It’s a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.

This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

For more details, please have a look at the project on GitHub.

Notable Features

Spark cluster provision time of 5 minutes on average
Spark clusters run in Docker containers
Run Spark on a GPU-enabled cluster
Users can bring their own Docker image
Ability to use low-priority VMs for an 80% discount
Mixed Mode clusters that use both low-priority and dedicated VMs
Built-in support for Azure Blob Storage and Azure Data Lake connection
Tailored pythonic experience with PySpark, Jupyter, and Anaconda
Tailored R experience with SparklyR, RStudio-Server, and Tidyverse
Ability to run spark submit directly from your local machine’s CLI

Getting Started

Install Python 3

Before you install the Azure Distributed Data Engineering Toolkit, you need Python 3, as well as pip3, installed on your system.

To do this, please have a look at Python.org.

Virtual Environment (optional)

After that, I recommend creating a separate virtual environment for the toolkit.

python3 -m venv aztk

Once you’ve created a virtual environment, you may activate it.

On Windows, run:

aztk\Scripts\activate.bat

On Unix or macOS, run:

source aztk/bin/activate
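
The location of the activation script depends on the platform: environments created with python3 -m venv place it in Scripts on Windows and in bin on Unix-like systems. A small sketch of that rule (the helper name activation_command is hypothetical):

```python
from pathlib import PurePosixPath, PureWindowsPath


def activation_command(env_dir: str, windows: bool) -> str:
    """Return the command that activates a venv created in env_dir."""
    if windows:
        # cmd.exe runs the batch file directly.
        return str(PureWindowsPath(env_dir) / "Scripts" / "activate.bat")
    # POSIX shells must source the script so it can modify the current shell.
    return "source " + str(PurePosixPath(env_dir) / "bin" / "activate")


print(activation_command("aztk", windows=True))
print(activation_command("aztk", windows=False))
```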

Install AZTK

Now you’re ready to install the Azure Distributed Data Engineering Toolkit (AZTK) with a simple:

pip install aztk

Initialize AZTK

After you have installed the toolkit, you’re ready to create your first AZTK environment. To do that, simply run:

aztk spark init

This command creates a .aztk folder in your current directory with the following file structure:

cluster.yaml
core-site.xml
jars
- .null
job.yaml
secrets.yaml
spark-defaults.conf
spark-env.sh
ssh.yaml

If you want to create a machine-wide configuration, add the --global parameter to the command.
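
To confirm that the initialization produced the files listed above, a quick check could look like the following sketch. The helper name missing_config_files is hypothetical, and the demo runs against a temporary directory rather than a real .aztk folder.

```python
import tempfile
from pathlib import Path

# Files listed in this post as the result of "aztk spark init".
EXPECTED_FILES = [
    "cluster.yaml",
    "core-site.xml",
    "job.yaml",
    "secrets.yaml",
    "spark-defaults.conf",
    "spark-env.sh",
    "ssh.yaml",
]


def missing_config_files(config_dir: str) -> list:
    """Return the expected configuration files that are absent from config_dir."""
    root = Path(config_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo: a directory that contains only cluster.yaml.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cluster.yaml").write_text("")
    demo_missing = missing_config_files(tmp)

print(demo_missing)
```

On a real system you would call missing_config_files(".aztk") from the directory in which you ran the init command.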

Azure Resources and Credentials

To be able to work with the toolkit, you have to provision a few Azure resources, e.g. Azure Batch, an Azure Storage Account, a Service Principal, etc.

The easiest way to do this is to log in to the Azure Portal and execute the following command in the Azure Cloud Shell.

wget -q https://raw.githubusercontent.com/Azure/aztk/v0.7.0/account_setup.sh -O account_setup.sh &&
chmod 755 account_setup.sh &&
/bin/bash account_setup.sh

After you answer a few questions, the command returns the required settings, which you add/update in the .aztk/secrets.yaml file.

Adding the Azure credentials to the secrets.yaml file

Provision your first Apache Spark cluster

Finally, we’re ready to provision our first Apache Spark cluster using AZTK.

aztk spark cluster create --id mycluster --size 0 --size-low-priority 5 --vm-size standard_d12_v2

With the id parameter, you specify a unique ID (within your Azure Batch account) for your cluster.
The size parameter specifies the number of dedicated virtual machines (which are charged at full price).
The size-low-priority parameter specifies the number of low-priority virtual machines (which are charged at 20% of the regular price). This, of course, comes with a disadvantage: if Azure needs the capacity for other workloads, these virtual machines can be preempted at any time.
The vm-size parameter specifies the type of virtual machine to use.
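
To get a feeling for what the two size parameters mean for your bill, here is a rough sketch of the arithmetic. It assumes the 20% rate mentioned above; the hourly VM price is a hypothetical placeholder, so look up the real price of your VM size before relying on the result.

```python
def hourly_cluster_cost(size: int, size_low_priority: int,
                        price_per_vm_hour: float,
                        low_priority_factor: float = 0.2) -> float:
    """Estimate the hourly compute cost of a cluster.

    size               -- number of dedicated VMs (full price)
    size_low_priority  -- number of low-priority VMs (discounted)
    price_per_vm_hour  -- regular hourly price of one VM (placeholder value)
    low_priority_factor -- share of the regular price paid for low-priority VMs
    """
    dedicated = size * price_per_vm_hour
    low_priority = size_low_priority * price_per_vm_hour * low_priority_factor
    return dedicated + low_priority


# Hypothetical price of 1.00 per VM hour, used only to compare the two setups.
print(hourly_cluster_cost(size=0, size_low_priority=5, price_per_vm_hour=1.0))
print(hourly_cluster_cost(size=5, size_low_priority=0, price_per_vm_hour=1.0))
```

With these assumptions, five low-priority VMs cost about as much per hour as one dedicated VM of the same size.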

Provision your first Apache Spark cluster

You’re also able to use the Azure N-Series virtual machines to provision GPU enabled clusters.

Getting Cluster Information

While a cluster is being provisioned, running, or being deleted, you can use the following commands to get more details:

aztk spark cluster list

List all Apache Spark clusters

aztk spark cluster get --id mycluster

Get detailed information about a single cluster

Connect to the cluster

With the following command, you can connect to the master node of your cluster via SSH and set up port forwarding to the services (and plugins) running on the cluster.

aztk spark cluster ssh --id mycluster

After the connection has been established, you can use the port forwarding to access services like the Spark Web UI.

Access the Spark Web UI through the port forwarding
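
If the page does not load, it can help to check whether the forwarded port is reachable on your local machine at all. The following sketch uses only the Python standard library. Port 8080 is Spark's default port for the master web UI; whether your tunnel exposes it under that local port depends on your ssh.yaml settings, so treat the number as an assumption.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Assumed local port of the forwarded Spark web UI.
print(is_port_open("127.0.0.1", 8080))
```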

Deleting a cluster

Last but not least, don’t forget to delete the cluster if you don’t need it anymore.

aztk spark cluster delete --id mycluster
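
If you want to script the cluster lifecycle instead of typing the commands by hand, the command lines from this post can be assembled as argument lists. This is a minimal sketch with hypothetical helper names; it uses only the subcommands and flags shown in this post and does not execute anything by itself.

```python
from typing import List


def create_command(cluster_id: str, size: int, size_low_priority: int,
                   vm_size: str) -> List[str]:
    """Argument list for creating a cluster."""
    return [
        "aztk", "spark", "cluster", "create",
        "--id", cluster_id,
        "--size", str(size),
        "--size-low-priority", str(size_low_priority),
        "--vm-size", vm_size,
    ]


def cluster_command(action: str, cluster_id: str) -> List[str]:
    """Argument list for the get, ssh, and delete subcommands."""
    if action not in ("get", "ssh", "delete"):
        raise ValueError(f"unsupported action: {action}")
    return ["aztk", "spark", "cluster", action, "--id", cluster_id]


print(" ".join(create_command("mycluster", 0, 5, "standard_d12_v2")))
print(" ".join(cluster_command("delete", "mycluster")))
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine where the toolkit is installed and configured.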

Demo

To see a demo of how to set up AZTK and provision your first Spark cluster, I created a short video:

How to create a low-cost Spark cluster on Azure on my YouTube Channel
