Data Insights &amp; Cloud A blog about Big Data, Data Science and the Cloud https://datainsights.cloud/ Fri, 19 Dec 2025 12:24:54 +0100 Configuring an Azure VNET to use AZTK in mixed mode <p>In my last post, I showed you how to provision a low-cost Apache Spark cluster on Microsoft Azure, with the help of the Azure Batch service, Low Priority Virtual Machines, and the Azure Distributed Data Engineering Toolkit (AZTK).</p> <p>But have you tried to build a cluster that mixes dedicated and low-priority virtual machines?</p> <p>If you did, you probably ran into an error…</p> <!--more--> <h2>Mixing Dedicated and Low-Priority Virtual Machines</h2> <p>I tried to provision a Spark cluster with the following command:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">aztk spark cluster create <span class="nt">--id</span> mycluster <span class="nt">--size</span> 1 <span class="nt">--size-low-priority</span> 2 <span class="nt">--vm-size</span> standard_d12_v2</code></pre></figure> <p>But all I got was the following error message:</p> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkMixedMode01.png" alt="You must configure a VNET to use AZTK in mixed mode (dedicated and low priority nodes)" /> <figcaption class="caption-text">You must configure a VNET to use AZTK in mixed mode (dedicated and low priority nodes)</figcaption> </figure> <h2>What do I need a mixed-mode cluster for?</h2> <p>But let me first start with the why.</p> <p>Azure offers low-priority virtual machines (VMs) to reduce the cost of your workloads. Low-priority VMs make new types of workloads possible by enabling a large amount of compute power to be used for a very low cost.</p> <p>Low-priority VMs take advantage of surplus capacity in Azure. 
When you specify low-priority VMs in your cluster, Azure can use this surplus when it is available.</p> <p>The tradeoff for using low-priority VMs is that those VMs may not be available to be allocated or may be preempted at any time, depending on available capacity.</p> <p>Dedicated VMs stay online all the time to process your Spark jobs, even if some of the low-priority VMs are offline. That’s why you might want to add dedicated VMs to your cluster, too.</p> <p>If you have a mixed-mode Spark cluster with both types of VMs, the master node will always be assigned to a dedicated VM.</p> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkMixedMode02.png" alt="AZTK Spark Cluster running in mixed mode" /> <figcaption class="caption-text">AZTK Spark Cluster running in mixed mode</figcaption> </figure> <h2>Creating an Azure Virtual Network</h2> <p>So the first thing I did was to create an Azure Virtual Network (<a href="https://docs.microsoft.com/en-us/azure/virtual-network/quick-create-portal">How to create a virtual network using the Azure portal</a>) in the AZTK resource group.</p> <p>After that, I opened the <strong>Properties</strong> section of the newly created VNET and copied the Resource ID to my clipboard.</p> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkMixedMode03.png" alt="Azure Virtual Network - Properties" /> <figcaption class="caption-text">Azure Virtual Network - Properties</figcaption> </figure> <h2>Update the cluster.yaml file</h2> <p>I opened a terminal window and used vim to edit the <strong>cluster.yaml</strong> file in the <strong>.aztk</strong> folder.</p> <p>In that file, there is a section called “<strong>To add your cluster to a virtual network provide the full arm resource id below</strong>”.</p> <p>I removed the comment symbol from the line that contains the “<strong>subnet_id:</strong>” property and added the Resource ID of the virtual network.</p> <p>I also added the name 
of the subnet with the <strong>/subnets/</strong> prefix.</p> <figure class="highlight"><pre><code class="language-plain" data-lang="plain">subnet_id: /subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/aztk/providers/Microsoft.Network/virtualNetworks/aztk-vnet/subnets/default</code></pre></figure> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkMixedMode04.png" alt=".aztk/cluster.yaml Subnet Settings" /> <figcaption class="caption-text">.aztk/cluster.yaml Subnet Settings</figcaption> </figure> <h2>Create a cluster</h2> <p>After saving the changes to the cluster.yaml file, I was able to create a mixed-mode cluster with the toolkit.</p> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkMixedMode05.png" alt="AZTK - Creating a mixed-mode Spark cluster" /> <figcaption class="caption-text">Creating a mixed-mode Spark cluster</figcaption> </figure> Wed, 20 Jun 2018 13:31:32 +0200 https://datainsights.cloud/2018/06/20/aztk-mixed-mode/ big data microsoft azure apache spark azure batch How to create a low-cost Apache Spark cluster on Microsoft Azure <p>A few months ago, I found a nice little open-source tool on GitHub called AZTK, which provides a fast and easy way to provision low-cost Apache Spark clusters on Microsoft Azure.</p> <p>In this blog post, I would like to show you how to install the <strong>Azure Distributed Data Engineering Toolkit (AZTK)</strong> on your Windows, Linux, or macOS system, and how to provision your first Apache Spark cluster with it.</p> <!--more--> <h2>Azure Distributed Data Engineering Toolkit (AZTK)</h2> <p>The <strong>Azure Distributed Data Engineering Toolkit (AZTK)</strong> is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. 
It’s a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.</p> <p>This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.</p> <p>For more details, please have a look at the project on <a href="https://github.com/Azure/aztk">GitHub</a>.</p> <h3>Notable Features</h3> <ul> <li>Spark cluster provision time of 5 minutes on average</li> <li>Spark clusters run in Docker containers</li> <li>Run Spark on a GPU-enabled cluster</li> <li>Users can bring their own Docker image</li> <li>Ability to use low-priority VMs for an 80% discount</li> <li>Mixed Mode clusters that use both low-priority and dedicated VMs</li> <li>Built-in support for Azure Blob Storage and Azure Data Lake connection</li> <li>Tailored pythonic experience with PySpark, Jupyter, and Anaconda</li> <li>Tailored R experience with SparklyR, RStudio-Server, and Tidyverse</li> <li>Ability to run spark submit directly from your local machine’s CLI</li> </ul> <h2>Getting Started</h2> <h3>Install Python 3</h3> <p>Before you install the Azure Distributed Data Engineering Toolkit, you need Python 3, as well as pip3, installed on your system.</p> <p>To do this, please have a look at <a href="https://www.python.org">Python.org</a>.</p> <h3>Virtual Environment (optional)</h3> <p>After that, I recommend creating a separate virtual environment for the toolkit.</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">python3 <span class="nt">-m</span> venv aztk</code></pre></figure> <p>Once you’ve created the virtual environment, you can activate it.</p> <p>On Windows, run:</p> <figure class="highlight"><pre><code class="language-plain" data-lang="plain">aztk\Scripts\activate.bat</code></pre></figure> <p>On Unix or macOS, run:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">source </span>aztk/bin/activate</code></pre></figure> 
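<p>To double-check on Linux or macOS that the activation worked before installing anything, a small shell helper along these lines can be handy. This is only a sketch: the function name <strong>venv_status</strong> is made up for this example and is not part of AZTK. It relies on the fact that the standard venv activate script exports the <strong>VIRTUAL_ENV</strong> variable.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"># Hypothetical helper (not part of AZTK): report whether a venv is active.
# The venv activate script exports VIRTUAL_ENV; deactivate unsets it again.
venv_status() {
  if [ -n "${VIRTUAL_ENV:-}" ]; then
    echo "active: ${VIRTUAL_ENV}"
  else
    echo "inactive"
  fi
}

venv_status</code></pre></figure>
<p>If the helper reports that no environment is active, pip would install the toolkit into your system-wide Python instead of the separate environment.</p>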
<h3>Install AZTK</h3> <p>Now you’re ready to install the Azure Distributed Data Engineering Toolkit (AZTK) with a simple:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">pip <span class="nb">install </span>aztk</code></pre></figure> <h3>Initialize AZTK</h3> <p>After installing the toolkit, you’re ready to create your first AZTK environment. To do that, simply run:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">aztk spark init</code></pre></figure> <p>This command creates a .aztk folder in your current directory with the following file structure:</p> <ul> <li>cluster.yaml</li> <li>core-site.xml</li> <li>jars <ul> <li>.null</li> </ul> </li> <li>job.yaml</li> <li>secrets.yaml</li> <li>spark-defaults.conf</li> <li>spark-env.sh</li> <li>ssh.yaml</li> </ul> <p>If you want to create a machine-wide configuration, add the <strong>--global</strong> parameter to the command.</p> <h3>Azure Resources and Credentials</h3> <p>To be able to work with the toolkit, you have to provision a few Azure resources, e.g. 
an Azure Batch account, an Azure Storage account, a service principal, etc.</p> <p>The easiest way to do this is to log in to the <a href="https://portal.azure.com/">Azure Portal</a> and execute the following command in the <a href="https://shell.azure.com/">Azure Cloud Shell</a>.</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">wget <span class="nt">-q</span> https://raw.githubusercontent.com/Azure/aztk/v0.7.0/account_setup.sh <span class="nt">-O</span> account_setup.sh <span class="o">&amp;&amp;</span> <span class="nb">chmod </span>755 account_setup.sh <span class="o">&amp;&amp;</span> /bin/bash account_setup.sh</code></pre></figure> <p>After you answer a few questions, the script returns the required settings, which you add to (or update in) the <strong>.aztk/secrets.yaml</strong> file.</p> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkGettingStarted01.png" alt="secrets.yaml Settings" /> <figcaption class="caption-text">Adding the Azure credentials to the secrets.yaml file</figcaption> </figure> <h2>Provision your first Apache Spark cluster</h2> <p>Finally, we’re ready to provision our first Apache Spark cluster using AZTK.</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">aztk spark cluster create <span class="nt">--id</span> mycluster <span class="nt">--size</span> 0 <span class="nt">--size-low-priority</span> 5 <span class="nt">--vm-size</span> standard_d12_v2</code></pre></figure> <ul> <li>With the <strong>id</strong> parameter, you specify a unique ID (within your Azure Batch account) for your cluster.</li> <li>The <strong>size</strong> parameter specifies the number of dedicated virtual machines (which are charged at the full price).</li> <li>The <strong>size-low-priority</strong> parameter specifies the number of low-priority virtual machines (which are charged at 20% of the regular price). This, of course, comes with a disadvantage. 
If Azure needs the capacity for other workloads, these virtual machines can be preempted and removed from your cluster.</li> <li>The <strong>vm-size</strong> parameter specifies the <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes">type of the virtual machines</a> to use.</li> </ul> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkGettingStarted02.png" alt="Provision Apache Spark Cluster" /> <figcaption class="caption-text">Provision your first Apache Spark cluster</figcaption> </figure> <blockquote> <p>You’re also able to use the Azure N-Series virtual machines to provision GPU-enabled clusters.</p> </blockquote> <h2>Getting Cluster Information</h2> <p>While a cluster is being provisioned, running, or being deleted, you can use the following commands to get more details:</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">aztk spark cluster list</code></pre></figure> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkGettingStarted03.png" alt="List Apache Spark Clusters" /> <figcaption class="caption-text">List all Apache Spark clusters</figcaption> </figure> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">aztk spark cluster get <span class="nt">--id</span> mycluster</code></pre></figure> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkGettingStarted04.png" alt="Get Cluster Details" /> <figcaption class="caption-text">Get detailed information about a single cluster</figcaption> </figure> <h2>Connect to the cluster</h2> <p>With the following command, you can connect to the master node of your cluster via SSH and set up port forwarding to the services (and plugins) running on the cluster.</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">aztk spark cluster ssh <span class="nt">--id</span> mycluster</code></pre></figure> <figure class="caption"> <img 
src="https://datainsights.cloud/images/posts/AztkGettingStarted05.png" alt="Get Cluster Details" /> <figcaption class="caption-text">Get detailed information about a single cluster</figcaption> </figure> <p>After the connection has been established, you can use port forwarding to access services like the Spark Web UI.</p> <figure class="caption"> <img src="https://datainsights.cloud/images/posts/AztkGettingStarted06.png" alt="Apache Spark Web UI" /> <figcaption class="caption-text">Access the Spark Web UI through port forwarding</figcaption> </figure> <h2>Deleting a cluster</h2> <p>Last but not least, don’t forget to delete the cluster when you no longer need it.</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash">aztk spark cluster delete <span class="nt">--id</span> mycluster</code></pre></figure> <h2>Demo</h2> <p>To show how to set up AZTK and provision your first Spark cluster, I created a short video:</p> <iframe width="500" height="281" src="https://www.youtube.com/embed/Kr62gDdRMyQ" frameborder="0" allow="autoplay; encrypted-media" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe> <p><a href="https://youtu.be/Kr62gDdRMyQ" target="_blank">How to create a low-cost Spark cluster on Azure</a> on my <a href="https://www.youtube.com/SaschaDittmann" target="_blank">YouTube Channel</a></p> Sat, 16 Jun 2018 12:18:17 +0200 https://datainsights.cloud/2018/06/16/how-to-create-a-low-cost-spark-cluster-on-azure/ big data microsoft azure apache spark azure batch