Microsoft’s new Azure Data Lake service offers just as much insight into Microsoft’s strategic plans as it does new technology.
What is Azure Data Lake?
Azure Data Lake is the name of a new analytics service within the Microsoft Azure cloud platform. It actually consists of two PaaS services - Azure Data Lake Store and Azure Data Lake Analytics. The Store service provides petabyte-scale, HDFS-compatible (Hadoop-format) storage that can hold both structured and unstructured data, while the Analytics service executes queries against petabyte-scale data sources whilst dynamically scaling the infrastructure resources it requires to run them. Although these capabilities may at first just feel like the evolution of Microsoft’s existing data platform services, they also provide a few insights into Microsoft’s cloud and data strategies.
Microsoft branded open source software
We’re probably all familiar by now with the “Microsoft loves open source” tagline. It’s used to describe Microsoft’s willingness to share the source code for its development tools - PowerShell and .NET, for example - with the open source software (OSS) community. It also describes how Microsoft is engineering its tools and software, wherever it can, to work well on Linux platforms.
However, as well as making its own software as open source friendly as possible, Microsoft has also created its own builds of OSS: for example, Microsoft HDInsight Server is a 100% compatible variant of Apache Hadoop, and Microsoft R Server is an enterprise-grade variant of the R statistics platform.
Azure Data Lake continues this approach, meeting the needs of Microsoft customers who want access to increasingly popular open source technologies but who also want a receipt and a support contract in case anything goes wrong.
The Azure Data Lake services provide Microsoft variants of the Apache HDFS file system and the Hadoop family of query engines - a clear sign that Microsoft is still keen to provide its customers with the technologies they need, even if it didn’t originally create them.
Cloud-only advanced analytics
Doing anything with terabytes of data, never mind petabytes, means having large amounts of storage and compute resources available. While it’s not impossible to deploy those kinds of capabilities in on-premises data centres, it’s rarely a cost-effective option, especially when large amounts of compute resource might only be required for a few hours a month.
Once again, Microsoft has leveraged its cloud-scale deployment of storage and compute resources to create an advanced analytics capability that wouldn’t be feasible to deploy on-premises. Like the compute-intensive Azure Machine Learning service, no on-premises variant of Azure Data Lake is expected. It’s another sign that Microsoft’s advanced analytics innovation has taken a cloud-only strategy.
Real-time multi-dimensional pricing
One of the first published benefits of cloud services, nearly five years ago now, was their pay-per-use charging model. Rather than paying all of the time for technology to be powered on in your own data centre, you could pay for a cloud service only as and when you needed it. In reality, you now often pay for PaaS services across three dimensions: storage, features and compute.
Microsoft applied some of those principles in the pricing model for the Azure SQL Data Warehouse service by separating the cost of data storage from the cost of the compute resources used by queries. This means you pay for compute only when you need it, and only for as much as you use.
The Azure Data Lake pricing model also separates the cost of data storage from the cost of the compute resources used to run queries. However, it also lets you determine how much compute resource is used to execute its large distributed queries. This gives you the option of cheap but slow queries or faster but more expensive ones – you get to decide. Yet again, we can see Microsoft creating far more flexible commercial models than its traditional on-premises software licensing allowed.
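To make that trade-off concrete, here’s a minimal sketch of the arithmetic, assuming a hypothetical per-unit-hour price and illustrative job durations - the numbers and the `job_cost` helper are my own, not published Microsoft rates:

```python
# Hypothetical cost comparison for a large distributed query.
# All figures are illustrative assumptions, not real Azure pricing.

AU_HOUR_PRICE = 2.00  # assumed price per analytics-unit-hour, in USD


def job_cost(allocated_units: int, duration_hours: float) -> float:
    """Cost = compute units allocated x hours the job runs x unit-hour rate."""
    return allocated_units * duration_hours * AU_HOUR_PRICE


# The same query run two ways: a small allocation that takes all morning,
# versus a large allocation that finishes in just over an hour (parallel
# overhead means the fast run consumes slightly more total unit-hours).
cheap_and_slow = job_cost(allocated_units=4, duration_hours=10.0)   # 80.0
fast_and_pricey = job_cost(allocated_units=40, duration_hours=1.25)  # 100.0

print(cheap_and_slow, fast_and_pricey)
```

The point is simply that duration and cost become two knobs you can trade against each other per query, rather than a fixed property of hardware you already bought.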
A universal query language
The strength of a programming language normally reflects the kind of work it was designed for. In the Microsoft world, we know that T-SQL is good at querying structured data, R at performing statistical analysis and C# at expressing application logic.
The final insight into Microsoft’s analytics strategy that Azure Data Lake gives us is that it can query both unstructured data stored in HDFS and structured data stored in Azure SQL Data Warehouse – in a single query written in an expressive language. To do this, Microsoft has created U-SQL, short for Universal-SQL, which will look familiar enough to database developers, data scientists and application developers to let them get querying on day one. While U-SQL isn’t a massive deviation from any of Microsoft’s previous languages, I still expect it to reappear in future analytics services.
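To give a flavour of the language, here’s a minimal U-SQL sketch - the file paths and column names are illustrative, not from a real dataset - that imposes a schema on an unstructured file in the store at extraction time, aggregates it with SQL-style syntax, and writes the result back out:

```usql
// Read a tab-separated file from the Data Lake Store, imposing a
// schema at EXTRACT time (path and columns are made up for this example).
@searchlog =
    EXTRACT UserId int,
            Start  DateTime,
            Region string,
            Query  string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Familiar SQL-style aggregation; C#-style expressions and types
// are also available throughout, which is where the blend shows.
@totals =
    SELECT Region,
           COUNT(*) AS QueryCount
    FROM @searchlog
    GROUP BY Region;

// Write the result back to the store as CSV.
OUTPUT @totals
TO "/output/QueriesByRegion.csv"
USING Outputters.Csv();
```

A database developer will recognise the SELECT, a C# developer the type names and expression syntax - which is exactly the day-one familiarity the language is aiming for.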
The Microsoft Azure Data Lake service might still be in its public preview phase, but it’s already given us some clear insights into how Microsoft is evolving its analytics capabilities.