Large companies across a variety of industries have discovered hidden business value within their existing data, and using machine learning algorithms to surface insights from that data has quickly become the norm.
Despite its benefits, machine learning for data analytics poses some challenges, particularly where storage infrastructure is concerned.
Because data can contain hidden value, organizations may be less inclined to purge aging data. This causes storage to be consumed at an accelerated rate, complicating capacity planning efforts. Furthermore, the analytical processes themselves place an additional load on the underlying storage infrastructure.
Somewhat ironically, several vendors have begun using AI as a tool for solving problems created by big data analytics. As it stands now, they have not based their machine learning for analytics efforts on a single technology, but rather on a disparate collection of technologies.
When it comes to using AI for tasks such as workload profiling and capacity planning, it is important to have access to current data pertaining to storage use and health. Even so, relying exclusively on real-time data may not always be desirable.
The problem with using real-time streaming data is that the data is raw and completely uncurated. Imperfections in the data stream might exist, and the fact that the data is being used in real time greatly limits the amount of processing that can be done.
Using data that is relatively current, but not real time, can often yield more information through machine learning for data analytics. The tradeoff is that this data is not as up to date as the data that is streaming right now.
The Lambda architecture addresses this problem by simultaneously streaming data into two different layers: the batch layer and the speed layer. The batch layer’s job is simply to store the data. Because this data is not being acted on in real time, batch rules can be used to improve the quality of the data. In some models, the batch layer can also make data available to a third layer — the serving layer — that creates batch views in response to query requests.
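To make the idea of batch rules concrete, here is a minimal Python sketch. The `Reading` record, its fields, and the two rules (drop impossible values, drop duplicate deliveries) are all assumptions chosen for illustration; the article does not say which rules any particular product applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical storage telemetry record (illustrative fields only)."""
    volume: str      # name of the storage volume
    timestamp: int   # seconds since epoch
    used_gb: float   # capacity in use, as reported


def apply_batch_rules(raw):
    """Return a cleaned copy of the raw readings.

    Because the batch layer is not answering queries in real time,
    it can afford a full pass over the stored data.
    """
    seen = set()
    cleaned = []
    for r in sorted(raw, key=lambda r: (r.volume, r.timestamp)):
        if r.used_gb < 0:
            continue  # rule 1 (assumed): drop physically impossible values
        key = (r.volume, r.timestamp)
        if key in seen:
            continue  # rule 2 (assumed): drop duplicate deliveries
        seen.add(key)
        cleaned.append(r)
    return cleaned


raw = [
    Reading("vol1", 100, 40.0),
    Reading("vol1", 100, 40.0),   # same reading delivered twice
    Reading("vol1", 160, -1.0),   # glitch in the stream
    Reading("vol2", 100, 12.5),
]
cleaned = apply_batch_rules(raw)
```

Rules like these require sorting and remembering what has already been seen, which is the kind of processing that is hard to fit into a real-time path.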
Inbound data is also streamed into the speed layer, which provides real-time data views.
When a query is made against the Lambda architecture, organizations obtain results by merging analysis from both the batch view and the real-time (speed) view. This enables the Lambda architecture to provide a more comprehensive and complete picture of the data than might otherwise be possible.[…]
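The flow described above can be sketched in a few lines of Python. This is a single-process toy, not a production design: the class name `LambdaSketch`, its methods, and the "gigabytes written per volume" metric are hypothetical, and a real deployment would typically run the layers as separate distributed systems.

```python
from collections import defaultdict


class LambdaSketch:
    """Toy model of the batch, serving, and speed layers (names assumed)."""

    def __init__(self):
        self.master = []                       # batch layer: stores every event
        self.batch_view = {}                   # serving layer: precomputed totals
        self.speed_view = defaultdict(float)   # speed layer: totals since last batch run

    def ingest(self, volume, gb_written):
        # Each inbound event is streamed into both layers.
        self.master.append((volume, gb_written))
        self.speed_view[volume] += gb_written

    def run_batch(self):
        # Recompute the batch view from the full stored data set.
        view = defaultdict(float)
        for volume, gb in self.master:
            view[volume] += gb
        self.batch_view = dict(view)
        # The batch view now covers these events, so the speed view restarts.
        self.speed_view.clear()

    def query(self, volume):
        # Merge the batch view with the real-time (speed) view.
        return self.batch_view.get(volume, 0.0) + self.speed_view.get(volume, 0.0)


arch = LambdaSketch()
arch.ingest("vol1", 10.0)
arch.ingest("vol1", 5.0)
arch.run_batch()            # batch view now holds 15.0 for vol1
arch.ingest("vol1", 2.5)    # arrives after the batch run
total = arch.query("vol1")  # batch 15.0 + speed 2.5
```

Clearing the speed view the moment a batch run finishes is a simplification. In practice the batch job takes time, so the speed layer usually has to keep covering events that arrived while the batch was running.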