When people think of AI-driven products, they think of neural networks and deep learning. What people seem to forget, however, is that the machine learning model is just a small part of a larger data architecture. The data architecture lays the foundation for working with data in the first place and enables business intelligence, analytics and machine learning models. In this post we will look into the components of an example data architecture. We’ll briefly describe each component by answering the following three questions:
Here we’ll briefly answer why this component is needed and what purpose it serves.
Here we’ll list the main properties we’d like this component to have. If, for example, we have high I/O throughput, we’d like a system that can handle it. If, on the other hand, we have lots of data, we’d like a system that scales with it. If we need real-time access, the system should support that.
Sometimes properties contradict each other; in other words, you can’t have both at the same time. A classic example is Brewer’s CAP theorem. In that case it’s important to choose the properties best suited to the specific use case.
There are various products for each component, including many good open-source ones. Here we’ll mention some of them. Sometimes different solutions do almost the same thing; sometimes they focus on different aspects of the properties mentioned above.
Data pipelines can be implemented in a batch fashion (e.g. once a day) or in a streaming fashion (near real time). To keep things simple, we won’t differentiate between the two in this post. Some of the tools we mention are better suited for real-time processing, others for batch processing. As a rule of thumb: real-time pipelines are usually more brittle and involve more work than batch pipelines.
This post only covers the technological side of things. Keep in mind that a successful data organization is not only about technology: other factors such as organizational structure, processes and people play an equally important role.
The solution shown here is just one possible data architecture. It serves as a showcase to better understand the building blocks and why we need them. Depending on the concrete use case, the components mentioned here might differ or be missing entirely.
The following image shows a possible data architecture. Arrows mark the data flow direction.
Most of the available data is part of the application state, e.g. user profile information, user purchases or item descriptions. This data needs to be accessible in real time and is usually stored in databases.
This depends heavily on the actual use case within the application. Sometimes you want high I/O throughput; in that case NoSQL databases might be a good fit. Sometimes you want to join and filter the data in various ways; in that case traditional relational databases might do the job.
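As a minimal sketch of the relational “join and filter” pattern, here is a query against a small SQLite database; the database file, tables and columns are made up for illustration:

```python
import sqlite3

# Join user profiles with their purchases and filter by price --
# the kind of ad-hoc query a relational database handles well.
conn = sqlite3.connect("app.db")

recent_big_purchases = conn.execute(
    """
    SELECT u.name, p.item_id, p.price
    FROM users u
    JOIN purchases p ON p.user_id = u.id
    WHERE p.price > ?
    ORDER BY p.created_at DESC
    LIMIT 10
    """,
    (50.0,),
).fetchall()
```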
One key signal for many machine learning and analytics tasks is the interactions users perform in the app, e.g. user viewed item x, user listened to song y, user read article z. This data is not part of the app state. You can imagine that a massive amount of data is produced: every client constantly generates events while the user is active in the app. This data is sent from client devices (smartphone, browser, desktop app) to the backend. The message broker is responsible for receiving this data on the backend and routing it to multiple destinations such as real-time monitoring and long-term storage.
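To make this more concrete, here is a minimal sketch of a backend publishing one interaction event to Apache Kafka, one popular message broker; the broker address, topic name and event fields are assumptions for illustration:

```python
import json

from kafka import KafkaProducer  # kafka-python client

# Producer that serializes events as JSON and sends them to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One interaction event coming from a client device.
event = {"user_id": 42, "action": "viewed_item", "item_id": "x", "ts": 1700000000}

producer.send("user-interactions", value=event)
producer.flush()  # make sure the event actually leaves the buffer
```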
We want to store as much data as we can. Even data which is of no use today can become useful in the near future. In the past we were limited by storage space; nowadays, distributed storage systems allow us to store almost unlimited amounts of data.
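As an illustration, a batch of raw events could be appended to an object store such as Amazon S3, partitioned by day so that later jobs only read the slices they need; the bucket name and key layout below are assumptions:

```python
import datetime
import gzip
import json

import boto3

# A small batch of raw interaction events to be archived.
events = [{"user_id": 42, "action": "viewed_item", "item_id": "x"}]

# Compress the batch as newline-delimited JSON.
body = gzip.compress("\n".join(json.dumps(e) for e in events).encode("utf-8"))

today = datetime.date.today().isoformat()
s3 = boto3.client("s3")
s3.put_object(
    Bucket="raw-events",
    Key=f"interactions/date={today}/batch-0001.json.gz",  # partitioned by day
    Body=body,
)
```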
The raw data we’re collecting needs to be preprocessed, for example to generate training sets for machine learning models or derived data sets for the analytics system. This could involve simple aggregation and filtering steps, e.g. group all purchases of a user and take the last 100. It could also involve more complicated feature extraction steps, e.g. cropping and resizing images.
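The “last 100 purchases per user” step could look roughly like this with pandas; the file and column names are assumptions:

```python
import pandas as pd

# Load the raw purchase events produced by the message broker / data storage.
purchases = pd.read_parquet("purchases.parquet")

# Keep the 100 most recent purchases per user.
last_100 = (
    purchases.sort_values("timestamp")
    .groupby("user_id")
    .tail(100)
)

# Write the derived dataset for downstream training or analytics jobs.
last_100.to_parquet("trainset_purchases.parquet")
```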
Machine learning models are able to learn patterns in the data and use them to predict future events. Examples are recommender systems, fraud detection and user churn prediction. The inputs to these models are the datasets generated by the data processing step.
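As a hedged sketch, a simple churn model could be trained on such a dataset with scikit-learn; the file name, feature columns and label are assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Dataset produced by the data processing step: one row per user.
data = pd.read_parquet("trainset_churn.parquet")
X = data[["purchases_last_30d", "sessions_last_30d", "days_since_signup"]]
y = data["churned"]

# Hold out a test split to get a rough quality estimate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```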
For some use cases, the predictions of the ML model are not generated in real time. Recommender systems, for example, might generate recommendations once a day. These recommendations need to be saved in a persistent store from which they are served; databases or key-value stores are usually used for this.
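Here is a minimal sketch of storing and serving precomputed recommendations with Redis as the key-value store; the key scheme and expiry are assumptions:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)

# Precomputed recommendations, e.g. generated once a day by the ML pipeline.
recommendations = {"user_42": ["item_1", "item_7", "item_9"]}

for user_id, items in recommendations.items():
    # Store under a per-user key and expire after a day, when fresh ones arrive.
    r.set(f"recs:{user_id}", json.dumps(items), ex=60 * 60 * 24)

# Serving side: a single key lookup per request.
items = json.loads(r.get("recs:user_42"))
```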
AI is just one part of dealing with data. Reporting, business intelligence (BI) and analytics are probably the more traditional way. Questions like:
are answered here.
Up to now we have individual components, each doing one piece of work. However, we don’t want to execute these steps manually. A workflow management tool connects all these building blocks into data pipelines. One pipeline may generate training sets using the data processing component, then train an ML model, and then predict and store the results in a key-value store. Another pipeline might preprocess raw data from the data storage and transfer it into the analytics system.
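A minimal sketch of such a daily pipeline with Apache Airflow, one popular workflow manager (assuming Airflow 2.x); the task functions are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_trainset():
    ...  # call the data processing component


def train_model():
    ...  # train the ML model on the generated training set


def store_predictions():
    ...  # write the predictions into the key-value store


with DAG(
    dag_id="daily_recommendations",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build = PythonOperator(task_id="build_trainset", python_callable=build_trainset)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    store = PythonOperator(task_id="store_predictions", python_callable=store_predictions)

    # Run the steps in order, once a day.
    build >> train >> store
```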
This is a software project. So on top of all the things above we’d like to have some software development basics such as:
Yes, it is. No doubt. Luckily, it’s never been easier to implement a modern data architecture. There are multiple great open-source solutions which tackle the building blocks we’ve mentioned above. That said, these systems are not always easy to set up and maintain, especially if they’re based on a distributed model (run on multiple machines). Luckily, cloud providers offer managed solutions for many popular data architecture components, in which case you don’t have to worry about setting up and maintaining such a system. In the ideal case you just have to stick the components together.
Absolutely. A modern data architecture lays the technical foundation of a data-driven organization:
At the end of the day, making extensive use of data will give you an advantage over your competitors. The big players realized this a long time ago, and many others are following them.