The Modern Data Platform Design: a tool-agnostic approach
--
The big data ecosystem is still too big! There, I said it. I still remember seeing this post years ago as a young, up-and-coming data scientist, and it resonated with me. When I tried to keep up to date with all the advancements in data infrastructure, I felt overwhelmed by the breadth of tools and resources to learn about. As I gained experience in the domain, I soon realised that when designing, building or maintaining data platforms, tools matter less than use cases. Use cases are more stable than tools. How do you make use cases generic enough to be useful for design? By developing abstraction layers.
I have been thinking about this for a while. At first my layers were too specific, which meant that they were not always reusable. I refined them over time with the lessons learned from building data platforms.
I ultimately identified the following layers:
- ingestion
- storage
- transform
- application
- execution
- governance
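To make the idea concrete, here is a minimal sketch of how the layers above could be treated as the stable part of a design, with tools as interchangeable entries. The `Platform` class, its methods and the tool assignments are hypothetical, invented purely for illustration; they are not part of any library.

```python
from dataclasses import dataclass, field

# The six layers from the list above: the stable part of the design.
LAYERS = (
    "ingestion",
    "storage",
    "transform",
    "application",
    "execution",
    "governance",
)


@dataclass
class Platform:
    """Hypothetical helper: maps each layer to the tool currently filling it."""

    tools: dict[str, str] = field(default_factory=dict)

    def assign(self, layer: str, tool: str) -> None:
        # Tools can be swapped freely, but only within a known layer.
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.tools[layer] = tool

    def missing_layers(self) -> list[str]:
        # Layers with no tool yet, in the order they are listed above.
        return [layer for layer in LAYERS if layer not in self.tools]


# Example assignments (illustrative only, not recommendations).
platform = Platform()
platform.assign("ingestion", "Fivetran")
platform.assign("execution", "Airflow")
gaps = platform.missing_layers()
```

Replacing Fivetran with another ingestion tool changes a single entry, while the layer list stays the same, which is the sense in which use cases outlive tools.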
Ingestion
The first step towards using data effectively to help a business make decisions is gathering all the data from first, second and third parties into one place. The intelligence in BI is not possible without the insights that come from joining and analysing data from multiple sources. The ingestion layer covers the tools that transfer data from its sources to centralised storage.
Before the widespread use of cloud data integration solutions, this work was done with ad-hoc scripts and orchestration tools (usually Airflow + Hadoop / Spark). As use cases became more and more standardised, cloud solutions emerged that considerably reduce the engineering effort required to ingest well-known data sources (Stitch, Fivetran, Airbyte, Segment, …).
The emergence of engineering roles focused on creating usable datasets for analysts has been the most critical addition to data roles in recent years.
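For a sense of what the ad-hoc ingestion scripts mentioned above looked like, here is a toy sketch that copies two CSV exports into one central store and then joins them. Everything in it is an assumption made for illustration: the `ingest_csv` helper, the table names and the sample data are invented, and an in-memory SQLite database stands in for the centralised storage.

```python
import csv
import io
import sqlite3


def ingest_csv(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Copy the rows of a CSV export into `table`; return the number of rows loaded."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if not columns:
        return 0  # empty export, nothing to load
    # Table and column names are trusted here; a real script would validate them.
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
    rows = [tuple(row[c] for c in columns) for row in reader]
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


# Two hypothetical sources landing in the same central store.
warehouse = sqlite3.connect(":memory:")
crm_export = "customer_id,country\n1,FR\n2,UK\n"
billing_export = "customer_id,amount\n1,30\n2,45\n"
ingest_csv(warehouse, "crm_customers", crm_export)
ingest_csv(warehouse, "billing_invoices", billing_export)

# Once both sources sit in one place, they can be joined.
joined = warehouse.execute(
    "SELECT c.country, b.amount "
    "FROM crm_customers c "
    "JOIN billing_invoices b ON c.customer_id = b.customer_id "
    "ORDER BY c.customer_id"
).fetchall()
```

Scheduling, retries, schema changes and incremental loads are all missing from this sketch, and that is roughly the engineering effort the managed ingestion tools take over.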
Storage
The storage layer is the essential piece of all data platforms. This is the central repository of all data that will ultimately…