For several years, the concept of a data-driven organization has been steadily gaining popularity. However, the experience of many companies shows that transforming an organization into a data-driven one is not as easy as training slogans and marketing materials suggest.
In a typical large organization today, the problem is no longer a lack of data or the inability to access it. Instead, we face an overabundance of data, which often leads to:
- difficulties in accessing the right data,
- a lack of reliable data,
- inconsistent analyses,
- incorrect or delayed conclusions.
In this article, I will discuss the primary causes of this phenomenon and highlight common issues I have faced in recent years. Toward the end, I will present recommendations for those involved in data collection and analysis. It’s crucial to note that I’m not describing ideal scenarios but real-life examples that do not always fit within theoretical frameworks, standards, or budgets.
The following material contributes to the ongoing discussion on recommended directions for the development of ETL (Extract, Transform, Load) systems and data analysis. In future publications, I will provide a more technical approach to the topic.
Table of Contents
- A Look at the History of Data Analysis
- How Did the Data Lake Concept Emerge?
- What Knowledge Must a Data Analyst Have?
- What Is Necessary to Achieve Reliable Analytical Results?
- Summary – How Not to Drown in a Data Lake?

A Look at the History of Data Analysis
Simplifying somewhat, it can be assumed that the history of corporate data analysis systems began in the 1980s with the canonical works of Bill Inmon, the father of the data warehouse concept. Inmon’s approach emphasized:
- storing data in a business model tailored to the analytical needs of the organization,
- maintaining the immutability of stored data over time,
- ensuring only correct, high-quality, and verified data were stored,
- providing a single source of truth for all analytical needs.
This concept worked very well for organizations with well-defined needs and expectations operating in a relatively slow-changing business environment.
However, the shortcomings of Bill Inmon’s original concept, such as:
- a lengthy development process,
- a complicated approach to implementing changes and development,
- high maintenance costs,
became evident with the significant acceleration of changes in the business environment caused by the transition to virtual processes.
How Did the Data Lake Concept Emerge?
The business need for “fast and cheap” access to data led to new approaches to data storage, including the data lake concept. Raw data, combined with ad hoc analysis tools (e.g., Tableau, Power BI, and others), allows analysts to work with data immediately upon acquisition. As a result, analysts no longer work with standardized and verified data but with the full spectrum of data and analyses from various sources, often of unknown structure and unverified reliability.
This often results in unexpected outcomes. I will now share some real-world examples and explain the reasons behind the inconsistencies.
Example 1 – Analyzing the number of Internet Banking System logins on a given day
- Google Analytics: 52,763
- Application Log: 47,391
- Cause of discrepancy: the same metric name, “number of logins,” does not always mean the same thing across systems. In this example, the first source incorrectly used the term for an event covering every visit to the login page, including successful, failed, and abandoned login attempts, while the second source used it exclusively for successful logins (see the sketch below).
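To make the ambiguity concrete, here is a minimal Python sketch. The event log and its `status` field are hypothetical, invented purely for illustration; the point is that two reasonable definitions of “number of logins”, applied to the same events, yield two different figures.

```python
# Hypothetical login-page events; field names are illustrative only.
events = [
    {"user": "u1", "status": "success"},
    {"user": "u2", "status": "failed"},
    {"user": "u3", "status": "abandoned"},
    {"user": "u1", "status": "success"},
]

# Definition 1 – every visit to the login page (what a web-analytics tool might report).
login_page_visits = len(events)

# Definition 2 – successful authentications only (what an application log might record).
successful_logins = sum(1 for e in events if e["status"] == "success")

print(f"login page visits: {login_page_visits}")  # 4
print(f"successful logins: {successful_logins}")  # 2
```

Until both sides agree on a single definition, comparing the two numbers is meaningless.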
Example 2 – Statistically anomalous data: daily number of “internal transfer” operations
- Data values for a certain date range significantly deviated from the average for that category on the same day of the week.
- Cause of discrepancy: an undetected failure of the communication component responsible for transferring data from a supplementary system. Because the ETL process lacked validation, the absence of a single attribute did not generate an error or warning; instead, it produced an incorrect report in which some operations were wrongly classified into another category (a minimal validation sketch follows below).
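The following sketch shows one way such a gap can be closed. It is not the pipeline from the example; the field names (`operation_id`, `operation_type`, and so on) are assumptions made for illustration. The idea is simply that an incomplete record should be rejected and reported, never silently routed into the wrong category.

```python
# Hypothetical required attributes for an incoming operation record.
REQUIRED_FIELDS = {"operation_id", "operation_type", "amount", "source_system"}

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors for a single incoming record."""
    missing = REQUIRED_FIELDS - record.keys()
    return [f"missing required field: {name}" for name in sorted(missing)]

def load(records: list[dict]) -> list[dict]:
    """Accept only complete records; reject and report incomplete ones."""
    accepted, rejected = [], []
    for record in records:
        errors = validate_record(record)
        (rejected if errors else accepted).append(record)
    if rejected:
        # In Example 2 the missing attribute was swallowed silently;
        # here it surfaces as an explicit warning instead.
        print(f"WARNING: rejected {len(rejected)} of {len(records)} records")
    return accepted

# The second record lacks 'operation_type', so it is rejected
# rather than falling through into another category.
rows = [
    {"operation_id": 1, "operation_type": "internal_transfer", "amount": 100.0, "source_system": "core"},
    {"operation_id": 2, "amount": 50.0, "source_system": "core"},
]
clean_rows = load(rows)
```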
Example 3 – Statistically anomalous data: daily number of ATM transactions
- Data values for a certain date range were zero.
- Cause of discrepancy: the initial assumption was a device failure that prevented any operations. However, the automatic monitoring system did not detect any technical issues. A physical inspection of the device was required, and the technician reported that the ATM’s screen and keyboard had been spray-painted, making the machine unusable (a simple detection sketch follows below).
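Zero values like these can be caught before they reach a report. The sketch below is a deliberately simple check, assuming nothing beyond a per-day transaction count: it flags days whose count is zero or deviates strongly from the average for the same day of the week, echoing the comparison used in Example 2. The threshold and the sample numbers are invented for illustration.

```python
from datetime import date
from statistics import mean

def flag_suspicious_days(daily_counts: dict[date, int], threshold: float = 0.5) -> list[date]:
    """Flag days whose count is zero or deviates from the same-weekday average by more than `threshold`."""
    by_weekday: dict[int, list[int]] = {}
    for day, count in daily_counts.items():
        by_weekday.setdefault(day.weekday(), []).append(count)

    flagged = []
    for day, count in daily_counts.items():
        baseline = mean(by_weekday[day.weekday()])
        if count == 0 or (baseline > 0 and abs(count - baseline) / baseline > threshold):
            flagged.append(day)
    return flagged

# Illustrative ATM transaction counts: the zero days are flagged immediately,
# prompting an investigation before the numbers reach any report.
counts = {
    date(2024, 3, 4): 420, date(2024, 3, 5): 415, date(2024, 3, 6): 0,
    date(2024, 3, 7): 0,   date(2024, 3, 8): 510, date(2024, 3, 11): 430,
}
print(flag_suspicious_days(counts))  # [datetime.date(2024, 3, 6), datetime.date(2024, 3, 7)]
```

A flagged day does not say what went wrong, only that the numbers deserve a closer look before anyone draws conclusions from them.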
These examples highlight how important it is for analysts to pay close attention to the quality and reliability of analytical results and reports. Uncritical trust in the data can lead to false conclusions.
What Knowledge Must a Data Analyst Have?
A data analyst must have a solid understanding of:
- data sources,
- data meanings and the metrics used,
- data acquisition and processing, as well as potential ETL process issues,
- data dependencies, both within a single system and in relation to other systems.
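One practical way to make this knowledge explicit and shareable is a lightweight metric dictionary. The structure below is a hypothetical sketch, not a prescribed format; the two example entries simply restate the login metrics from Example 1.

```python
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    """One entry in a lightweight metric dictionary that analysts can consult before comparing numbers."""
    name: str
    source_system: str
    definition: str      # what the number actually counts
    known_caveats: str   # acquisition quirks, ETL dependencies, related systems

catalog = [
    MetricDefinition(
        name="number_of_logins",
        source_system="Google Analytics",
        definition="Every visit to the login page, including failed and abandoned attempts.",
        known_caveats="Counts client-side events, so it may differ from server-side records.",
    ),
    MetricDefinition(
        name="number_of_logins",
        source_system="Application Log",
        definition="Successful authentications only.",
        known_caveats="Delivered through an ETL process whose validation status should be checked.",
    ),
]
```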

What Is Necessary to Achieve Reliable Analytical Results?
To ensure reliable analytical results, it’s essential to:
- establish principles for validating data completeness and consistency,
- verify ETL processes related to data handling,
- validate the analysis itself, including confirming that the correct datasets were examined and analyzing all data rejected during the process (a minimal completeness check is sketched below).
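As a minimal illustration of the first and third points, the reconciliation check below treats an ETL run as balanced only when every source record is either loaded or explicitly rejected. The function and its parameters are assumptions made for the sake of the example, not part of any particular tool.

```python
def reconcile(source_count: int, loaded_count: int, rejected_count: int) -> None:
    """Completeness check after an ETL run: every source record must be either loaded or explicitly rejected."""
    unaccounted = source_count - loaded_count - rejected_count
    if unaccounted != 0:
        raise ValueError(
            f"{unaccounted} record(s) neither loaded nor rejected -- data is being lost silently"
        )
    if rejected_count > 0:
        # Rejected data should be reviewed, not ignored.
        print(f"NOTE: {rejected_count} record(s) rejected; analyze them before trusting the report")

# Illustrative numbers only: the run balances, but the rejects still demand a review.
reconcile(source_count=10_000, loaded_count=9_990, rejected_count=10)
```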
Summary – How Not to Drown in a Data Lake?
While access to data is widespread, an overabundance of information creates challenges concerning its quality, consistency, and reliability. To avoid drawing incorrect conclusions, it’s crucial to implement robust validation mechanisms and quality controls at every stage of the analysis process. Real-world examples demonstrate that every data analysis requires a deep understanding of the data’s sources and processing methods.