Data engineering

The FMCG (fast-moving consumer goods) sector remains among the most competitive and rapidly growing verticals in the world. Meeting diverse customer needs and wants, responding to competitors, and maintaining adequate profit margins are its most urgent challenges. In response, companies use various tools and systems to optimize operational processes so that value is delivered smoothly to customers. Companies also increasingly realize the value of extracting information from the data collected through these systems. In the early stages of a data engineering project, it is important to determine which data is needed, where it comes from, how easily it can be extracted, and what its quality is.

Data sources for common FMCG use cases

The following figure illustrates the common data points used in an effective FMCG data analysis project. Depending on your needs, the list may grow or shrink.

Operational Data: Operational data is collected during daily operations and is probably the most critical data for the organization.

Competitor Data: Monitoring competitor prices and positioning helps identify opportunities and potential markets to explore.

Marketing Data: This is data collected during marketing campaigns. Combined with other data, it can help identify the strengths and weaknesses of your marketing efforts and the best marketing mix.

Weather Data: Customer buying patterns can depend heavily on the weather. Beyond seasonal sales, weather data can help predict item sales and supplier behaviour.
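
As a rough illustration of combining operational data with weather data, the sketch below joins daily sales to daily weather on date and region. The field names (`date`, `region`, `units_sold`, `max_temp_c`) and the sample rows are hypothetical and only serve to show the idea.

```python
# Hypothetical daily sales records from an operational system.
sales = [
    {"date": "2023-07-01", "region": "north", "product": "ice_cream", "units_sold": 120},
    {"date": "2023-07-02", "region": "north", "product": "ice_cream", "units_sold": 45},
    {"date": "2023-07-01", "region": "south", "product": "ice_cream", "units_sold": 200},
]

# Hypothetical daily weather observations from an external provider.
weather = [
    {"date": "2023-07-01", "region": "north", "max_temp_c": 31.0},
    {"date": "2023-07-02", "region": "north", "max_temp_c": 19.5},
    {"date": "2023-07-01", "region": "south", "max_temp_c": 35.0},
]


def join_sales_with_weather(sales_rows, weather_rows):
    """Left-join sales to weather on (date, region)."""
    weather_by_key = {(w["date"], w["region"]): w for w in weather_rows}
    joined = []
    for row in sales_rows:
        match = weather_by_key.get((row["date"], row["region"]))
        enriched = dict(row)
        # Keep the sale even when no weather observation exists for it.
        enriched["max_temp_c"] = match["max_temp_c"] if match else None
        joined.append(enriched)
    return joined


enriched_sales = join_sales_with_weather(sales, weather)
for row in enriched_sales:
    print(row["date"], row["region"], row["units_sold"], row["max_temp_c"])
```

A left join is used here so that sales without a matching weather observation are kept rather than silently dropped; whether that is the right choice depends on the use case.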


Data Formats

A diverse list of data sources leads to a wider variety of data.
The data may come in several forms and formats, and a data engineering team should be adaptable and flexible when extracting data from heterogeneous formats and sources such as:

Legacy System Data: This data is usually available only in a proprietary structure, so custom connectors must be built to extract it.

Flat Files: JSON, Excel, CSV, and similar files are not hard to parse and organize, but may require careful data exploration to uncover issues. The data could be structured, semi-structured, or completely unstructured.

APIs/Web: Supported by nearly all modern systems.

Semi-structured/Unstructured Data: Data from social media, newspaper articles, blogs, and other public sources has little or no structure. While it is difficult to extract and prepare for consumption, it is useful for gaining insights into customer sentiment and share of voice.
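
One possible way to stay flexible across formats is to register a small reader per format and have every reader return the same shape (a list of dictionaries). The sketch below does this for CSV and JSON using only the Python standard library. The `READERS` registry, the `extract` function, and the sample `sku`/`price` data are assumptions made for illustration, not part of any particular tool.

```python
import csv
import io
import json


def read_csv_records(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of objects")
    return data


# Hypothetical registry mapping a format name to its reader function.
READERS = {"csv": read_csv_records, "json": read_json_records}


def extract(fmt, text):
    """Dispatch to the reader registered for the given format."""
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader registered for format: {fmt}") from None
    return reader(text)


csv_text = "sku,price\nA100,2.50\nB200,4.00\n"
json_text = '[{"sku": "C300", "price": 1.75}]'

# Both sources end up as one list of dicts with the same keys.
records = extract("csv", csv_text) + extract("json", json_text)
print(records)
```

Note that the CSV reader returns every value as a string while the JSON reader returns numbers, so the types still differ at this point. Harmonizing them is a cleaning step that belongs later in the pipeline.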

Turn raw data into a unified data model

To make the best use of data, it should be converted into a common, unified model that every team using it understands. For this purpose, exploratory data analysis is carried out to identify attributes, value ranges, outliers, data quality problems, and so on.

A four-layer data pipeline is well suited to this.


Raw Layer: Data extracted from the source system lands here and is kept in its raw form.

Staging Layer: Raw data is cleaned and converted to the appropriate data types and formats. Here, column names can be changed to a standard format that all parties understand. The Staging Layer has the same content as the Raw Layer; the only difference is standardization.

Intermediate Layer: Here, business logic is applied to the standardized, clean data. An intermediate transformation might include combining multiple data sets and checking whether the data meets the defined business constraints. The intermediate layer of one data pipeline can be used by another data pipeline.

Consumption Layer: The output of all data engineering activities is now ready for consumption and analysis by analysts and data scientists. Data is formatted for stable and agile consumption. This is the only layer visible to parties other than the data engineer. In addition to a data model, it is beneficial to provide consumers with a data dictionary.
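
The four layers can be sketched as a chain of small functions, each taking the output of the previous one. This is a minimal in-memory illustration, assuming hypothetical source columns (`CUST_ID`, `SalesAmt`, `SaleDt`), a made-up business rule (no negative sales amounts), and a simple per-customer total as the consumption output. A real pipeline would persist each layer rather than keep it in memory.

```python
# Raw layer: hypothetical extract, kept exactly as delivered by the source.
raw_layer = [
    {"CUST_ID": "17", "SalesAmt": " 120.50 ", "SaleDt": "2023-07-01"},
    {"CUST_ID": "17", "SalesAmt": "80.00", "SaleDt": "2023-07-02"},
    {"CUST_ID": "42", "SalesAmt": "-5.00", "SaleDt": "2023-07-02"},
]

# Assumed mapping from source column names to the agreed standard names.
COLUMN_MAP = {
    "CUST_ID": "customer_id",
    "SalesAmt": "sales_amount",
    "SaleDt": "sale_date",
}


def to_staging(rows):
    """Rename columns, trim whitespace, and cast types; no rows are dropped."""
    staged = []
    for row in rows:
        renamed = {COLUMN_MAP[name]: value.strip() for name, value in row.items()}
        renamed["customer_id"] = int(renamed["customer_id"])
        renamed["sales_amount"] = float(renamed["sales_amount"])
        staged.append(renamed)
    return staged


def to_intermediate(rows):
    """Apply a sample business constraint: sales amounts must not be negative."""
    return [row for row in rows if row["sales_amount"] >= 0]


def to_consumption(rows):
    """Aggregate total sales per customer for analysts and data scientists."""
    totals = {}
    for row in rows:
        key = row["customer_id"]
        totals[key] = totals.get(key, 0.0) + row["sales_amount"]
    return [{"customer_id": c, "total_sales": t} for c, t in sorted(totals.items())]


staging_layer = to_staging(raw_layer)
intermediate_layer = to_intermediate(staging_layer)
consumption_layer = to_consumption(intermediate_layer)
print(consumption_layer)
```

Keeping each layer as a separate step means the staging output still contains all three rows, so a rejected record can be traced back even though it never reaches the consumption layer.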

Final data formats

Managing the four Vs of big data, Volume, Velocity, Veracity, and Variety (The 4 Vs of Big Data, n.d.), requires modern data storage and advanced analytical techniques. Data storage formats must support robust processing and efficient consumption using distributed computing. Good examples are columnar formats such as Parquet on distributed storage such as HDFS or a cloud provider’s blob storage service. Databricks (also available on Microsoft Azure and Amazon AWS) introduced Delta Lake for Apache Spark workloads. It is an open-source storage layer that sits on top of Parquet and adds benefits such as ACID properties and time travel (which allows you to restore data to a previous state).
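
To give a feel for why a columnar layout suits analytical queries, the sketch below pivots a small row-oriented table into a column-oriented one and aggregates a single column. It is only a pure-Python illustration of the layout idea with made-up data. It does not show how Parquet is implemented and does not use any Parquet, Spark, or Delta Lake API.

```python
# Hypothetical table in a row-oriented layout: one dict per record.
rows = [
    {"sku": "A100", "region": "north", "units": 12},
    {"sku": "B200", "region": "south", "units": 7},
    {"sku": "A100", "region": "south", "units": 5},
]


def to_columnar(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)

# An aggregate over one column only needs that column's values,
# without touching "sku" or "region" at all.
total_units = sum(columns["units"])
print(total_units)
```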


Data governance

As data becomes the new currency of enterprises, processes, policies, roles, and standards must be in place from the very beginning of a data analytics project to ensure data quality and security. Roles such as Data Steward are defined to operationalize the organization’s data governance strategy. Data governance primarily ensures accountability, regulatory compliance, quality, security, reliability, and consistency of data. Data engineers have to be aware of data governance strategies to build secure, value-driven data pipelines.


Data masking


Data engineers should make sure that only the necessary data is extracted and provided to consumers (e.g., data scientists). No more, no less. Any personally identifiable information (PII) must be withheld unless you have the customer’s consent. With the introduction of regulations such as GDPR in Europe (GDPR meets its first challenge: Facebook –, 2020), companies that collect customer data have a legal obligation not to misuse it.

Some companies use loyalty programs to track customers and their buying behaviour. Companies also keep information about suppliers. This is a valuable data asset and should be checked carefully for any personally identifiable information.

Personally identifiable information can be masked in several ways. Generating a surrogate key in place of the real customer ID and encrypting a column that contains personally identifiable information are common techniques.
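
The sketch below illustrates the surrogate key idea on hypothetical loyalty-program rows. For the PII column it uses a keyed hash (HMAC-SHA256) from the Python standard library as a stand-in: this is one-way pseudonymization, not encryption, so the original value cannot be recovered from it. The `SurrogateKeyMap` class, the `pseudonymize` function, the sample rows, and the hard-coded secret are all assumptions for illustration.

```python
import hashlib
import hmac
import itertools


class SurrogateKeyMap:
    """Assign a stable, meaningless integer key to each real customer ID."""

    def __init__(self):
        self._keys = {}
        self._counter = itertools.count(1)

    def key_for(self, real_id):
        # Reuse the existing key so the same customer always maps to one key.
        if real_id not in self._keys:
            self._keys[real_id] = next(self._counter)
        return self._keys[real_id]


def pseudonymize(value, secret):
    """Replace a PII value with its HMAC-SHA256 hex digest."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder only: a real secret would come from a managed secret store.
SECRET = b"replace-with-a-managed-secret"

# Hypothetical loyalty-program rows containing PII.
customers = [
    {"customer_id": "NIC-883412", "email": "a@example.com", "basket_value": 42.0},
    {"customer_id": "NIC-104977", "email": "b@example.com", "basket_value": 13.5},
    {"customer_id": "NIC-883412", "email": "a@example.com", "basket_value": 8.0},
]

key_map = SurrogateKeyMap()
masked = [
    {
        "customer_key": key_map.key_for(c["customer_id"]),
        "email_hash": pseudonymize(c["email"], SECRET),
        "basket_value": c["basket_value"],
    }
    for c in customers
]
print(masked)
```

The same customer gets the same key and the same hash on every row, so consumers can still group and join by customer without seeing the real identifier. The mapping from real IDs to surrogate keys is itself sensitive and would need to be stored separately with restricted access.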


Challenges in building a data engineering pipeline

It is a challenge for technical staff to understand business functions and map data to them. Good communication channels and frequent discussions with business units, managers, translators, and data owners should be planned to understand the nature of the data.

Before building a pipeline, data engineers must understand use case outcomes by working closely with use case owners, delivery managers, and data scientists.

To overcome technical challenges, a data engineer needs architectural knowledge and expertise in the tools and techniques for data extraction and data analysis, in order to develop efficient data models with the business capabilities required for analysis.

Let us help you


A retail data engineering project must cope with the four Vs because it needs to integrate heterogeneous data from numerous sources. Selecting the right input data sets, defining the ETL pipeline, and deciding on the storage format (and location) are essential to the success of a data analysis project.
To deal with these challenges, a data engineer must adhere to the data governance strategy and provide the optimal data model for the analytics use case.

The data engineering pipeline must evolve as data sources and consumer needs change. Therefore, adopting an iteratively developed, layered pipeline architecture is a major strategic decision for success.



