Towards Big Data
Today, the winners in business competition are those who assess risks faster and more accurately, test hypotheses, and launch new products based on data. Retail, with its hundreds of thousands of customers, points of sale, and merchandise items, is a real proving ground for data-driven technology adoption, and fuel retail is no exception.
Data processing technologies did not appear yesterday. But before arriving at sophisticated systems for working with big data, companies go through a phased process of technological maturation. The first steps involve automating reporting and creating a single version of the truth for data-driven decision-making. Next comes combining data from many sources to build flexible business intelligence on specialized Business Intelligence (BI) products. When the data grows large, arrives in diverse structures, and must be processed at different speeds for different tasks, the stage of implementing Big Data technologies begins.
The Data Lake is a basic element of the Big Data architecture: a system that collects large amounts of raw data from many sources. Data is received and stored in the lake so that it can be used ad hoc in the future, on demand. What distinguishes a lake is that it can scale to extreme amounts of data at low cost while providing instant access to any data for all users. The Data Warehouse, whose role is to be a reliable, trusted source and provider of data for a wide variety of users and systems, stores and transforms the valuable structured data extracted from the lake. Building lakes and warehouses is resource-intensive, but it creates an environment for robust processing of large data volumes and more complex algorithms.
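The division of labor between the lake and the warehouse can be sketched in a few lines. This is a minimal illustration, not the company's implementation: the raw events land in the lake untouched, and only validated, structured fields are loaded onward. All names (`land_raw_event`, `load_to_warehouse`, the `station_id`/`liters` fields) are hypothetical.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def land_raw_event(lake_dir: Path, source: str, event: dict) -> Path:
    """Append a raw event, exactly as received, to the lake partition for its source."""
    partition = lake_dir / source
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "events.jsonl"
    with out.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return out

def load_to_warehouse(lake_file: Path) -> list:
    """Extract only the validated, structured fields the warehouse needs."""
    rows = []
    with lake_file.open(encoding="utf-8") as f:
        for line in f:
            e = json.loads(line)
            if "station_id" in e and "liters" in e:  # keep only well-formed records
                rows.append((e["station_id"], float(e["liters"])))
    return rows

with TemporaryDirectory() as tmp:
    lake = Path(tmp)
    path = land_raw_event(lake, "pos", {"station_id": "S1", "liters": 40.5})
    land_raw_event(lake, "pos", {"note": "malformed record"})  # stays in the lake only
    print(load_to_warehouse(path))  # → [('S1', 40.5)]
```

The key point the sketch captures: the lake never rejects data on arrival, while the warehouse load step enforces structure, which is what makes it a trusted source.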
Next to the data stores - the lake and the warehouse - sits the Data Science digital laboratory, a specially organized set of tools for building digital models, advanced analysis, and quick hypothesis testing. In these conditions analytics flourishes: data sets turn into models and prototypes, reveal non-random relationships, and give the business new ideas. Models and prototypes that the business trusts are turned into management decisions, economic forecasts, and commercial proposals. If a company develops analytics systematically and purposefully, over time its Data Science models become more complex and capable of making smart recommendations and, in some situations, of deciding instead of a person. This stage is called the implementation of artificial intelligence (AI). Today, artificial intelligence is a key growth driver for many business sectors.
It's all about intelligence
Thanks to the introduction of artificial intelligence, the world economy could grow by $15.7 trillion by 2030, according to a global study on artificial intelligence by PricewaterhouseCoopers. An analysis of over 300 examples of AI use showed that the retail and technology industries are the most ready to implement the technology today. Sectors that still rarely use the capabilities of artificial intelligence can make a significant leap by turning to it.
Introducing AI is impossible without effective Data Governance, because the use of artificial intelligence means placing complete trust in algorithms and progressively relinquishing manual control over them. The company must learn to catalog the data available to it, control its origin, measure and improve its quality, consolidate change processes, respond to surprises, and make data accessible and equally understandable to all employees. Sophisticated specialized tools exist for these tasks, and they are usually implemented separately. Without a focus on Data Governance, data lakes turn into swamps and become unusable for lack of trust in, and understanding of, the data. According to global statistics, about 90% of data lake implementation projects fail, and about the same share of Data Science initiatives fail due to a lack of data management. Implementing Data Governance tools requires a long time frame and high organizational and technological maturity, and is therefore postponed by many companies.
A regional sales directorate of a local fuel company, responsible for fuel retail, has been developing and applying solutions for reporting automation and flexible business analytics (BI) for over 7 years. By 2017, the accumulated data sets required solutions for Big Data processing and advanced analytics. Realizing what the next stages of the evolution of analytics would require, the company decided to simultaneously implement a platform that includes Big Data and Data Science components and a set of Data Governance tools. The project, called Smart Data Lake, took 2 years to complete. The Global CIO community of IT directors of the US recognized it as the best project of 2019 in the Analytical Solutions and Big Data nomination, noting that so far the ARD analytical platform is the only one of its class in the US.
The platform includes components for processing, storing, and analyzing data: a data lake, a data warehouse, and an advanced analytics laboratory. They work closely with the data management components: a corporate data catalog, a data quality management system, a business glossary, and a data navigation portal covering all stages of processing.
The data catalog can automatically extract information from the lake, the data warehouse, and the BI system of the PDD, as well as from the data processing procedures in these systems. Thanks to this, it contains a technical map of all available data and of its flows of origin (Data Lineage). Internal experts enrich the objects in the catalog with descriptions and various characteristics, so the search engines built into the catalog allow very flexible searches for the data you need and let you evaluate the impact of changes across systems and data models. In addition, the catalog has many specialized functions based on AI algorithms. The experience of US banks confirms that implementing a data catalog speeds up the work of analysts and architects by 40%. And research by Joseph M. Blumenthal says that with an enterprise data catalog, the ROI on data and analytics is realized three times faster. Filling a catalog is a long process with many participants - according to world practice, it takes several years. The catalog now contains about 1000 data objects, and this is only the beginning of a long journey.
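The two catalog capabilities described above - flexible search over expert-supplied descriptions and impact evaluation along lineage edges - can be sketched with a toy in-memory catalog. Every name here (`CatalogEntry`, the example objects, the field names) is illustrative, not taken from the actual platform.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One data object in the corporate catalog (structure is illustrative)."""
    name: str
    system: str                                    # e.g. "lake", "warehouse", "BI"
    description: str = ""
    tags: list = field(default_factory=list)       # expert-supplied characteristics
    upstream: list = field(default_factory=list)   # lineage: where the data comes from

def search(catalog, term):
    """Flexible search over names, descriptions, and expert-supplied tags."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)]

def impact(catalog, name):
    """List downstream objects affected if `name` changes, by walking lineage edges."""
    frontier, seen = {name}, set()
    while frontier:
        cur = frontier.pop()
        for e in catalog:
            if cur in e.upstream and e.name not in seen:
                seen.add(e.name)
                frontier.add(e.name)
    return sorted(seen)

catalog = [
    CatalogEntry("raw_transactions", "lake"),
    CatalogEntry("fuel_sales", "warehouse", "Cleaned fuel sales", ["sales"],
                 upstream=["raw_transactions"]),
    CatalogEntry("sales_dashboard", "BI", upstream=["fuel_sales"]),
]
print(impact(catalog, "raw_transactions"))  # → ['fuel_sales', 'sales_dashboard']
```

The impact walk is what lets architects see which dashboards break before touching a source table; a production catalog does the same thing over thousands of objects harvested automatically.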
The data quality management system lets you create quality control rules for the data in the lake and the warehouse, check compliance with these rules immediately, and accumulate statistics on violations. For example, the price per liter of gasoline at a US gas station today cannot fall outside the range of 3 to 3.2 USD. If it does, the data is incorrect, and someone should be alerted immediately. Together with users, thousands of such rules are developed, under which the data earns maximum trust in its quality. The system allows you to manage a single register of all data quality rules, see its contradictions and gaps, set metrics, and track detailed statistics. World practice suggests that introducing such a system allows an organization to improve data quality by 60% within 3 years, which significantly reduces operating costs and risks.
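The mechanics of such a rules register - predicates checked against every record, violations accumulated into statistics - can be shown with the price-range example from the text. This is a minimal sketch; the rule names, record fields, and second rule are assumptions for illustration.

```python
from collections import Counter

# Illustrative rules register: name -> (predicate, human-readable description)
RULES = {
    "price_range": (lambda r: 3.0 <= r["price_per_liter"] <= 3.2,
                    "price per liter must stay within 3.0-3.2 USD"),
    "liters_positive": (lambda r: r["liters"] > 0,
                        "dispensed volume must be positive"),
}

def check(records):
    """Run every rule over every record and accumulate violation statistics."""
    violations = Counter()
    for r in records:
        for name, (predicate, _) in RULES.items():
            if not predicate(r):
                violations[name] += 1
    return violations

records = [
    {"price_per_liter": 3.1,  "liters": 40.0},
    {"price_per_liter": 9.9,  "liters": 35.0},  # out-of-range price
    {"price_per_liter": 3.05, "liters": -2.0},  # impossible volume
]
print(dict(check(records)))  # → {'price_range': 1, 'liters_positive': 1}
```

Keeping rules in one register, as here, is what makes it possible to spot gaps and contradictions and to chart violation trends over time.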
Our data processing and analysis systems are ready to support complex solutions that extract additional value from the data. The data lake and the data management system are the basis for development, for new management models and new products. Already 50% of all analytical projects and initiatives are implemented on this basis; in 2020 the share will reach 75%. This is our foundation for further innovation and for the development of client ecosystems.
The business glossary and the user portal are the most approachable tools for every employee searching for business data. As with the data catalog, filling the business glossary will take a long time, but all the tools have already been put into operation. If all employees use a single glossary and catalog, this will create a common language and a common culture of working with data, and projects will be implemented much faster.
The idea of the platform as a source of reliable data is very important. It is precisely this property that allows such solutions to reduce the cost of implementing analytics projects. The added value of such a source is that it is self-replenishing: each completed task adds new data to the system, and this data is correct by definition, understandable, and easy to use.
In addition, the platform includes ready-made analytical sandboxes - specially dedicated environments for the safe execution of mathematical algorithms. They are needed for building prototypes and testing hypotheses on data. Sandboxes now become available to specialists within one day, whereas setting one up used to take 2-3 months.
Thanks to the platform, launching new products and solutions will also speed up. Having a wealth of ready, open, and well-described data, collected in one place and properly formatted, allows us to guarantee short delivery times and, by the end of 2020, to reduce them to 2 days.
First, the centralized infrastructure saves on data integration for new analytical projects and initiatives and creates a culture of working with uniform tools across all departments. Second, the availability of quality data in a single vocabulary and format will speed up the introduction of new digital products and improve the quality of decisions. Finally, we have created a base for the full-scale work of the Data Science team, where business cases can be checked much faster, and successful ones can be quickly monetized and turned into regular business tools.
Big Data in practice
The first use cases on the integrated platform were processing the operations of the fuel company's filling station network, processing customer feedback, and calculating segments for client analytics. To build analytical solutions, we integrate data from external sources into the data lake: partner information and information about competitors.
How does the business benefit from the AI platform's solutions? Large volumes of structured, verified data help develop highly personalized offers for customers. For example, the Shell Global gas station network launches a campaign: a customer gets a discount on every liter of G-Drive 100 fuel. Among those who took advantage of the discount, some customers had not refueled at Shell Global stations for a long time, and the discount became an effective incentive to return. Others usually refuel with regular 95-octane gasoline, and thanks to the promotion they did not miss the chance to try premium fuel. A third group of customers regularly refuels with G-Drive 100; they bring no additional profit to the company, because they would have refueled anyway, but now they did it at a discount. A smart data lake makes it possible to analyze consumer preferences and draw conclusions from an enormous amount of data: the regularity of refueling, the time and day of the week, the weather, changes in fuel prices and exchange rates, and the actions of competitors. By analyzing the consumer preferences of each client, the company gains a tool for composing the most customized offer, avoiding the scattershot effect of firing a cannon at sparrows and increasing the effectiveness of campaigns. Only the first two categories of customers will receive the offer to refuel with 100-octane gasoline at a discount as a push notification on their smartphones, while a different, more relevant offer will be made to the third group.
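The three-group split described above can be sketched as a simple segmentation rule: lapsed customers and regular-fuel customers get the discount push, loyal premium customers get something else. The thresholds (90 days of inactivity, 50% premium share) and field names are illustrative assumptions, not the company's real model, which weighs many more signals.

```python
from datetime import date, timedelta

TODAY = date(2020, 3, 1)  # illustrative reference date for the campaign run

def segment(customer):
    """Assign a customer to one of the three campaign groups.

    customer: {"last_visit": date, "premium_share": float in 0..1}
    Thresholds here are hypothetical, chosen only to illustrate the logic.
    """
    if TODAY - customer["last_visit"] > timedelta(days=90):
        return "lapsed"    # the discount works as an incentive to return
    if customer["premium_share"] < 0.5:
        return "upgrade"   # usually buys regular 95-octane fuel
    return "loyal"         # already buys G-Drive 100; needs a different offer

customers = {
    "A": {"last_visit": date(2019, 10, 1), "premium_share": 0.2},
    "B": {"last_visit": date(2020, 2, 20), "premium_share": 0.1},
    "C": {"last_visit": date(2020, 2, 25), "premium_share": 0.9},
}
targets = [cid for cid, c in customers.items() if segment(c) != "loyal"]
print(targets)  # → ['A', 'B']  (only these two receive the discount push)
```

Excluding the loyal segment from the discount is exactly the point of the analysis: the promotion budget is spent only where it can change behavior.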
Each subsequent marketing campaign will take less and less time to launch, because the initial data will already have been loaded into the lake, including the data now received automatically. A smart algorithm is able not only to generate a mathematically ideal proposal for the client but even to predict the need to develop such a proposal - for example, by diagnosing a decrease in fuel volumes pumped at a certain gas station, or by predicting a future change in demand based on events recurring over the past several years.
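Diagnosing a decrease in pumping at a station can be done, in its simplest form, by comparing a recent window of daily volumes against the earlier baseline. This is a deliberately naive heuristic to show the idea, not the platform's actual forecasting model; the window and threshold values are assumptions.

```python
def flag_volume_drop(daily_liters, window=7, threshold=0.8):
    """Flag a station whose average volume over the last `window` days fell
    below `threshold` times its earlier baseline (illustrative heuristic)."""
    if len(daily_liters) < 2 * window:
        return False  # not enough history to compare against a baseline
    baseline = sum(daily_liters[:-window]) / (len(daily_liters) - window)
    recent = sum(daily_liters[-window:]) / window
    return recent < threshold * baseline

history = [1000.0] * 21 + [700.0] * 7  # three stable weeks, then a sharp drop
print(flag_volume_drop(history))  # → True
```

A real system would also correct for weekday, weather, and price effects before flagging, so that an ordinary quiet Sunday is not mistaken for a lost customer base.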