
Data as an Enabler part 3: Bringing clarity to data chaos with a data catalog

Without bringing relevant data under control, organizations will struggle to understand and harness business information, writes Tomi Mustonen in the third blog post of the Data as an Enabler series.

Tomi Mustonen / August 04, 2023

Imagine how easy it is to search for information on the internet.

Say you want to search for information about a new TV show that your friend recommended. You just type the name into a search engine (or a generative AI solution), and you instantly get more information and, most likely, a link to a service where you can watch the show.

Now imagine that you have started a job at a new company and want to find financial information or other important facts. You will most likely search the intranet; some might try a specific IT system, and others look for a colleague to ask. When you manage to find the information, the next question you might have is how the figures are calculated and what they mean. This leads to another round of searching and consumes more of your valuable time.

Similarly, data professionals face time-consuming tasks when they start creating new reports or other data products that utilize existing data from the organization's databases. Finding trustworthy data can be difficult and laborious, as you need to find the correct data sources, get access to them, and often clean up the data to ensure that it is accurate, consistent, and fit for use.

According to studies, data professionals still spend around 38% of their time cleaning and curating data (Source 1). Although this number has decreased from 45% in 2020 (Source 2), it is still a significant amount of time that data experts could use for more productive tasks. And this covers only the professionals who use data, not the ones creating data in their daily work. Users of operational business systems often don’t know what data should be entered into which system or in what format, and they lack knowledge of where the data they enter will be used. This can lead to careless data entry. We have seen that a data catalog is one of the key solutions for avoiding this.

 

What is a data catalog and what do we need it for?

Key information that organizations have includes:

- Structured (tabular) data, usually stored in traditional databases, produced and mainly used in operational business systems.

- Unstructured data, including documents, webpages, emails, social media content, mobile data, images, audio, and video.

In addition, the following can be seen as key information assets: reports, data visualizations, and dashboards; machine learning models; integrations between systems and databases; external data sources, publicly available data, and data purchased from third parties (Source 3).

Why do we need a data catalog to gain control over these key information assets? One description of a data catalog is an organized inventory of the relevant data in the organization (Sources 4 & 5). One analogy for a data catalog is the card catalog used in libraries to register all bibliographic items in a central location. Each card in a library catalog contains key information about a single book, such as the author, style, and unique identifier of the book. The card catalog, whether web-based or physical, makes it easier to search for and find the book you are looking for.

One should not try to include all data objects in the scope of a data catalog. Instead, the organization should slice the data elephant and concentrate on the most relevant data. Relevant data can be defined as the valuable information that an organization uses to support and operate its day-to-day business, and the information it uses to make decisions and forecasts.

In short, a data catalog collects metadata, “data about data”: data, stored in some form of IT solution, that improves the business and technical understanding of data. The following figure illustrates the key contents of a data catalog.


Figure 1. Key contents of a data catalog


Let’s explain the key contents:

  • Business glossary: Definitions of the relevant/important business terms, e.g., Product, Customer, Supplier. Terms that are used across the organization and are often represented with data.
  • Data dictionary: Technical descriptions of physical data storage solutions storing relevant data, e.g., databases. Documentation on tables, columns, data types, keys, etc.
  • Data ownership: Information about the owners and key stakeholders for specific data objects. A data owner is accountable for a particular data object and cooperates closely with the process owners whose processes use that data object. Data stakeholders use, affect, or are affected by data objects.
  • Relationships: Relationships between business terms and physical data objects, e.g., database tables. Also, relationships between physical data objects. Can include relationships of the business processes and data objects.
  • Data lineage: Information regarding the origin of data and how it is transformed between systems and integrations.
  • Data locations: Information about where and how data is physically stored, e.g., in an internal data center, cloud, server, file, or database.
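
To make the list above more concrete, here is a minimal sketch in Python of how these contents could fit together in one catalog record. It is an illustration only and does not describe any particular data catalog product: the class names (CatalogEntry, DataCatalog), the field names, and the sample tables and owners are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One catalog record: metadata about a single physical data object."""
    name: str                # physical data object, e.g. a database table
    business_term: str       # relationship to a business glossary term
    definition: str          # business definition of the object
    columns: dict[str, str]  # data dictionary: column name -> data type
    owner: str               # data ownership: the accountable data owner
    location: str            # data location: where the data is stored
    upstream: list[str] = field(default_factory=list)  # lineage: direct sources


class DataCatalog:
    """A toy in-memory catalog (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Names of entries whose name, term, or definition mention the keyword."""
        needle = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.business_term.lower()
            or needle in e.definition.lower()
        )

    def lineage(self, name: str) -> list[str]:
        """All upstream sources of an entry, nearest first."""
        seen: list[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen


# Sample (made-up) metadata for three data objects.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="crm.customers", business_term="Customer",
    definition="A party that has bought at least one product.",
    columns={"customer_id": "INTEGER", "name": "VARCHAR"},
    owner="Head of Sales", location="CRM database",
))
catalog.register(CatalogEntry(
    name="dwh.dim_customer", business_term="Customer",
    definition="Cleaned customer dimension used for reporting.",
    columns={"customer_key": "INTEGER", "name": "VARCHAR"},
    owner="Head of Sales", location="Cloud data warehouse",
    upstream=["crm.customers"],
))
catalog.register(CatalogEntry(
    name="reports.sales_dashboard", business_term="Sales",
    definition="Monthly revenue per segment.",
    columns={"segment": "VARCHAR", "revenue": "DECIMAL"},
    owner="Finance Controller", location="BI tool",
    upstream=["dwh.dim_customer"],
))

print(catalog.search("customer"))
print(catalog.lineage("reports.sales_dashboard"))
```

In this sketch, searching for "customer" finds the two customer tables, and the lineage of the sales dashboard leads back through the warehouse table to the CRM source. Real catalog tools typically harvest this kind of metadata automatically from the source systems rather than relying on manual registration.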

To summarize, a data catalog:

  1. Creates a single central repository that connects different business functions and personnel.
  2. Helps data professionals in collecting, organizing, and enriching metadata so that data is discoverable and governed.
  3. Creates a central documentation repository of the company’s relevant data objects and therefore improves the searchability, accessibility, and understanding of data.

A data catalog answers questions and concerns around data and gives a comprehensive view of the most important data objects for business operations.




Figure 2. A data catalog answers questions and concerns about data.


Business benefits of a data catalog

Why should one invest in implementing such a solution then?

The key benefits of the data catalog can be summarized as follows:

  1. A data catalog saves employees who work with data valuable time by helping them discover relevant data. This, in turn, increases productivity and efficiency.
  2. By providing a centralized location for all relevant data, a data catalog enables collaboration among data stakeholders, including data owners, consumers, and stewards. Besides providing a common understanding of data definitions and standards, it leads to more informed decisions and increased confidence in outcomes. Collaboration can also foster new ideas and increase innovativeness.
  3. A data catalog is an essential component in data governance, which sets policies and practices to ensure consistent data across the organization. Governed data leads to numerous benefits, including improved data quality characterized by reduced errors in data input and enhanced accuracy. Additionally, data governance fosters trust in the data, and thereby facilitates data-driven decision-making.

Also, by providing a centralized place for documentation, a data catalog can decrease the lead time of data initiatives. Today, most development initiatives need, create, and/or consume data, and these initiatives benefit from a documented and approved data landscape. When existing data, definitions, descriptions, owners, and locations are known, it’s easier to evaluate the gap between the current and future state and the development required for initiatives.

Once different data objects are documented and known, it is easier and faster to combine different data sets and implement integrations between systems. A data catalog also improves interoperability between organizations. When data is being shared outside the organization, whether sold or as open data, having well-defined definitions and proper documentation becomes crucial. Equally important is the understanding of data received from external sources.

 

Finding a “fit for use” data catalog

A data catalog is an important enabler for efficient data discovery and trust in data. According to studies, the time spent locating data and reports can be reduced by 50% or even more (Source 6).

However, implementing a data catalog incurs costs, as it requires initial investment in its development and ongoing attention for maintenance. It is a technical solution used by humans and not a silver bullet fixing all data issues. Proper data management practices, as well as processes to curate the catalog itself, are needed.

There are several mature and sophisticated data catalog solutions on the market, offering a wide array of features but also potentially carrying a high price tag. Data executives should thoroughly assess the specific requirements of their organization. If the requirements are straightforward, then opting for a simple solution is recommended. The process of finding and utilizing accurate data to address critical business challenges should be as effortless as using your preferred e-commerce or search platform to discover and purchase a product. A data catalog is a valuable tool that can assist in achieving this objective.

If you want to get your data in order and say goodbye to data chaos, do not hesitate to reach out. Our team is ready to help!


References

1: State of Data Science report, 2022 | Anaconda
2: State of Data Science report, 2020 | Anaconda
3: Data Catalog | IBM
4: What is a Data Catalog? | Alation
5: Data Catalog | Oracle
6: The Total Economic Impact Of The Alation Data Catalog | Alation

Tomi Mustonen
Senior Data Management Advisor, Tietoevry Create

Tomi has experience in various areas of data-driven business, including data management models and data modeling. He enjoys collaboration with business stakeholders, linking business processes to enterprise architecture and data management.
