data integration, it is estimated that 50 to 80 percent
of a data scientist’s time is spent on collecting and
organizing data for analysis. 1 If we consider that in
any complex organization, data governance is also
essential for tasks other than data analytics, we can
conclude that the challenge of identifying, gathering, retaining, and providing access to all relevant
data for the business at an acceptable cost is huge
(Bernstein and Haas 2008).
The aforementioned considerations are valid even
for very simple information systems, as the following
example scenario illustrates. Figure 1 shows a portion
of Cust_table, a relational table contained in a real
information system. The table maintains information
about the customers of an organization, where
each row stores data about a single customer. The first
column contains the customer code, with the proviso
that if the code is positive, then the record refers
to an ordinary customer, and if it is negative, to a special
customer. If the code is nonnumeric, then the
customer type is unknown. Columns 2 and 3 specify
the time interval of validity for the record. ID_GROUP
indicates the group the customer belongs to (if the
value of FLAG_CP is “S,” then the customer is the
leader of the group; if FLAG_CF is “S,” then the customer
is the controller of the group). FATTURATO is
the annual turnover (but the value is valid only if
FLAG_FATT is “S”). Obviously, each notion mentioned
previously (like “special,” “ordinary,” “group,”
“leader,” etc.) has a specific meaning in the organization,
and understanding such meaning is crucial if
one wants to correctly access or manage the data in
the table and extract useful information out of it.
Similar rules hold for the other 47 columns that, for
lack of space, are not shown in the figure.
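
To make the encoding rules above concrete, the following minimal Python sketch decodes one row under the conventions just described. It is only an illustration: the names CustRow, customer_type, turnover, and describe are hypothetical and not part of the information system, the decimal-comma format of FATTURATO is assumed from the values shown in Figure 1, and the treatment of a zero code is an assumption, since the rules do not cover it.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class CustRow:
    """One raw row, restricted to the columns discussed in the text (hypothetical type)."""
    cuc: str        # customer code, kept as text because it may be nonnumeric
    id_group: str   # group the customer belongs to
    flag_cp: str    # "S" means the customer is the leader of the group
    flag_cf: str    # "S" means the customer is the controller of the group
    flag_fatt: str  # "S" means FATTURATO holds a valid value
    fatturato: str  # annual turnover as stored, e.g. "195000,00"


def customer_type(cuc: str) -> str:
    """Positive code: ordinary; negative code: special; nonnumeric code: unknown."""
    try:
        code = int(cuc.strip())
    except ValueError:
        return "unknown"
    if code > 0:
        return "ordinary"
    if code < 0:
        return "special"
    # Assumption: the text does not say how a zero code should be read.
    return "unknown"


def turnover(row: CustRow) -> Optional[Decimal]:
    """Return the annual turnover, or None when FLAG_FATT marks it as not valid."""
    if row.flag_fatt != "S":
        return None
    # Assumption: amounts use a decimal comma, as in the values of Figure 1.
    return Decimal(row.fatturato.replace(",", "."))


def describe(row: CustRow) -> dict:
    """Translate the raw encodings of one row into explicit, named facts."""
    return {
        "type": customer_type(row.cuc),
        "group": row.id_group,
        "is_group_leader": row.flag_cp == "S",
        "is_group_controller": row.flag_cf == "S",
        "turnover": turnover(row),
    }


if __name__ == "__main__":
    # Illustrative values only; this is not a row copied from Figure 1.
    sample = CustRow("-452901", "92770", "S", "N", "S", "392000,00")
    print(describe(sample))
```

None of these rules is visible in the schema itself: each one has to be known, and reapplied consistently, by whoever reads the table.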
Those who have experience with complex databases, or databases that are part of large information
systems, will not be surprised to see such complexity
in a single data structure. Now, think of a database
with many tables of this kind, and try to imagine a
poor client accessing such tables for data analysis.
The problem is even more severe if one considers that
information systems in the real world use different
(often many) heterogeneous data sources, both internal and external to the organization. This problem
raises many issues, and I would like to go
into more detail on some of them.
Accessing and Querying Data
As observed by De Giacomo et al. (2018), although
the initial design of a collection of data sources might
be adequate, corrective maintenance actions tend to
reshape these sources into a form that diverges from
the original structure. Also, they are often subject to
changes so as to adapt to specific, application-dependent needs. Analogously, applications are frequently
modified to accommodate new requirements, and
guaranteeing their seamless usage within the organization is costly. The result is that the data stored in
different sources and the processes operating over
them tend to be redundant, mutually inconsistent,
and obscure for large classes of users. So, query formulation often requires interacting with IT experts
who know where the data are and what they mean in
Figure 1. Fragment of the Cust_table Table.
Columns: CUC, TS_START, TS_END, ID_GRUP, FLAG_CP, FLAG_FATT, FATTURATO, FLAG_CF.