Data Quality: Standards for Better AI and Greater Efficiency
Data is the fuel for artificial intelligence (AI) and has long been a currency, especially in research. Its quality, however, often leaves much to be desired. Prof. Dr Felix Naumann from the Hasso Plattner Institute (HPI) knows this only too well: his research focuses on data quality, including in the AI sector. A conversation about bad files, bad data and wolves in the snow.
Machine Learning Needs Good Training Data
Prof. Dr Naumann, the term data quality seems very clear-cut at first glance. But when you take a closer look, you quickly ask yourself the following: “What exactly is meant by that?”
Strictly speaking, it is necessary to distinguish between data quality and information quality. In practice, however, the two often overlap. As a computer scientist, I would associate the first term with tangible technical criteria. It also makes sense to define how data should not be: faulty, outdated, incomplete or simply badly formatted. If we expand the term to include information quality, many more questions arise: How well is data protected? How much does it cost? How readily available is it? What about its credibility? Even with a broad definition of data quality, there are probably 20 to 30 further dimensions that describe it in greater detail.
You once spoke of “bad files” and “bad data”...
It goes without saying that these are rather sensational terms. But basically, what I mean by bad files is this: Data first comes as a file – as a CSV, as an Excel spreadsheet and so on. Files can already harbour problems. They may be incorrectly formatted or inconsistently structured, for example – if the number of columns suddenly changes in a CSV file or a new table appears at the bottom. Sometimes there are even comments inserted or table titles that do not belong there. Bad files make it difficult or even impossible to load data into a system in the first place. As boring as “badly formatted files” may sound, they are a huge – and unfortunately widespread – problem for data scientists. This is especially true in data lake scenarios. “Data lake” here means a data store in which – unlike in normal databases – data from many different sources and different formats can be combined in their raw form.
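The "bad file" symptoms described above can be spotted mechanically before any data is loaded. The following is a minimal sketch; the CSV snippet and the validation function are invented for illustration, not taken from any real pipeline:

```python
import csv
import io

# Invented CSV snippet showing typical "bad file" symptoms:
# a row with too few columns, a stray comment line, a row with too many.
raw = """id,name,price
1,Widget,9.99
2,Gadget
# internal note: prices updated
3,Gizmo,4.50,EXTRA
"""

def find_bad_rows(text, delimiter=","):
    """Return (line_number, column_count) for rows whose column count
    deviates from that of the header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    expected = len(rows[0])
    return [(lineno, len(row))
            for lineno, row in enumerate(rows, start=1)
            if lineno > 1 and len(row) != expected]

bad = find_bad_rows(raw)
# Lines 3, 4 and 5 all deviate from the 3-column header.
```

A check like this catches formatting problems at the file level, before a broken row silently corrupts everything loaded after it.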
What happens once the files are in the system?
Then I can at least already read the data in a structured way. But it may be wrong, outdated, incomplete or not diverse enough. Then I would speak of bad data. A common problem is duplicates – multiple records describing the same product or person. Undetected, they can cause enormous damage. What bank wants to give a loan to the same person twice?
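Detecting such duplicates is a research field of its own, because the records rarely match exactly. A minimal sketch of the idea, using string similarity over invented customer records (the threshold and the records are assumptions for illustration):

```python
from difflib import SequenceMatcher

# Invented records: the first two describe the same person
# with slightly different spellings -- a classic duplicate.
records = [
    {"name": "Jonathan Smith", "city": "Dublin"},
    {"name": "Jonathon Smyth", "city": "Dublin"},
    {"name": "Maria Gonzalez", "city": "Cork"},
]

def similarity(a, b):
    """Crude record similarity: mean string similarity over shared fields."""
    scores = [SequenceMatcher(None, a[k].lower(), b[k].lower()).ratio()
              for k in a.keys() & b.keys()]
    return sum(scores) / len(scores)

def find_duplicates(recs, threshold=0.85):
    """Pairwise comparison -- fine for a sketch, far too slow for
    millions of rows, where blocking techniques would be needed."""
    return [(i, j)
            for i in range(len(recs))
            for j in range(i + 1, len(recs))
            if similarity(recs[i], recs[j]) >= threshold]

dupes = find_duplicates(records)
```

Real deduplication systems add blocking, domain-specific similarity measures and clustering on top of this basic pairwise idea.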
A nice example. Surely there are other, less hypothetical...
A great many – and the double-issued loan is by no means hypothetical. A classic example of an erroneous data set comes from Ireland. A Polish driver there by the name of Prawo Jazdy was up to no good. There was hardly a road traffic regulation he had not violated. Although the police caught him again and again, he always escaped punishment – because no man of that name lived at any of the addresses he gave. To his credit, it must be said that he couldn’t help it at all. In fact, he hadn’t even lied about his address. Prawo Jazdy is not a name at all, but the Polish term for “driving licence”. It ended up in the system as a first name and surname so many times that “Prawo Jazdy” became the most reckless Polish driver in Ireland. The hunt for him probably kept the investigators very busy.
You deal with AI-specific dimensions of data quality. Are there other criteria hidden here?
These criteria are only just becoming the subject of research, but I can already give a few examples. Data protection is taking on a much greater role: it has always been relevant, of course, but artificial intelligence has raised many new questions. The situation is similar with other categories. The explainability of data and of the models built on it is also an AI-specific dimension, as is the diversity of data. Closely related to this is the issue of bias: is the diversity of the data evenly represented, or is certain data overrepresented? We also deal with the complex of liability, because all too often it isn’t sufficiently clarified who owns the data and who is liable if it isn’t correct.
What impact does bad data have in the AI context?
If I feed a machine learning model the wrong data, the saying “garbage in, garbage out” usually holds true. However, even omitting data is enough to produce poor results. AI-assisted autonomous driving, for example, suffers from the fact that many situations have never been trained: if, due to incomplete training data, the model has never seen a bicycle in the rain, the car will not be able to respond appropriately. One famous case from research concerns image recognition. The issue there was not incompleteness but a lack of diversity. An AI was supposed to learn to distinguish wolves from dogs. During training, it recognised photos of wolves very reliably, yet in practice it made many errors. Why? Because snow was visible in all the training pictures with wolves. The model never learned what distinguishes wolves from dogs – it only recognised the snow. A wolf could only be where snow was visible.
How can we take countermeasures to prevent data quality from becoming a problem for AI applications?
Until a few years ago, the strategy was to improve the models – to train them to work with dirty data. There are all kinds of tricks for this, in fact. In the meantime, however, research has shifted towards improving the quality of the training data itself. The goal is to clean up this data and feed the AI high-quality input.
That sounds like the much more sensible way ...
Basically, yes. But this path has its difficulties. In the case of the AI that was supposed to distinguish wolves from dogs, the photos themselves were of good quality; nevertheless, the researchers made unconscious errors. You also have to stay aware of the relatively new issue of bias. In the USA, there is the famous COMPAS data set. With it, artificial intelligence is supposed to predict how likely people who might be released from prison on parole are to reoffend – a kind of digital decision-making aid for judges. The training data consists of past decisions. In those cases, however, people with dark skin often fared badly, disadvantaged by judicial prejudice. As a result, the training data is problematic, and the AI runs the risk of making equally prejudiced decisions.
So are there standards for good training data?
The Artificial Intelligence Act and the Data Act of the European Union contain some requirements, but no technical ones. This is exactly where our research comes in. We want to formulate technical standards – while always being aware that perfect data will never exist.
Your research is conducted as part of the KITQAR project, which receives funding from the Denkfabrik Digital Working Society at the Federal Ministry of Labour and Social Affairs (BMAS). What exactly does this acronym stand for?
KITQAR stands for “KI-Test- und -Trainingsdatenqualität in der digitalen Arbeitsgesellschaft” (AI test and training data quality in the digital working society) and unites the fields of computer science, jurisprudence and ethics, as well as the Verband der Elektrotechnik Elektronik und Informationstechnik e. V. (Association of Electrical, Electronic and Information Technologies), or VDE for short, as the standardisation organisation. Together, we first want to develop a comprehensive definition of data quality in the AI context – drawing on the different disciplines represented in the project. After all, experts from the field of IT have different demands regarding data than experts from the fields of law or philosophy. Once this definition is in place, the next step is to develop quality standards for test and training data. By this, I mean checklists or guidelines that developers and users of artificial intelligence can use to check whether their data is appropriate. Some guidelines will be technically verifiable, while others, especially the ethical ones, will need to be assessed by external experts.
Which information-related topics need to be regulated by the quality standards?
The first topic is diversity: are all values represented in all relevant dimensions? Take the dimension “human being”: standards are needed here to ensure that all relevant groups are adequately represented in AI training data involving people. The second topic concerns standards for the completeness of data – a genuinely difficult undertaking, since what counts as “complete” is unclear. The third topic involves quality specifications for correctness. Note that all three areas have to do with defining and measuring data quality – not yet with cleansing it.
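The first of these topics, measuring whether all groups are adequately represented, can be illustrated with a few lines of code. This is a deliberately simple sketch: the data, the grouping attribute and the 10 % minimum share are invented assumptions, not a proposed standard:

```python
from collections import Counter

# Invented training sample with a single demographic attribute per record.
samples = ["A"] * 70 + ["B"] * 25 + ["C"] * 5

def underrepresented(values, min_share=0.10):
    """Return groups whose share of the data falls below min_share --
    one possible, assumed criterion for (a lack of) diversity."""
    counts = Counter(values)
    total = len(values)
    return sorted(g for g, c in counts.items() if c / total < min_share)

flagged = underrepresented(samples)
# Group "C" makes up only 5 % of the data and is flagged.
```

A real diversity standard would of course have to define which dimensions matter and what "adequate" means per group; the measurement itself, however, can be this mechanical.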
How does such a data cleansing process traditionally work?
It involves many different steps. One central task is replacing missing data. For this, we use what is called imputation, which fills the gaps with plausible estimates. These values may not be the true ones, but they are sensible – an average, for example. Another step is correcting. With spatial data, many things can be fixed quickly: it is relatively easy to check whether a particular postcode actually matches the street and house number. Many errors can also be identified via business rules. Users formulate in advance what must be correct about the data; the technical check then follows.
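The two cleansing steps mentioned above, imputation and business-rule checking, can be sketched in a few lines. The readings, the mean-imputation strategy and the 0–100 validity range are invented for illustration:

```python
from statistics import mean

# Invented measurements with gaps (None). Assumed business rule:
# a valid reading must lie between 0 and 100.
readings = [21.5, None, 23.0, 180.0, None, 22.5]

def impute_mean(values):
    """Mean imputation: fill gaps with the average of the observed
    values -- plausible stand-ins, not necessarily the truth."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def rule_violations(values, low=0.0, high=100.0):
    """Business-rule check: return indices of out-of-range values."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]

clean = impute_mean(readings)
violations = rule_violations(clean)   # the 180.0 reading is flagged
```

Note that the order of steps matters: imputing before checking lets the out-of-range value 180.0 inflate the imputed average, so in practice one would typically validate and correct first, then impute.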
In medicine, great hopes are pinned on real-world data that accumulates in everyday medical practice. Isn’t it precisely here that data quality standards are needed in order to make them really useful?
Absolutely. The usefulness of such data stands and falls with its documentation. And sometimes there is little time for that – after all, the patient, not the paperwork, should be in the foreground. It is therefore all the more important to be able to measure data quality and to deal responsibly with bad data: not pouring it blindly into a machine learning model, but cleaning it up beforehand – or at least handling the possibly erroneous model with care.
How do you actually rate the data trustee model regarding data quality?
A trustee who manages data for different clients and has permission to link them for added value makes a lot of sense to me. Such an entity would undoubtedly be of benefit to data quality.