Artificial intelligence - for sure! AI quality is now verifiable

Fighting myths with standards: Making artificial intelligence verifiably safe is a prerequisite for consumer and industry trust. With the development of such a standard, the German Commission on Electronics and Information Technology (DKE), with the involvement of fortiss, the research institute of the Free State of Bavaria for software-intensive systems, has achieved a breakthrough. Developed under the leadership of fortiss, which also contributed to the content, the VDE-AR-E 2842-61 standard represents an initial, detailed framework for the “design and trustworthiness of autonomous/cognitive systems.” Four volumes have already been completed, with plans to release the last two in the first quarter of 2021. The publication of the standard paves the way internationally for the structured and verifiably safe development of AI-based systems, establishing a reference standard that can lead to an AI seal of quality. fortiss software experts Harald Rueß and Henrik Putzer believe there is cause for concern if we fail to create standards. In an interview the Munich scientists, who played a leading role in the development of the new standard, talk about the demands and reality of a technology that is surrounded by an abundance of myths.

The topic of artificial intelligence has created considerable anxiety among the populace. Do you believe the fear of machine superiority is justified?

Putzer: If you look at the currently available technologies, these concerns are unfounded. AI is more like a new type of engineering. It’s essentially not something that can achieve goals on its own or accomplish tasks that it was not intended for. AI might discover new solutions perhaps, but not new tasks. The fear of a world dominated by artificial intelligence is unfounded. But we do have to be concerned if people utilize AI without the necessary basic knowledge or if there are nefarious intentions behind it.

Does this relate to your concerns about a lack of quality standards with AI technologies?

Putzer: Yes. The concerns about the quality of AI have three dimensions. First, the developer must be skilled and able to develop technologies such that the user can easily deal with them. Secondly, the manufacturer must clearly specify where the limits of this AI technology lie – in other words, define what it can’t do. Thirdly, the user must then employ the AI technology in line with this specification. The first issue – how do I develop a good product? – is where the quality standards come in. The second – how do I describe the possibilities? – also involves the quality standards. The third issue is more a question of ethics: for what purpose am I using it?

What do we mean exactly when we talk about AI? You often get the impression that a lot of people are talking about completely different things.

Rueß: Thanks to on-going digitalization, we have access to enormous amounts of data. At least the potential is there. And we should also utilize this data. This is exactly where AI technologies come into play. We have discovered that machine learning methods, particularly deep learning approaches, are highly scalable when processing and detecting certain statistical correlations from this data. We can take advantage of these correlations to optimize processes and systems and if needed, support specific decisions as well. An underlying causality remains closed to today’s available AI technologies. Causality on the basis of symbol grounding – having a genuine understanding – is still reserved for humans. This is something that must be clear to the developer, manufacturer and the user, otherwise it can create a dangerous situation.

Rueß: Here is an example. You could conclude on the basis of available data that a correlation exists between the number of storks and the birthrate. Of course, you have to be careful in deriving a causality or logical correlation here. Sticking with our example here, you could conclude that in order to increase the birthrate, increase the number of storks.

This presents a danger for the human observer who has nothing to do with AI. We always interpret things on the basis of causalities, although statistical correlations are being dealt with in the background.

We over-interpret the result of this AI even though it’s highly dependent on where we get the data from. After all, some regions have no storks.
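The stork example can be reproduced in a few lines. The sketch below uses invented numbers for four hypothetical regions (they are not real statistics) and assumes that both storks and births simply grow with the size of the region. The raw counts then correlate clearly, while the per-capita rates do not correlate at all.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented toy data for four hypothetical regions (illustration only).
population_thousands = [10, 20, 30, 40]   # hidden common factor: region size
storks = [20, 20, 30, 80]
births = [100, 200, 330, 440]

# Raw counts: both grow with region size, so they appear related.
r_raw = pearson(storks, births)

# Per-capita rates: the common factor is removed.
stork_rate = [s / p for s, p in zip(storks, population_thousands)]
birth_rate = [b / p for b, p in zip(births, population_thousands)]
r_rate = pearson(stork_rate, birth_rate)

print(f"correlation of raw counts:       {r_raw:.2f}")
print(f"correlation of per-capita rates: {r_rate:.2f}")
```

A system that only sees the raw counts would pick up the first correlation. Whether it means anything depends on where the data comes from and which factors were left out.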

Joseph Weizenbaum developed an “AI” system named Eliza as early as the 60s. Although this computer program does nothing more than simply paraphrase user input, for the most part the test subjects were convinced that Eliza was actually responsible for their problems. Even when interacting with newer AI technologies, such as the recently published GPT-3 from OpenAI, we’re tempted to attribute intelligent behavior to this computer program, at least until the point comes when it completely screws up. This is where the biggest challenge lies for companies and their business ideas. Although we are seeing how much money, research and effort is being invested in AI technologies, many of the application scenarios – with the exception of autonomous driving – are still unclear.

Where is the real business potential of AI?

Rueß: By analyzing data with AI, we understand what’s efficient or what will completely go wrong, and can then try to optimize these things. This is happening at the moment with applications such as predictive machine maintenance, where AI is employed to predict potential interruptions and beyond that even suggest the most efficient repair scenarios.

It gets even more exciting as we begin to utilize AI in a new generation of products and services based on increasingly autonomous machines and controls. In this case it’s not only about optimizing processes and systems. On the contrary. Having command of these technologies will be crucial for the ability of industry to remain competitive and for the welfare and prosperity of our society.

In the words of economist Joseph Schumpeter, AI could be the one new revolutionary technology with the potential to have a profound impact on the economy and serve as the technical foundation for a long-term economic upswing.

In what industries does it even make sense to employ AI-based systems?

Rueß: Wherever and whenever data accumulates, whether during development or production or in logistics and customer management. It’s like the field of statistics, which doesn’t have a specific field of application either. This is the reason for the current hype about AI. It can be used everywhere and it scales so well for huge amounts of data. One key field of application is support for decision making; AI as an assistant. AI helps us to do specific things. That’s absolutely comparable to a manager. AI helps to take the data that exists and prepare it so that it can be comprehended and used as the foundation for making decisions that are as evidence-based as possible. But this is exactly why AI has to be transparent and capable of interacting. It also has to offer an explanation component in order to avoid erroneous conclusions like in our example with the stork and the babies. Furthermore – and this is actually our main issue – AI will increasingly be used in safety-critical systems. Examples include autonomous driving, medical engineering, controlling turbines and also controlling intelligent power grids. If AI technology gets out of control in these environments, it can cause major damage. And the question is, how can we prevent that from happening? How can we develop and operate AI in such a way that no damage is foreseeable?

Putzer: At the moment, AI cannot accomplish anything by itself. It’s merely a type of automation of some kind of sub-task. The bottom line is, we depend on things being developed so that humans can easily work with AI; so that together humans and AI can achieve better performance than a human can on his own. With the current approaches, AI alone is meaningless. The only purpose that automation serves, even AI-based, is to help a human.

Rueß: Here’s a real-life example. A medical technology company trained a neural network on 40,000 data records to diagnose diabetic retinopathy. They came to us and wanted to make a product out of it. We then conducted an analysis and used our “neural network dependability kit” to improve it. The whole thing is now a lot more efficient than before. But the important thing was that we also integrated an explanation component, so that physicians now receive additional feedback: not only whether it’s diabetes or not, but also why it could be diabetes, so that they can develop a more precise picture: do I trust this diagnosis or not?

This is where we hear the often-quoted request to look under the hood when it comes to AI.

Putzer: If you have a structured approach to creating such AI systems – clear criteria which you can use to make decisions – then you have a good method, including design steps, verification and validation. Then you have a basis to simply judge whether this product is good. Consumers have long been unable to look under the hood. You need technical experts. And the technical experts have to be given the methods that allow them to reasonably develop and test the product. You can then build on that by developing quality criteria and seals of quality, which can also be used by consumers to see that the product is a high-quality and trustworthy innovation.

Rueß: AI works relatively well and is better than traditional methods in many applications. With image recognition for instance, the recognition rate has experienced an unbelievable increase through the use of AI – perhaps as high as 95 to 98 percent, which is phenomenal. But if you now employ traffic sign recognition in a safety-critical system, you don’t need 95 to 98 percent, rather significantly more than 99.9999 percent in order to demonstrate that no unacceptable damage is likely to be caused. And this is exactly the basic problem. It’s precisely here where the development methods have to be provided in order to come up with precisely such arguments, so that the use of such AI technologies can be accepted by society.
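The gap between 98 percent and 99.9999 percent becomes tangible with a simple statistical estimate. The sketch below is not taken from the standard. It assumes independent test samples and asks how many of them must pass without a single failure before one can claim, with a given confidence, that the error rate lies below a target. The function name and the 95 percent confidence level are illustrative choices.

```python
import math

def failure_free_tests_needed(max_error_rate, confidence=0.95):
    """Number of independent tests that must all pass to support the claim
    'error rate < max_error_rate' at the given confidence level.

    Derivation: if the true error rate were p, the chance of n passes in a
    row is (1 - p)**n. Requiring (1 - p)**n <= 1 - confidence gives
    n >= ln(1 - confidence) / ln(1 - p).
    """
    if not (0 < max_error_rate < 1 and 0 < confidence < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log1p(-max_error_rate))

# 98 percent recognition rate corresponds to an error rate of 0.02.
tests_for_98 = failure_free_tests_needed(0.02)

# 99.9999 percent corresponds to an error rate of one in a million.
tests_for_six_nines = failure_free_tests_needed(1e-6)

print(tests_for_98, tests_for_six_nines)
```

Under these assumptions the first claim needs on the order of a hundred failure-free samples, while the second needs roughly three million. This is one reason why testing alone is not enough and structured arguments are needed.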

How can you even create a standard that enables an understanding of these processes?

Rueß: Researchers and developers, including those at fortiss, have acquired a lot of experience and knowledge about developing AI-based systems over the past several decades. A common approach has meanwhile crystallized, from gathering data and training these types of networks, to analysis. We took this valuable experience and integrated it into the current implementation guideline. These guidelines have now been documented in six volumes, each with roughly 60 pages.

What are the most important measures that were established for this process?

Putzer: What was it like when we massively began to utilize electronics in the earlier days? This was also a new technology at some point. Back then people were also asking: When does such an electronic thing malfunction? What is the likelihood that it functions correctly? This led us to measures like failure rates (from coincidental malfunctions), all developed over decades through lots of testing and experience.

Afterward came the software, at which point we realized that failure rates are not the right approach. We have to develop things so that they have no malfunctions. Software, in and of itself, can’t go bad or age. This led us to create corresponding systematic design approaches in order to avoid systematic failures.

AI now presents us with a new category in addition to coincidental (primarily hardware) and systematic malfunctions. So you have to develop measures to reduce them. There is a collection of best practices – cases where it was established when and how it functioned. That was the first step for the standard, where we began to collect and work on the information to determine which requirements have to be placed on an AI element – which after all is just one part of an overall system – and then determine how to implement and validate these requirements. At the end of the day we have an approach to manage this third type of malfunction, which is related to uncertainty.

Today when we say AI, we’re talking mainly about neural networks. That’s actually just a very small part of AI. There are formal methods that do a good job of managing the other elements of AI. When it comes to the neural networks however, it’s difficult. That means we need new methods that make it possible to determine the likelihood of failure or failure rate or functionality probability.

When it comes to widespread use of a technology, especially with AI, it involves a lot of trial and error. And if you have enough processing power, you can try out that trial and error approach. There are people with lots of money and lots of processing power. You can read a bit of envy into that statement, but research comes from the other angle and says, okay somehow I have to grasp that issue as well. Trial and error is not everything. And that’s where we come in. So we’re getting closer to the third pillar – approval. What is the status? The status is, you have to have clear argumentation with proof. Even the FDA (US Food and Drug Administration, which approves medical products) says that. Although the FDA accepts the fact that products contain AI technology, it expects sound argumentation. We are in the process of asking what such argumentations should look like and how they should be structured. What the underlying structure is, what the minimum requirements are, so that you are in a position to accept something. With argumentation, you need more than just a structure – you also need proof or evidence. When I take a look at something in a piece of old software, I can say that it contains no flaws. But I must have carried out tests that prove there are no flaws, at least in the important parts of the software. Both research and industry are developing these types of metrics, tests and proof. And they are working together as evidence of structured development and structured argumentation so that we can assume the trustworthiness and safety of the networks.

How does a method function that actually gives us verifiability? Can you cite an example?

Putzer: In principle, even an AI technology is subject to a structured process with certain phases, as long as it wasn’t developed in a chaotic fashion. And one of the initial phases is locating and setting up valuable data, because the AI – or at least the neural networks – can learn from this data. In a nutshell, if I put garbage in there, then garbage comes out. The question now becomes, garbage or not garbage? With data it’s diverse. That’s why we have corresponding metrics. For instance, do I cover every type of traffic participant? Every weather situation?
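A coverage question of this kind can be checked mechanically. The following sketch assumes that every training sample carries two hypothetical tags, `participant` and `weather`, and lists the combinations that never occur in the dataset. The tag names, categories and samples are invented for illustration.

```python
from itertools import product

def missing_combinations(samples, participants, weather_conditions):
    """Return every (participant, weather) pair with no sample in the dataset."""
    required = set(product(participants, weather_conditions))
    seen = {(s["participant"], s["weather"]) for s in samples}
    return sorted(required - seen)

def coverage_ratio(samples, participants, weather_conditions):
    """Share of required combinations that occur at least once."""
    required = len(participants) * len(weather_conditions)
    missing = len(missing_combinations(samples, participants, weather_conditions))
    return (required - missing) / required

# Invented toy dataset: only the tags matter here, not the images.
samples = [
    {"participant": "pedestrian", "weather": "clear"},
    {"participant": "pedestrian", "weather": "rain"},
    {"participant": "cyclist", "weather": "clear"},
    {"participant": "car", "weather": "clear"},
    {"participant": "car", "weather": "rain"},
    {"participant": "car", "weather": "fog"},
    {"participant": "pedestrian", "weather": "clear"},
]
participants = ["pedestrian", "cyclist", "car"]
weather_conditions = ["clear", "rain", "fog"]

gaps = missing_combinations(samples, participants, weather_conditions)
ratio = coverage_ratio(samples, participants, weather_conditions)
print(gaps, ratio)
```

Each reported gap is a situation the network has never seen, for example pedestrians in fog, and therefore a candidate for additional data collection.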

On the other hand, if I’ve trained my network, I can look and see how well trained it is. How many of my test patterns – and that will never be all of them – does it recognize as correct? And then of course I begin to take not only the patterns that I trained, but explicitly try out other patterns to see how the network handles them. How many people does it identify as a bush and how many bushes does it identify as people? These types of metrics are interesting as well of course.
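The people-versus-bushes question is a confusion count. A minimal sketch, with invented labels standing in for the output of a trained network on held-out test patterns:

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

def misclassification_rate(counts, true_label, predicted_label):
    """Share of samples with true_label that were predicted as predicted_label."""
    total = sum(n for (t, _), n in counts.items() if t == true_label)
    return counts[(true_label, predicted_label)] / total if total else 0.0

# Invented test results (illustration only).
true_labels = ["person"] * 5 + ["bush"] * 5
predicted_labels = [
    "person", "person", "bush", "person", "bush",   # the five people
    "bush", "bush", "bush", "person", "bush",       # the five bushes
]

counts = confusion_counts(true_labels, predicted_labels)
people_seen_as_bush = misclassification_rate(counts, "person", "bush")
bushes_seen_as_people = misclassification_rate(counts, "bush", "person")
print(people_seen_as_bush, bushes_seen_as_people)
```

The two rates are reported separately on purpose. In a vehicle, a person mistaken for a bush is far more dangerous than a bush mistaken for a person, so a single accuracy figure would hide the relevant risk.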

And with the heat maps, I have a method that I can use to see through a pattern why the network provides a specific output – why it recognized an image as a pedestrian or a bush. In one case for instance, there was an attempt to identify pedestrians with CNNs. After the development phase, heat maps were used for the analysis and it was discovered that pedestrians were identified mainly by their feet. That might be okay for general applications, but for the automotive sector it’s insufficient. In many cases pedestrians are walking behind something, which means you can no longer see the feet. That’s why during development, you have to apply a wealth of examples in the datasets in which the feet of the pedestrians are hidden.
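One simple way to obtain such a heat map is occlusion: cover one part of the image at a time and record how much the output drops. The sketch below is an assumption about how this could look, not the method used in the case described. The function `toy_pedestrian_score` is a hypothetical stand-in for a trained network that deliberately looks only at the bottom row of the image, the "feet".

```python
def toy_pedestrian_score(image):
    """Hypothetical stand-in for a trained network.
    It scores an image using only the bottom row of pixels (the 'feet')."""
    bottom_row = image[-1]
    return sum(bottom_row) / len(bottom_row)

def occlusion_heat_map(image, score_fn, fill=0.0):
    """For each pixel, blank it out and record how far the score drops.
    Large values mark regions the model relies on."""
    baseline = score_fn(image)
    heat = []
    for r, row in enumerate(image):
        heat_row = []
        for c in range(len(row)):
            occluded = [list(pixels) for pixels in image]  # copy the image
            occluded[r][c] = fill
            heat_row.append(baseline - score_fn(occluded))
        heat.append(heat_row)
    return heat

# Invented 4x4 grayscale "image" of a pedestrian (values are illustrative).
image = [
    [0.2, 0.9, 0.9, 0.2],  # head
    [0.2, 0.9, 0.9, 0.2],  # torso
    [0.2, 0.8, 0.8, 0.2],  # legs
    [0.1, 0.7, 0.7, 0.1],  # feet
]

heat = occlusion_heat_map(image, toy_pedestrian_score)
for row in heat:
    print([round(v, 3) for v in row])
```

For this toy model the heat map is zero everywhere except the bottom row, which exposes the dependence on the feet. A real analysis would run the same loop against the actual network, typically with larger occlusion patches.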

Or turn it the other way around. Why is a bush identified as a pedestrian? Perhaps because it has two branches that are hanging down. You can detect that with the heat maps, but then you still have other characteristics – that the bush is very round for instance. And then you have to give the network a few more examples of bushes and say: these are not pedestrians. Please learn that.

Rueß: Many of us who have trained an artificial neural network have already fallen into a trap. In one case, a network was trained to steer a vehicle automatically. Since the underlying dataset was not that representative, the network merely meandered along the median strip, which was not a good thing when the vehicle had to cross a bridge for the first time. That’s what I meant before. The systems learn statistical correlations on the basis of data. And this data doesn’t necessarily have to reflect reality. They also can’t generalize or abstract content. They can only repeat what they have learned, not deal with new situations. In these cases, their behavior is impossible to predict.

AI applications have progressed much further in other countries. Why is this important standard originating from Germany of all places?

Rueß: Germany boasts a strength that everyone is envious of: an engineering-oriented approach to system development; precision and thoroughness. We are simply playing to our strengths and combining an engineering-oriented approach with AI technologies that, to be honest, are not being developed in Germany. But we hope we have an edge that will help us mold such systems into market-ready products under realistic conditions as well. The world is familiar with the expression “German engineering” and now we’re establishing the standard for “German AI-engineering.”

What are the next concrete steps for the standard?

Putzer: For one thing, there is now a Version 1.0. Companies from various industries – rail, aerospace, automotive, IT – will be implementing this version in short order. In other words, we will have a practical test and a corresponding evaluation, which must lead to improvements in Version 2.0 in order to prevent smaller companies in particular from getting hung up on red tape and instead be able to share in and help shape this key technology of the future. This would actually bring the standard in line with the latest technology.

The other thing is to give additional thought to this standard. The reference framework that we now have for the standard is especially good for development activities. Of course at some point you’d like to analyze whether someone can actually do a good job of developing with it. That means, you could define the degree of maturity and utilize the VDE implementation guideline as a reference.

In any case, you’d then take a look at how a technical inspection organization can actually inspect the whole thing. Developing an AI system is one thing. But then at some point, within a relatively short period of time, I also have to analyze whether it’s good. I can’t take years to do this. In some areas, the situation is such that the accredited organizations are involved during the development and assess it while it’s being carried out, which leads to a corresponding delay in the inspection. But we have other industries where an approval can be done within a couple of weeks. You have to give some thought to this type of inspection however, because it requires completely different know-how, methods and practices.

Rueß: The standard should be rolled out worldwide. There is already a lot of interest in implementing the standard as is, especially in strong industrial countries in the Asian region.

Putzer: For that reason we paid particular attention to how it was written. VDE-AR-E 2842-61 is one of the few implementation guidelines not written in German, but in English.

Perhaps people aren’t even aware how much of an influence a standard can have on daily life and how much these standards can act as drivers to make products successful.

Putzer: There are lots of examples in our daily lives. Take mobile phones for instance. There was a time when every model had a proprietary charger. Today nearly every mobile phone (and many other electronic assistants) can be charged with a uniform USB device. That’s standardization.

Rueß: Today’s software-based flight control systems were first made possible through standards that were to some extent developed in the 70s or 80s. And so far, we’ve had no crashes, at least involving commercial aircraft, which were tied solely to software malfunctions. That’s a tremendous accomplishment. We’re striving for similar performance for increasingly autonomous AI-based systems.

Putzer: We depend on standards every minute of the day. Without them, screws would never fit or stay tightened, buildings would not be safe, finding the right tires for a vehicle would be a matter of luck and, looking at the supermarket shelves, we would have no way of knowing if something was healthy or even harmful to our health.

Would you say that an AI standard should be a condition for the approval of an autonomous vehicle?

Rueß: Yes. You can also rely on de facto standards by simply trying them out. We see a tendency toward this approach in the area of autonomous driving; certain companies that simply press ahead. This is the alternative to the engineering-oriented approach that we preach here. For obvious reasons, we consider the tests currently being carried out by certain companies, on and with customers, as irresponsible.

What concrete dangers does this trial-and-error method present in the area of autonomous driving?

Rueß: We’re talking about systems that in some cases confuse trucks with clouds. These are systems that can also be externally manipulated and lured to switch to the wrong lane for example. Some believe that you can manage all of these situations as long as you have a human driver as a fallback position. Even the operating manual states that these functions may not be operated without the driver’s attention and only on highways. In this case the system functions. Of course there are users who neglect these guidelines in line with the motto, “but everything has worked up until now.”

That means this VDE standard would be compatible with what the ethics commission requires?

Rueß: All of the relevant commissions preach that this is exactly what we need to really get products to the market. Our VDE implementation guideline is already doing that.

Putzer: The key is the separation of ethics and technology. What we don’t do is establish the ethics principles within the implementation guideline. These ethics goals and principles need to come from the respective part of society that is using the product. The implementation guideline can then demonstrate how these principles can also be verifiably implemented and adhered to. We draw a clear line between the ethics principles and the technology. If ethics principles exist, we illustrate how the technology is to be implemented, so that it complies with these principles.

Rueß: Let me give you an admittedly somewhat shopworn example. In an accident situation, an autonomous vehicle faces a decision as to whether it should run over a child or an older woman. In western societies, you often hear the answer that the older woman already has most of her life behind her, thus the child should be protected. In China the answer would be the other way around, because the older woman possesses knowledge and experience acquired over an entire life. An engineer cannot and should not deal with such issues. And there is no clear-cut answer. That’s why we decided that ethics issues are outside the scope of the engineering approach and development of these systems.

Putzer: Perhaps not outside the scope, but at least such that the developer stipulates that the requirements for the system under development have to be delivered. In other words, the developer of the AI technology points out that the responsibility for this issue will be delegated to the customer.

How was this standard actually created?

Putzer: We began with IEC 61508, one of the most well-known safety standards. It serves as an overall standard from which various industries have created a wealth of derivations. We used this standard as a core, but then expanded various parts of it.

What’s new is the solution level, where humans and complex interactive systems are taken into account. For example, a luggage transport vehicle, a conveyor and the robot that puts it all together. This interaction is taken into account. We then examined not only errors caused by malfunctions (safety), but also in combination with things like IT security and safe utilization. We call that trustworthiness. We then handed the engineer an instruction that describes how to get from this abstract layer to the AI level (system). For the AI components, we stipulated how to avoid all of these error modes (technology level).

The issue of security, hacking attacks and cyber security makes it clear just how important standards are to external security as well.

Putzer: AI development is one thing. But the fact that it’s reliable and more or less looks after human lives, we’re not quite there yet. Or let’s say, the only way to get there is through clean engineering. Our implementation guideline makes sure this happens. It’s not only industry-independent, but aspect-independent. It can be applied to safety – in other words problems stemming from internal errors – as well as to security, which relates to issues originating from the outside. Or it can be applied to problems that result from improper utilization, which allows you to achieve ethics goals. Aspects such as fairness in traffic for instance, so that the AI technology does not have an advantage over the driver. We can cover all of these aspects.

Rueß: Generally speaking, we’re convinced that an engineering-oriented approach to developing AI systems is necessary to be able to employ these things in a responsible manner. We can build increasingly autonomous machines at the moment. That’s not the main issue. The validation step is exactly what is required for market approval, in other words for a product that also conforms to socially-accepted norms.

What does the new standard mean for you personally?

Putzer: A lot of blood, sweat and energy went into creating the VDE-AR-E 2842-61 implementation guideline over the past three years. Research institutes, renowned industrial companies and SMEs all made contributions. The result was a trailblazing and modern reference framework that has drawn international attention and which can be used for the development of dependable autonomous/cognitive systems with underlying AI technology. This was really a major step. VDE-AR-E 2842-61 has the potential to decisively bring Germany forward in the area of AI. To do that, the implementation guideline has to enjoy wide practical use, undergo evaluation and be expanded. This is what we are calling for!
