Homepage for Stephen Temple

Making AI Platforms More Trustworthy

A LOW-COST DESKTOP "INDUSTRIAL GRADE" FOUR-STEP ERROR SUPPRESSION PROTOCOL

One cannot trust a single AI Platform's results for work where accuracy is vital. But that does not mean AI Platforms cannot be used to do important things. The ready solution is the addition of a quality control system.

What follows is a handbook for a quality control system (an error suppression protocol of industrial grade) that works with simple desktop access to an AI Platform and adds about twenty minutes to the few minutes taken by the initial individual AI Platform evaluations, but it does depend upon access to three independent AI Platforms. It may take longer to run using free-access versions of AI Platforms when the evaluations involve heavy computation, as all the AI Platforms ration access time in their free versions.

A significant fraction of the research behind "The Graveyard of Good Intentions" book [1] was developing the Prompt and this associated quality control system. Basic subscriptions were taken out with Anthropic Claude, ChatGPT and Gemini. The error suppression protocol worked well on all three of these platforms.

  1. Introduction

The reasons why we used an AI Platform for evaluating case studies against the Ten Golden Rules in the first place were:

  1. To Take Human Bias Completely Out of the Loop – as authors we had to ensure any views we had would not influence the evaluations.
  2. To Ensure Consistency of the Evaluation of Different Case Studies – so the outcomes from different case studies could be compared.
  3. To Produce Research Results Speedily.

But the current generation of LLMs is prone to making errors, drifting and hallucinating. That was unacceptable. The whole purpose of our book was to stimulate debate about the research results, not discussion of our methodology. We knew there were already widely available AI Platform quality control solutions in the form of commercial software, used where LLMs are deployed for critical industrial or commercial applications. This commercial software orchestrates tasks across multiple AI Platforms, applies harnesses to enforce processes, applies consistency checks and provides an audit log. But we did not have the means of funding it. So we developed our own error suppression protocol. It was based on some earlier modelling research using AI Platforms at the University of Surrey 6GIC, where the discovery was to switch the instructions to two independent LLMs from their analysis function to that of a forensic detective and get each to challenge the results of the other.

We applied this principle to developing our own desktop error suppression protocol. In order to reach industrial-grade error suppression we used three independent Large Language Models. The magic of using "three" is that the results of each LLM were then attacked by two independent LLMs acting as forensic detectives. Residual errors, if any, would then be mitigated by triangulating the three independent error-corrected outputs. The only disadvantage of this low-cost error suppression protocol is that data has to be transferred between AI Platforms by copying and pasting results. Later we show how this was done efficiently, so the end-to-end process took only twenty minutes.
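To make the pairing concrete, the short sketch below (in Python, purely as illustration) writes out who reviews whom. It is not software we ran – the protocol was operated by hand through each platform's chat window – and the platform names simply stand in for any three independent LLMs.

    # Illustrative only: with three independent platforms, every output is
    # attacked by two independent forensic reviewers, leaving three
    # error-corrected outputs to triangulate at the end.
    platforms = ["Claude", "ChatGPT", "Gemini"]

    for target in platforms:
        reviewers = [p for p in platforms if p != target]
        print(f"{target}'s output is attacked by {reviewers[0]} and {reviewers[1]}")

    # Prints:
    #   Claude's output is attacked by ChatGPT and Gemini
    #   ChatGPT's output is attacked by Claude and Gemini
    #   Gemini's output is attacked by Claude and ChatGPT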

  2. Multiple AI Platforms

In the human domain, when much is at stake on expert advice, it is normal to get a second opinion or even a third. That is the foundation of our error suppression approach – don’t trust a single AI Platform but run the same problem on a second, quite independent, AI Platform and, for good measure, a third. But that alone is not sufficient: all three independent outputs will each contain their own errors.

It is possible to ask each AI Platform to check its own result in a second stage, tasking it to look for any errors or hallucinations in its output. Our research across four different AI Platforms showed that the most diligent AI Platform at self-checking its outputs was Anthropic Claude, and the least willing to accept it had made any errors was Grok. But even Claude was not up to the job to the required standard.

My earlier University of Surrey research also showed that asking the independent AI Platforms to work cooperatively together and agree what the right answer was failed to comprehensively and systematically find and eliminate errors. It was more likely to arrive at “split the difference” compromises.

The only thing that worked was “an aggressive confrontation”. This was to set each AI Platform up as a forensic detective and instruct it to aggressively hunt down every error in the other two AI Platforms’ results. This led to each AI Platform being confronted with evidence from two other independent AI Platforms that it had made an error of fact, an arithmetic error, an error of logic or an error of judgement.

We then turned this approach into a repeatable four-step quality control error suppression protocol.

  3. Four-Step Error Suppression Protocol

Step 1 – Attack phase. The character of all the LLMs was changed from deep analysts to forensic detectives. They were given instructions to rigorously seek out any errors of fact, logic or arithmetic, and any hallucinations, in the results from all the AI Platforms.

Step 2 – Confrontation phase. The error reports from all three LLMs were consolidated and given back to all three LLMs. This time the instruction to each LLM was to review all the errors attributed to itself and, in every case, either defend itself or accept it had made an error. If it rejected a challenge, it had to set out why.

Step 3 – Correction phase. Each LLM had to put right the errors it had owned up to, re-assess its evaluation, and produce a revised overall score.

Step 4 – Shoot-out between the extreme scorers. Our particular evaluation approach generated an overall score. This was not the principal output but was a helpful measure of output divergence.

The two LLMs with the extreme scores were each confronted with their opponent’s evaluation and asked to make the case for why the opponent’s interpretation was more faithful to the Ten Golden Rules. Each then had to reconsider its own position in the light of this.

This last step is possibly the most distinctive feature of our error suppression protocol compared with standard commercial approaches. What our “shoot-out” step brought out was interpretative differences.

Say two AI Platforms gave similar high scores, and one gave a low score. The traditional approach would assume the minority LLM scorer had got it wrong. Our shoot-out instead tested judgements and assumptions. In one of our shoot-outs the minority scorer stood its ground, and both majority high scorers conceded their judgements had been wrong.
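The four steps can also be written down as a single control flow. The sketch below is only an outline of the manual procedure described above, under two assumptions: the helper send() stands in for the copy-and-paste exchange with a named platform, and extract_score() stands in for reading the overall score out of a reply. Neither is real software we used.

    # Outline of the four-step error suppression protocol. The helpers are
    # placeholders for the manual copy-and-paste interaction with each
    # AI Platform's chat interface.

    PLATFORMS = ["Claude", "ChatGPT", "Gemini"]

    def send(platform, prompt):
        raise NotImplementedError  # placeholder: post the prompt, return the reply text

    def extract_score(reply):
        raise NotImplementedError  # placeholder: read the overall score from a reply

    def four_step_protocol(evaluations):
        # 'evaluations' maps each platform to its initial case-study evaluation.

        # Step 1 - Attack: every platform acts as a forensic detective, hunting
        # errors of fact, arithmetic, logic and judgement, and hallucinations.
        combined = "\n\n".join(f"--- {p} ---\n{text}" for p, text in evaluations.items())
        attack_prompt = ("Act as a forensic detective. List every error of fact, arithmetic, "
                         "logic or judgement, and any hallucination, in the evaluations below.\n")
        error_reports = {p: send(p, attack_prompt + combined) for p in PLATFORMS}

        # Step 2 - Confrontation: the consolidated error reports go back to all
        # three platforms; each must defend or accept the errors attributed to it.
        consolidated = "\n\n".join(error_reports.values())
        confront_prompt = ("Review every error attributed to you below. For each, either accept "
                           "it or set out why you reject the challenge.\n")
        responses = {p: send(p, confront_prompt + consolidated) for p in PLATFORMS}

        # Step 3 - Correction: each platform corrects the errors it accepted,
        # re-assesses its evaluation and produces a revised overall score.
        correct_prompt = ("Correct the errors you accepted, re-assess your evaluation and give "
                          "a revised overall score.\n")
        revised = {p: send(p, correct_prompt + responses[p]) for p in PLATFORMS}

        # Step 4 - Shoot-out: the highest and lowest scorers each argue the case
        # for the opponent's interpretation, then reconsider their own position.
        scores = {p: extract_score(revised[p]) for p in PLATFORMS}
        low, high = min(scores, key=scores.get), max(scores, key=scores.get)
        shootout_prompt = ("Make the case why your opponent's interpretation is more faithful to "
                           "the Ten Golden Rules, then reconsider your own position.\n")
        for me, opponent in ((low, high), (high, low)):
            send(me, shootout_prompt + revised[opponent])

        return revised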

  4. Evidence of the Effectiveness of the Error Suppression Protocol

Over this one evaluation run of the three case studies a total of thirty-one errors were found and corrected. The number per LLM over the three case study evaluations ranged from seven to twelve. These numbers are given in the book to quantify the success of the error suppression protocol.

Examples of LLM mistakes included:

  • An arithmetic error by one LLM. The excuse given was that, in heavily loaded conditions, the LLM was required to prioritise reasoning over simpler functions, such as arithmetic.
  • A hallucination where the LLM picked up an example appearing in the prompt interpretation guidance and used it as if it existed in the case study to justify a score.
  • An error of fact where the LLM concluded that there was no enacted commitment. It had missed the statement of a £21m AI Diagnostics fund.
  • Misclassifying a commitment that had already been implemented as only a future intention.

This got us to the point of having all the power of AI, with its errors and misjudgements suppressed to the extent possible by the forensic abilities of the three LLMs we used.

Further, if any errors remained, they would be revealed when triangulating the corrected outputs of the three independent evaluations from the three AI Platforms.
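A minimal sketch of what that triangulation can look like is given below, assuming each corrected evaluation has already been reduced to an overall score. The score values and the tolerance are invented purely for illustration.

    # Illustrative triangulation of the three corrected outputs. If the spread
    # between the highest and lowest score is still wide, either a residual
    # error or a genuine interpretative difference remains, and the Step 4
    # shoot-out (or a human arbiter) is needed.
    corrected_scores = {"Claude": 72, "ChatGPT": 70, "Gemini": 58}  # invented example values
    TOLERANCE = 5  # acceptable spread, chosen purely for illustration

    spread = max(corrected_scores.values()) - min(corrected_scores.values())
    if spread > TOLERANCE:
        low = min(corrected_scores, key=corrected_scores.get)
        high = max(corrected_scores, key=corrected_scores.get)
        print(f"Spread of {spread} points: send {low} and {high} to the shoot-out.")
    else:
        print(f"Scores agree within {TOLERANCE} points: accept the triangulated result.")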

  5. Manual Workflow

It was easiest, when working with the three AI Platforms, to paste the output from all three into a single blank Word document. As the outputs will be long, each was pasted in a different text colour, e.g. orange for Claude, blue for ChatGPT and green for Gemini. This made it very easy to find the start and finish of each output and to add the AI Platform titles.

Then, at the top of the composite document, the new Prompt instructions were added, for example to change the LLMs’ role to forensic detectives. The entire document could then be pasted (or uploaded) back to all three platforms. The first three steps would typically take twenty minutes. If the fourth step was required it took longer, but that was time well spent, as it gave insights into the judgemental differences at work and where a human might have to step in and be the final arbiter.

For full traceability, a fifth step could be added: collect all the errors found by all three Platforms in Step 2 and paste them into a Word document as an error log record. But that is only useful in research projects or for fine-tuning the prompt to remove repeated interpretative errors.
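For anyone who prefers to script the assembly rather than paste by hand, a minimal sketch of the same composite-document step is shown below. It simply concatenates three saved platform outputs under headings, with the new Prompt on top, and appends Step 2's error reports to a running log. The file names are assumptions, and it produces plain text rather than the coloured Word document we actually used.

    # Hypothetical alternative to the manual Word workflow: build the composite
    # document from saved output files and keep a plain-text error log.
    from pathlib import Path

    # Assumed file names: one saved output per platform, plus the new Prompt text.
    outputs = {"Claude": "claude_output.txt",
               "ChatGPT": "chatgpt_output.txt",
               "Gemini": "gemini_output.txt"}
    new_prompt = Path("forensic_prompt.txt").read_text(encoding="utf-8")

    sections = [new_prompt]
    for platform, filename in outputs.items():
        body = Path(filename).read_text(encoding="utf-8")
        sections.append(f"===== {platform} output =====\n{body}")

    # The composite document is what gets pasted (or uploaded) back to all three platforms.
    Path("composite_document.txt").write_text("\n\n".join(sections), encoding="utf-8")

    # Optional fifth step for traceability: append the consolidated Step 2 error
    # reports to a running error log (useful mainly for research or prompt tuning).
    with Path("error_log.txt").open("a", encoding="utf-8") as log:
        log.write(Path("step2_error_reports.txt").read_text(encoding="utf-8") + "\n")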

[1] Now published on Amazon




All content here (c) copyright of Stephen Temple