© 2025 Temple. All rights reserved.
Engineering this prompt took a lot of time and effort because the requirement was for it to be AI Platform agnostic. Identical process steps had to be followed for every run, every case study and every AI Platform. That would be easy enough for a computer-based evaluation, but AI Platforms behave differently, and it took a lot of trial and error to get right. The challenge was to have guard rails strong enough to get the Platform to follow the process steps (and not disrupt the process with friendly suggestions) but not so aggressive that they triggered the platform's own security monitors against jailbreaks. This was a problem with Claude, but it was readily solved with help from the Claude Platform. It was also a problem with Grok, but that platform was not the least bit helpful in finding a solution, so Grok was dropped from the research. After these development issues were solved, the Prompt ran reliably on Anthropic Claude Sonnet 4.5, ChatGPT 5.1 and Gemini 3 Pro with Thinking.
The use of the Prompt is available free of charge to any UK university researcher for the purposes of their research.
To apply for a copy of the Prompt and its free use just email me at stephen.temple@ntlworld.com
Background information on the Prompt Development
The user interface
The focus was on minimising information overload.
A scoring system was embedded. Initially its purpose was to measure divergence between platforms; it was never intended to measure the probability of success, which would lend it a spurious accuracy. The scores were provided in the book just to show how far adrift a case study was from the next traffic light.
AI Platform Agnostic Challenge
All three LLMs follow their own company guidelines to be helpful to users, engage in conversation, and offer to show them possibilities. This is the last thing that is needed in the middle of a disciplined, repeatable process. This behaviour had to be contained by adding guard rails, and the wording of the guard rails instructing the platform to follow the process had to be very aggressive to stop this tendency to break in with helpful suggestions.
However, making the language too aggressive triggered the safety monitors on the Claude and Grok AI Platforms, which took it as a sign that a hacker might be trying to jailbreak them. Anthropic Claude provided helpful advice on how to add discipline to the process without tripping the safety alarms. Grok would not engage to find a solution, and that platform was dropped from the research project.
It took a lot of trial and error to get right.
Process Flow
Out of Scope or Partially Out of Scope Case Studies
The Prompt included a feature to detect case studies that were totally out of scope. That worked. But case studies with content both in and out of scope needed a solution: only the in-scope part would be evaluated. At the end of the evaluation the Prompt gave the user the option to ask for a list of the items that were out of scope and therefore not evaluated.
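Purely as an illustration, the sketch below expresses that partitioning step in ordinary code. The section names and the `is_in_scope` predicate are hypothetical; in practice the Prompt achieves this through instructions to the AI Platform rather than through a program.

```python
def partition_case_study(sections, is_in_scope):
    """Split a case study into the part to be evaluated and the part set aside.

    `sections` is a list of section titles; `is_in_scope` is a hypothetical
    predicate deciding whether a section falls within the golden rules framework.
    """
    evaluated = [s for s in sections if is_in_scope(s)]
    excluded = [s for s in sections if not is_in_scope(s)]
    return evaluated, excluded


# Only the in-scope sections are evaluated; the rest feed the optional
# "what was not evaluated" list offered to the user at the end.
sections = ["Section A", "Section B", "Section C"]
evaluated, excluded = partition_case_study(sections, lambda title: title != "Section C")
print("Evaluated:", evaluated)
print("Not evaluated (out of scope):", excluded)
```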
Sufficiency of the information provided in the case study
The reliability of the assessment depends critically on the sufficiency of the information contained in the case study document. That is all the AI Platforms have to work from. For this reason, an assessment is made by the AI Platform of the breadth and detail of the report relative to what it would need to apply the ten golden rule tests.
The “sufficiency” is determined on the following 3-point scale:
This single-word descriptor is added to the traffic light summary page. In effect it tells the user how reliable the evaluation result is.
Scoring Model to Drive The Traffic Lights
A 5-level scoring model (0-4) was adopted, where each level had a descriptor defining it. The scores for Rules 1 to 3 were given a higher weighting to reflect their high impact on the probability of success.
If Rule 3 scored zero, or Rules 1 and 2 both scored zero, the AI Platforms were instructed to give a red traffic light irrespective of how highly the other Rules scored. For example, no matter how good the technology turns out to be, if the economic numbers do not add up the initiative will fail.
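As an illustrative sketch only (the `scores` dictionary keyed by golden rule number is an assumption for the sketch, not part of the Prompt itself), the red-override condition can be written as:

```python
def red_override(scores: dict[int, int]) -> bool:
    """True when a red light is forced regardless of the other rule scores.

    `scores` maps each golden rule number (1-10) to its 0-4 score.
    The override fires if Rule 3 scores zero, or Rules 1 and 2 both score zero.
    """
    return scores[3] == 0 or (scores[1] == 0 and scores[2] == 0)


# Example: Rule 3 scoring zero forces a red light even though
# every other rule scores highly.
example = {1: 3, 2: 4, 3: 0, 4: 4, 5: 4, 6: 3, 7: 4, 8: 4, 9: 3, 10: 4}
print(red_override(example))  # True -> red traffic light
```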
From scores to traffic lights
Three bands were created to determine the colour of the traffic light:
Green – 22-30 points
Amber – 10-21 points
Red – 0-9 points
The widths of the bands were set to give slightly less space at the two ends. The scoring system is not intended to lend a false precision to the diagnostic tool: the three traffic lights are the design granularity that matches the job to be done on a case study. This brings us on to score divergence.
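As a minimal sketch, and assuming the weighted rule scores have already been combined into a 0-30 point total with the red override described above applied first, the band mapping could be expressed as follows; the function name and signature are illustrative only:

```python
def traffic_light(total_points: int, override_red: bool = False) -> str:
    """Map a 0-30 point total onto the three traffic light bands.

    `override_red` should be True when Rule 3 scored zero or Rules 1 and 2
    both scored zero, forcing a red light regardless of the total.
    """
    if override_red:
        return "Red"
    if 22 <= total_points <= 30:
        return "Green"
    if 10 <= total_points <= 21:
        return "Amber"
    return "Red"  # 0-9 points


print(traffic_light(24))        # Green
print(traffic_light(15))        # Amber
print(traffic_light(24, True))  # Red - the override applies
```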
Score Divergences
Three AI Platforms all evaluating the same case study using an identical process may well emerge with different scores for three reasons:
i) Reflecting different insights (the divergence we want)
The score divergence that adds valuable insight arises from:
ii) From poorly drafted golden rules
The score divergence that adds no value arises where the particular way a golden rule was drafted allowed very different interpretations of what was meant. This led us to elaborate the golden rule descriptions, but that made them more difficult to read. So the detailed version was used in the Prompt and can be found in Annex Three, and an Executive Summary version was produced for Chapter 2.
iii) The divergence outside of our control
This included:
Having three AI platforms independently driving three traffic lights ironed these differences out through triangulation.
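The mechanics of the triangulation are not spelled out here. As a hedged sketch, one could treat agreement between at least two of the three lights as the settled result and flag a three-way split for closer human review:

```python
from collections import Counter

def triangulate(lights: list[str]) -> str:
    """Combine three independently produced traffic lights into a single view.

    Full or majority agreement yields that colour; a three-way split is
    flagged for closer human review rather than averaged away.
    """
    colour, count = Counter(lights).most_common(1)[0]
    return colour if count >= 2 else "Review - no agreement"


print(triangulate(["Green", "Green", "Amber"]))  # Green
print(triangulate(["Green", "Amber", "Red"]))    # Review - no agreement
```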
Expected Divergence
This could be up to 20% over the development cycle, but it narrowed as the interpretative differences in the golden rules were reduced. Being this high was of no practical consequence for two reasons: the three traffic lights are our intended scoring granularity, and a large divergence based on one of the LLMs having better domain knowledge in its training data needs to emerge and not be suppressed. It triggers the fourth step in our error suppression protocol discussed elsewhere.
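The exact formula behind the 20% figure is not given here, so the sketch below simply assumes divergence is measured as the spread between the highest and lowest platform totals expressed as a percentage of the 30-point scale:

```python
def score_divergence(totals: list[int], max_points: int = 30) -> float:
    """Spread of the platform totals as a percentage of the 30-point scale.

    This is one plausible definition (max minus min over the scale maximum);
    the exact formula is not specified, so treat this as illustrative.
    """
    return 100 * (max(totals) - min(totals)) / max_points


# Example: totals of 24, 21 and 19 give a divergence of roughly 17%.
print(round(score_divergence([24, 21, 19]), 1))
```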
The Three AI Platforms Used
The three AI Platforms used were Anthropic Claude, ChatGPT and Gemini. They all performed consistently, though each did better or worse in respect of errors and hallucinations at different times.
All content here (c) copyright of Stephen Temple