Homepage for Stephen Temple

AI Prompt Engineered To Automate The Use Of The Ten Golden Rules By Different AI Platforms

© 2025 Temple. All rights reserved.

Engineering this prompt took a lot of time and effort because the requirement was for it to be AI Platform agnostic. The identical process steps had to be taken for every run, every case study and every AI Platform. That would be easy enough for a computer-based evaluation, but AI Platforms behave differently, and it took a lot of trial and error to get right. The challenge was to have guard rails strong enough to make the Platform follow the process steps (and not disrupt the process with friendly suggestions) but not so aggressive that they triggered the platform's own security monitors against jailbreaks. This was a problem with Claude but readily soluble with help from the Claude Platform. It was also a problem with Grok, but that platform was not the least bit helpful in finding a solution, so Grok was dropped from the research. After these development issues were solved the Prompt ran reliably on Anthropic Claude Sonnet 4.5, ChatGPT 5.1 and Gemini 3 Pro Thinking.

The use of the Prompt is available free of charge to any UK university researcher for the purposes of their research.

To apply for a copy of the Prompt and its free use, just email me at stephen.temple@ntlworld.com


Background information on the Prompt Development

The user interface

The focus was on minimising information overload, with three layers of detail (a rough sketch follows the list below):

  • A very simple “likely to succeed or fail” test using a traffic light presentation.
  • More detail only if requested.
  • Even more detail on what was in the case study that was out of scope and therefore not evaluated (this was for hybrid case studies that were only partially in scope).
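
As a purely illustrative sketch (the class and field names below are hypothetical, not taken from the Prompt), the three layers of output could be held in a structure along these lines:

```python
from dataclasses import dataclass, field

@dataclass
class CaseStudyVerdict:
    """Hypothetical container for the three layers of output shown to the user."""
    traffic_light: str                      # headline "GREEN" / "AMBER" / "RED" verdict
    detail: str = ""                        # fuller reasoning, shown only if requested
    out_of_scope: list[str] = field(default_factory=list)  # parts of the case study not evaluated
```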

A scoring system was embedded. Initially its purpose was to measure divergence between platforms. It was never intended to measure the probability of success; that would have given it a spurious accuracy. The scores were provided in the book just to show how far adrift a case study was from the next traffic light.

AI Platform Agnostic Challenge

All three LLMs follow their own company guidelines to be helpful to users, engage in conversation, and offer to show them possibilities. This is the last thing that is needed in the middle of a disciplined, repeatable process. This behaviour had to be contained by adding guard rails, and the wording of the guard rails to follow the process had to be very aggressive to stop this tendency to break in with helpful suggestions.

However, making the language too aggressive triggered the safety monitors on the Claude and Grok AI Platforms, which flagged the prompt as a possible jailbreak attempt. Anthropic Claude provided helpful advice on how to add discipline to the process without tripping the safety alarms. Grok would not engage to find a solution and that platform was dropped from the research project.

It took a lot of trial and error to get right.

Process Flow

Out of Scope or Partially Out of Scope Case Studies

The prompt included a feature to detect case studies that were totally out of scope, and that worked. But case studies with content both in and out of scope needed a solution: only the in-scope part would be evaluated, and at the end of the evaluation the Prompt gave the user the option to ask for the list of things that were out of scope and therefore not evaluated.

Sufficiency of the information provided in the case study

The reliability of the assessment depends critically on the sufficiency of the information contained in the case study document. That is all the AI Platforms have to work from. For this reason an assessment is made by the AI Platform of the breadth and detail of the report relative to what it would need to apply the ten golden rule tests.

The “sufficiency” is determined on the following 3-point scale:

  1. Low sufficiency – Patchy, anecdotal, or very high-level; several rules cannot be scored without guesswork. Diagnostic is not reliable overall.
  2. Medium sufficiency – Rich narrative with clear evidence on the foundations (Rules 1–3) and some others, but gaps or ambiguities for a few rules. Diagnostic is reliable on structural verdict (RED/AMBER/GREEN and main failure modes), less reliable on fine-grained per-rule scoring.
  3. High sufficiency – Systematic description of goals, governance, economics, market design, standards, R&D wiring, demand, and statecraft; explicit mechanisms, dates, powers, and funding. Diagnostic is reliable at both overall verdict and detailed rule-by-rule scoring.

This single-word descriptor is added to the traffic light summary page. In effect it tells the user how reliable the evaluation result is.
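
As an illustration only (the enum and value names are hypothetical, not the Prompt's own wording), the scale attached to the summary page could be encoded along these lines:

```python
from enum import Enum

class Sufficiency(Enum):
    """Hypothetical encoding of the 3-point sufficiency scale described above."""
    LOW = "Low"        # patchy or anecdotal; the diagnostic is not reliable overall
    MEDIUM = "Medium"  # reliable on the structural verdict, less so on per-rule scoring
    HIGH = "High"      # reliable at both the overall verdict and the rule-by-rule scoring
```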

Scoring Model to Drive The Traffic Lights

A 5-level scoring model was adopted (0-4) where each level had a descriptor defining it. The scores for Rules 1 to 3 were given a higher weighting to reflect their high impact on the probability of success.

If Rule 3 scored zero, or Rules 1 and 2 both scored zero, the AI Platforms were instructed to give a red traffic light irrespective of how high the other Rules scored. For example, no matter how good the technology turns out to be, if the economic numbers do not add up the initiative will fail.
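
A minimal sketch of this part of the scoring logic is given below. The actual weights, and the way weighted scores map onto the 0-30 range used by the traffic light bands, are not published here, so the 2x weighting on Rules 1-3 and the rescaling to 30 points are assumptions; only the hard-red override mirrors the rule stated above.

```python
# Illustrative sketch: WEIGHTS and the rescaling to 30 points are assumptions,
# not the Prompt's actual values; hard_red() follows the override stated above.

RuleScores = dict[int, int]  # rule number (1-10) -> score on the 0-4 scale

WEIGHTS = {rule: (2.0 if rule <= 3 else 1.0) for rule in range(1, 11)}  # placeholder weights
MAX_WEIGHTED = sum(w * 4 for w in WEIGHTS.values())  # highest possible weighted sum

def points(scores: RuleScores) -> float:
    """Weighted sum of the per-rule scores, rescaled onto the 0-30 band range."""
    raw = sum(WEIGHTS[rule] * score for rule, score in scores.items())
    return 30.0 * raw / MAX_WEIGHTED

def hard_red(scores: RuleScores) -> bool:
    """Override from the text: Rule 3 at zero, or Rules 1 and 2 both at zero,
    forces a red light however high the other rules score."""
    return scores[3] == 0 or (scores[1] == 0 and scores[2] == 0)
```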

From scores to traffic lights

Three bands were created to determine the colour of the traffic light (a sketch of the mapping follows the list):

  • Green – 22-30 points
  • Amber – 10-21 points
  • Red – 0-9 points
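
A minimal sketch of the banding, using the thresholds above together with the hard-red override from the scoring model (the function name is illustrative):

```python
def traffic_light(points: float, force_red: bool) -> str:
    """Map a 0-30 points total onto the three bands; force_red applies the
    override from Rules 1-3 in the scoring model."""
    if force_red:
        return "RED"
    if points >= 22:
        return "GREEN"
    if points >= 10:
        return "AMBER"
    return "RED"

# Example: a mid-range case study on 18 points with no override -> AMBER.
print(traffic_light(18, force_red=False))
```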

The widths of the bands were set to give slightly less space at the two ends. It is not the intention of the scoring system to lend a false precision to the diagnostic tool: the three traffic lights are the design granularity that matches the job to be done on a case study. This brings us on to score divergence.

Score Divergences

Three AI Platforms all evaluating the same case study using an identical process may well emerge with different scores for three reasons:

i) Reflecting different insights (the divergence we want)

The score divergence that adds valuable insight arises from:

  • Different depth of relevant training data
  • Judgment about emphasis/weighting
  • Different but defensible interpretations
  • Where human experts would also disagree

ii) From poorly drafted golden rules

The score divergence that adds no value arises where the particular way a golden rule was drafted allowed very different interpretations of what was meant. This led us to elaborate the golden rule descriptions, but that made them more difficult to read. So the detailed version was used in the prompt and can be found in Annex Three, and we then produced an Executive Summary version to go into Chapter 2.

iii) The divergence outside of our control

This included:

  • The temperature (i.e. where on the scale from deterministic to random the LLM is set) is a decision by the AI Platform provider.
  • The LLM's judgement on whether to work entirely from its training data or to search the web.
  • Random inconsistencies.

Having three AI platforms independently driving three traffic lights ironed these differences out through triangulation.
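
As a rough sketch of the idea (the Prompt's actual triangulation wording is not reproduced here), the three verdicts could be combined so that any divergence stays visible rather than being hidden:

```python
from collections import Counter

def triangulate(verdicts: list[str]) -> tuple[str, bool]:
    """Combine the three platforms' traffic lights: return the most common
    colour plus a flag showing whether the platforms diverged at all."""
    counts = Counter(verdicts)
    colour, _ = counts.most_common(1)[0]
    return colour, len(counts) > 1

# Example: two GREENs and one AMBER -> ("GREEN", True); the divergence is surfaced, not suppressed.
print(triangulate(["GREEN", "GREEN", "AMBER"]))
```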

Expected Divergence

This could be up to 20% over the development cycle, but narrowed as the interpretative differences in the golden rules were reduced. Being this high was of no practical consequence for two reasons: the three-light traffic light is our intended scoring granularity, and a large divergence based on one of the LLMs having better domain knowledge in its training data needs to emerge and not be suppressed. It triggers the fourth step in our error suppression protocol discussed elsewhere.

The Three AI Platforms Used

The three AI Platforms were Anthropic Claude, ChatGPT and Gemini. They all performed consistently overall, though each did better or worse in respect of errors and hallucinations at different times.



Why Governments launch technology strategies that cannot win - and the Ten Golden Rules to fix them.






All content here © Stephen Temple