Beyond Responsible AI

Helen Khezrzadeh

By Dr. Matt Bland

TL;DR: Policing is making real progress on responsible AI. Legality, fairness, transparency, governance and public confidence are now being taken seriously. But those safeguards do not quite answer a different question: whether AI tools improve the policing problems they are brought in to solve.

The next step is a shared evidence system: common reporting standards, a useful register, synthesis of findings and ongoing monitoring. That would let forces learn not only whether AI tools work technically, but whether they help in practice.

Artificial intelligence is arriving in policing very quickly. Across England and Wales, forces are trialling everything from policy assistants and automated redaction to intelligence analysis, risk assessment and live facial recognition. The attention it draws, from chief officers, government, researchers and the public, is intense.

What strikes me is that, this time, policing is not walking in blindly. The College of Policing's work on responsible AI, the creation of Police AI, the National Police Chiefs' Council's Covenant for Using Artificial Intelligence in Policing and a long list of exploratory projects all point to a profession trying to think before it deploys. Legality, transparency, accountability, fairness, governance and public confidence are getting serious attention. Recent decisions to pause some uses until they can be properly assessed also show a service that is not willing to be seduced by automation without due diligence.

That is real progress, no two ways about it. Historically, policing innovations have often travelled faster than the evidence needed to understand them. The conversation around AI is already more mature than that.

But there is a question sitting underneath the responsible-AI agenda that I do not think we are asking clearly enough. Not whether AI should be governed, nor whether it should be transparent or fair. Whether, once a tool is in use, policing will actually learn what effects it has, and whether that learning will be useful to the next force facing the same decision.

Working is not the same as helping

The distinction matters, because almost everything else in this piece revolves around it.

Most of the scrutiny AI gets in policing is about performance: is the system accurate, is it lawful, is it biased, does it produce false matches, and can its outputs be trusted? That work is essential, and a lot of it is being done. Live facial recognition, for example, has been through technical testing, equitability studies, legal challenge and independent academic review.

But performance is not the same as effect. A tool can be accurate, lawful and unbiased and still make no difference to the problem it was bought in to solve. It can save time while weakening the quality of decisions. It can work beautifully in a lab test and struggle in a busy custody suite. It can help one group of people while creating new burdens for others. Knowing that a system works is not the same as knowing that it helps.

That second question, whether the tool improves the thing it was meant to improve, is not really a technical or governance question. It is an evidence-based policing question. And it is the one we are not yet answering in a way that accumulates systematically.

So, to be clear about scope, this is not a piece about false positives, accuracy or bias. Those issues are well covered elsewhere. This is about effects: whether AI tools are solving the problems they were introduced to solve, for whom, under what conditions, and at what cost.

It is worth being precise here, because policing has done good work that is easy to mistake for the answer. The PROBabLE Futures and NPCC Responsible AI Checklist, now embedded in procurement guidance and increasingly a condition of funding, asks whether a tool is technically valid, whether it supports accurate and relevant decisions, and whether its use is lawful and proportionate. Its own framing is telling: it is about deciding whether a tool should be used, rather than whether it could be. Those are exactly the right questions to ask before something is switched on. But they do not answer the question that can only be answered afterwards: did it actually work on the problem? The checklist gates responsible use. It does not (and was never meant to) evaluate effect.

The problem is not too little evaluation. It is fragmentation.

At first glance this might sound like a plea for more evaluation. It is not.

Plenty of evaluation is happening. The trouble is that it appears through a sprawl of pilots, operational trials, academic papers, internal reports, vendor assessments and consultancy reviews: different methods, different outcomes, the same outcomes measured differently, and findings written up in formats that do not easily transfer across audiences.

That diversity is not a weakness in itself. Different questions need different methods. An ethnographic study of how officers actually use an AI assistant can reveal things a controlled trial never will; an operational pilot can surface risks technical testing misses. The problem starts when the method, the context and the limitations are not described clearly enough for anyone else to work out what was tested, judge whether the findings might travel, or test them somewhere new.

Consistent reporting will not make findings universally generalisable. Nothing will. But it would let a leadership team approaching a vendor with a pressing problem make use of what others have already learned, rather than starting from scratch or taking the sales deck on trust. As the volume of AI activity grows, policing faces a very old challenge in a new form: making knowledge accumulate rather than fragment.

What medicine worked out

Other fields have been here. Medicine did not fix inconsistent evidence by forcing every study into the same mould. It developed shared standards for reporting different kinds of research. The CONSORT Statement sets minimum reporting standards for randomised trials; STROBE does the same for observational studies; and AI-specific extensions such as CONSORT-AI spell out the extra detail needed to make sense of studies involving AI: how the system was used, how inputs and outputs were handled, how humans interacted with it, and how errors were analysed.

These standards do not dictate the research question or guarantee that a study is good. They do something simpler and more useful: they make it possible to understand what was done, decide how much weight to put on it, and set it alongside other studies.

I would strongly suggest that the point is not for policing to copy clinical trials. Most police AI evaluations will look nothing like them. The lesson is perhaps that evidence becomes useful when it is reported consistently enough to be scrutinised, reused where it fits, and combined with other evidence. Reporting is not the end of the process; it is the foundation everything else stands on:

Evaluation generates the evidence.
Reporting makes it interpretable.
A register makes it discoverable.
Synthesis turns it into shared knowledge.
And decision-makers use that knowledge, alongside legal, ethical, financial and operational judgement, to decide whether to adopt, scale, change or stop.

Take any one of those links out and the chain doesn’t hold.

A test case: live facial recognition

Live facial recognition is the fairest demonstration of this argument I can think of, because it is the most heavily scrutinised AI-enabled intervention in English and Welsh policing. If evidence should accumulate anywhere, it should accumulate here.

A quick search turns up several pieces: the 2018 evaluation of South Wales Police's use of automated facial recognition, produced through the Universities' Police Science Institute at Cardiff; the 2019 independent review of Metropolitan Police trials by Fussey and Murray, held in the University of Essex repository; the 2023 National Physical Laboratory equitability study on technical performance and demographic differences; the 2020 Court of Appeal judgment in Bridges on legality; force websites with deployment records; regulators and civil-society bodies with further analysis.

That is not an absence of evidence. It is an evidence trail, scattered across university repositories, force websites, technical reports, case law and campaign groups.

And notice the range of coverage. One covers accuracy. One covers demographic difference. One covers legality. One covers operational deployment. Almost all of it is about performance or compliance. None of it, alone, answers the question a chief officer actually has to decide: is live facial recognition a worthwhile investment compared with the alternatives? For a technology this expensive, intrusive and resource-hungry, that means coverage, opportunity cost and additionality: what share of relevant places and times can realistically be reached, how many people are identified who would otherwise have been missed, what happens after a match, and how the spend compares with putting the same money into intelligence, neighbourhood policing or investigation.

The evidence to start answering those questions exists in pieces. What is missing is a front door: a single catalogue entry describing the question, the system and version, the setting, the method, the outcomes and the limitations. When I searched the College of Policing's What Works Crime Reduction Toolkit (in June 2026), live facial recognition was not there as a catalogued intervention. The Toolkit has a particular remit and this is not a criticism of it, but it does show the gap.

A register could be that front door. Not a sprawling national database, just checklist-compliant metadata, links to the source material and a clear note of any access restrictions, open where it can be and controlled where it must be. Encouragingly, a national, public-facing registry is now being built as part of policing's new AI centre, which makes this less a call to invent something than a chance to shape what it captures. Catalogue evidence consistently, and lightweight synthesis tools, fittingly built with AI, could find comparable evaluations, summarise their findings and limitations, and show honestly where the evidence is too thin to support a confident answer. Not to replace expert judgement, but to make a fragmented evidence base navigable.

The facial recognition case makes one last point. The thing being evaluated is never just the algorithm. It is the whole intervention in use: the technology, the threshold, the users, their training, the workflow, the local policy and the human oversight wrapped around it. Catalogue that, or you never really know what you are looking at.

A register is coming, but a register of what?

This is the part I most want to get right, because policing is already moving and the worst thing this argument could do is ignore that.

A national, public-facing registry is on its way. That is genuinely good news. But look closely at what it is for. A transparency registry tells the public that a force is using a tool and how it is governed. The benefits work running alongside it, including economists, cashable savings and hours of officer time released, tells us what a tool costs and saves. Both are important, but neither addresses whether the tool changes the outcome it was meant to change (unless that outcome was to save money, perhaps). What about questions on crime prevention, investigative outcomes or harm reduction? As far as I can see, nobody is yet asking those systematically across AI-enabled crime interventions. That is the gap, and the register that is coming is exactly the thing to build it into.

It also reframes the ownership question. The governance question is largely settled: there is a body that owns the responsible-AI requirement across policing, and a funding and procurement gate that increasingly enforces it. Its current centre of gravity, reasonably enough, is productivity and back-office AI, where credibility is quickest to demonstrate. So the open question is narrower and more important than 'who governs AI?' It is: who owns the synthesis and stewardship of effect-evidence across the whole span, from productivity tools to outward-facing crime interventions, where effect and public legitimacy are most tightly bound, so that what one force learns is usable by the next?

Not more money, necessarily, but an expectation

A coherent evidence base will not appear through individual research incentives and goodwill. It needs an owner who keeps the standards current, assures the quality of what goes into the register, and turns accumulated findings into a plain 'so what?' for busy practitioners. And it needs consequences.

The tempting lever is money: make compliant reporting a condition of central funding. I am wary of that. Most police AI initiatives are not centrally funded. They are bought locally, so the funding lever reaches only part of the problem. But the machinery for the better lever already exists. Responsible use is already becoming the gate for funding, procurement and approval through the Responsible AI Checklist. The same gate could carry a clear expectation that the effects of AI interventions are evaluated and reported to a common standard, distinct from and additional to the performance and bias testing the checklist already covers, until it becomes muscle memory rather than a new bureaucracy bolted on the side.

That expectation has to be proportionate. An AI assistant for navigating HR policy should not carry the same evidential weight as a risk-assessment tool shaping safeguarding decisions, or a facial recognition system that can lead to someone being stopped in the street. The answer is a shared minimum: a small set of questions every evaluation should answer, with heavier requirements layered on for higher-risk uses.

At a minimum, I’d want every evaluation to be clear about:

what problem the tool was meant to solve
what system and version were tested
how it changed the existing process
who used it, where, and under what conditions
what baseline or comparison was used
which outcomes, costs and potential harms were measured
how human judgement interacted with the system
whether effects differed between groups or contexts
what failed, and what surprised everyone
whether the system or the environment changed mid-evaluation

Not methodological uniformity. Just enough consistency to learn.
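
To make that concrete, here is a minimal sketch of how the ten questions above might be held as one register entry, with a check for the questions an evaluation has left blank. It is an illustration only, not a proposed national schema: the field names, the `EvaluationRecord` class, the `unanswered` helper and the example entry are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EvaluationRecord:
    """Hypothetical register entry: one free-text answer per minimum question."""
    problem: str = ""                     # what problem the tool was meant to solve
    system_and_version: str = ""          # what system and version were tested
    process_change: str = ""              # how it changed the existing process
    users_and_setting: str = ""           # who used it, where, under what conditions
    baseline_or_comparison: str = ""      # what baseline or comparison was used
    outcomes_costs_harms: str = ""        # which outcomes, costs and harms were measured
    human_interaction: str = ""           # how human judgement interacted with the system
    differences_between_groups: str = ""  # whether effects differed by group or context
    failures_and_surprises: str = ""      # what failed, and what surprised everyone
    changes_mid_evaluation: str = ""      # whether system or environment changed


def unanswered(record: EvaluationRecord) -> List[str]:
    """Return the names of the minimum questions that are still blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# An invented, partly completed entry (not drawn from any real evaluation).
example = EvaluationRecord(
    problem="Hypothetical: reduce time spent redacting case files",
    system_and_version="Hypothetical redaction tool, v1.2",
    baseline_or_comparison="Manual redaction in the same unit, previous quarter",
)

print(unanswered(example))
```

The check only tests whether a question has been answered, not whether the answer is any good. That mirrors the argument here: a shared minimum makes an evaluation legible; it does not certify its quality.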

The opportunity, and the catch

AI does give evidence-based policing a real opportunity. These systems generate a detailed data footprint as they run: usage, outputs, corrections, response times, error rates and what happened next. Access can sometimes be phased between sites, or interfaces compared through A/B testing where that is lawful and appropriate. Some AI interventions are, as a result, unusually measurable, and policing could learn from them much faster than from innovations whose effects are almost invisible.

But measurable is not the same as easy. The data may be incomplete, owned by the vendor, or only loosely connected to anything that matters. An A/B test can tell you which version gets used more without telling you which produces better policing. Models update, prompts change, performance drifts, and the most important harms are often rare, delayed or hard to count. That is exactly why effect-evaluation has to be planned from the start, not bolted on at the end, and why it needs literate customers. If leaders, practitioners and commissioners cannot judge what a system does, where it fails and which vendor claims deserve testing, policing lets the people selling the tools define both the problem and the proof of success.

Why bother? Isn't this just more bureaucracy?

Very few members of the public spend their evenings thinking about intelligence triage, records management or disclosure tools. Yet those unglamorous systems shape the quality of policing profoundly, and we usually only notice when something goes wrong. Nobody campaigns for better intelligence triage until a dangerous offender is missed; nobody worries about a risk-assessment tool until a safeguarding failure; nobody thinks about disclosure support until a miscarriage of justice. If investment follows visibility, novelty and political heat, policing will keep underinvesting in the quiet tools that do real good and overinvesting in the ones that catch the eye.

The antidote is to make far better use of what policing already knows. There are deep reserves of insight in forces up and down the country, and too much of it is stuck in procurement files, internal reports and local pilots. Much of it is effectively lost. The result is a service where outcomes are shaped, in part, by postcode. For a profession that has spent three decades arguing that evidence should inform policing, that should sit uncomfortably.

So, to be clear about what I am not arguing for: not another governance checklist to be completed and filed before go-live, and not a demand that every deployment clears a randomised controlled trial first. What policing needs is a shared evidence architecture that supports learning across the whole life of a system. Five parts:

A common minimum reporting standard for the effects of AI interventions, informed by CONSORT, STROBE and CONSORT-AI but built for policing. It should be demonstrable through a short checklist, without pretending that the checklist turns the underlying work into peer-reviewed research.
Risk-tiered, use-case-specific guidance on method, so higher-risk uses face more demanding requirements and low-risk admin tools are assessed proportionately.
Effect built into the register that is coming. The national registry now in development is the right foundation, but a registry that records that a tool is used and how it is governed is not the same as one that records whether it worked. It needs to carry evaluation findings, including the negative and inconclusive ones, open where possible and controlled where necessary. The government's Algorithmic Transparency Recording Standard is a useful precedent, but transparency about a system is not evidence about its effects.
A synthesis and stewardship function that keeps the standards current, assures the register’s quality, produces usable summaries and says plainly what the evidence does and doesn’t support.
Continuing monitoring after deployment, because models, behaviour and environments change, and a pilot’s reassurance has a shelf life.

None of this is free of hard questions: commercial confidentiality, operational sensitivity, data protection, the cost of evaluation, and the limits of generalising from one force to another. But those are reasons to design carefully, not reasons to settle for fragmented learning.

Policing has set itself the goal of becoming a more intelligent customer of AI. I would gently point out that you cannot be an intelligent customer of something whose effects you never measure. Evidence-based policing has spent more than thirty years making the case that scientific method can improve policing. AI is about to increase both the pace of innovation and the chance to evaluate it. This is the moment to build the habits and the plumbing that let the public benefit from what policing learns, not just about whether its tools work, but about whether they help.

References and further reading

Algorithmic Transparency Recording Standard: guidance for public sector bodies.
CONSORT Statement.
CONSORT-AI: reporting guidelines for clinical trials involving artificial intelligence.
DECIDE-AI: reporting guideline for early-stage clinical evaluation of AI decision-support systems.
NIST AI Risk Management Framework.
STROBE Statement.
PROBabLE Futures and NPCC, Responsible AI Checklist for Policing (Oswald, Calder, Paterson-Young and Dunkwu), May 2025.
NPCC, Artificial Intelligence Playbook for Policing, 2025.
National AI centre for policing (Police.AI), launch announcement, including a public-facing AI registry.
Davies, B., Innes, M. and Dawson, A. (2018), An Evaluation of South Wales Police's Use of Automated Facial Recognition, Universities' Police Science Institute / Cardiff University.
Fussey, P. and Murray, D. (2019), Independent Report on the London Metropolitan Police Service's Trial of Live Facial Recognition Technology, University of Essex Research Repository.
National Physical Laboratory (2023), Facial Recognition Technology in Law Enforcement: Equitability Study, commissioned by the Metropolitan Police and South Wales Police.
R (Bridges) v Chief Constable of South Wales Police [2020] EWCA Civ 1058, Court of Appeal judgment.
College of Policing, Crime Reduction Toolkit, searched for live facial recognition as a catalogued intervention, June 2026.
Metropolitan Police and South Wales Police live facial recognition deployment records and published operational results.
