The gap between pilot and production
Enterprise AI pilots succeed at a remarkable rate. A team of engineers, a capable LLM, and two months of effort will reliably produce a demonstration that impresses a steering committee. The model answers questions accurately, summarises documents correctly, and handles the test cases the engineers have prepared.
Then the organisation tries to put it in production. Legal wants to know where the data goes. Information security wants to know what access controls are in place. The compliance team wants an audit trail. The business wants to know what happens when the model is wrong. The project stalls, the engineers move on, and the pilot becomes a case study in AI that didn't scale.
The five guardrails
The gap between pilot and production is not a model problem. State-of-the-art LLMs are genuinely capable of the tasks most enterprises need. The gap is a governance problem. Five guardrails are consistently missing when a production deployment is blocked: data residency, access control, audit trails, hallucination handling, and model lifecycle management.
Data residency is the first blocker in regulated industries. When a hosted LLM processes a query containing personal data, that data is transmitted to the model provider's infrastructure. For financial services and healthcare organisations subject to GDPR or HIPAA, this is not straightforwardly permissible without contractual controls and, in some cases, regulatory approval. The solution is either to deploy a model within your own cloud tenancy, which requires significantly more infrastructure, or to select a provider with appropriate data processing agreements and regional data residency guarantees.
Access control is the second blocker. Most enterprise AI pilots are built without row-level security or document-level permissions. The model has access to everything in the knowledge base. When the pilot becomes a production system used by 500 employees, the question of who can query what becomes non-trivial. The architecture needs to enforce the same access controls on AI-retrieved content that it enforces on direct document access.
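As a minimal sketch of what enforcing the same access controls on retrieved content could look like, the example below filters candidate documents by the requesting user's groups before anything reaches the model. The group-based permission model and the names `Document`, `allowed_groups`, and `filter_by_permission` are assumptions made for illustration; a real system would read permissions from whatever the source document store already enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    # Hypothetical ACL, copied from the source system at indexing time.
    allowed_groups: frozenset[str]


def filter_by_permission(candidates: list[Document], user_groups: set[str]) -> list[Document]:
    """Keep only documents that share at least one group with the user."""
    return [d for d in candidates if d.allowed_groups & user_groups]


# Illustrative data only.
candidates = [
    Document("hr-001", "Salary bands for 2024", frozenset({"hr"})),
    Document("kb-042", "How to reset a password", frozenset({"hr", "engineering", "sales"})),
]

visible = filter_by_permission(candidates, user_groups={"engineering"})
print([d.doc_id for d in visible])
```

The filter runs before prompt construction, so a document the user cannot open directly never enters the model's context.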
Audit trails are the third blocker. When a financial analyst uses an AI assistant to draft a regulatory report, the compliance team needs to be able to reconstruct which sources were retrieved, which model version generated the output, and what the analyst did with the result. Without a structured logging layer that captures this at the application level, the AI system cannot be used for any purpose where the process needs to be auditable.
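A structured audit record might look something like the sketch below. The field names, the `user_action` values, and the one-JSON-object-per-line output are all assumptions for illustration; the point is only that sources, model version, and the user's action are captured together at the application level.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    user_id: str
    query: str
    retrieved_source_ids: list[str]
    model_version: str
    prompt_version: str
    output_text: str
    user_action: str  # hypothetical values: "accepted", "edited", "discarded"
    timestamp: str


def make_audit_record(user_id: str, query: str, source_ids: list[str],
                      model_version: str, prompt_version: str,
                      output_text: str, user_action: str) -> AuditRecord:
    """Build a record stamped with the current UTC time."""
    return AuditRecord(
        user_id=user_id,
        query=query,
        retrieved_source_ids=list(source_ids),
        model_version=model_version,
        prompt_version=prompt_version,
        output_text=output_text,
        user_action=user_action,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialise the record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
record = make_audit_record(
    "analyst-17", "Summarise Q3 capital ratios", ["doc-881", "doc-902"],
    "model-2024-06", "a1b2c3", "Draft summary text", "edited",
)
log_line = to_log_line(record)
print(log_line)
```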
Hallucination handling is the fourth — and most visible — blocker. An LLM that confidently states an incorrect figure in a customer-facing document creates legal and reputational risk that most enterprises are not willing to accept. The mitigation is architectural: retrieval-augmented generation with source citation, output validation against the retrieved sources, and confidence thresholds below which the model declines to answer rather than guessing. None of these are trivial to implement, but all are established patterns.
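Two of these mitigations, output validation against the retrieved sources and a confidence threshold, can be sketched roughly as follows. This is a deliberately crude illustration: it only checks that every number in the answer appears verbatim in some source, and it assumes a `confidence` score is supplied from elsewhere (for example a retrieval similarity score). The function names, the 0.7 threshold, and the decline message are hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")
DECLINE = "I don't have enough information to answer that."


def unsupported_numbers(answer: str, sources: list[str]) -> set[str]:
    """Return numbers that appear in the answer but in none of the sources."""
    source_numbers: set[str] = set()
    for source in sources:
        source_numbers.update(NUMBER.findall(source))
    return set(NUMBER.findall(answer)) - source_numbers


def finalize_answer(answer: str, sources: list[str],
                    confidence: float, threshold: float = 0.7) -> str:
    """Decline when confidence is low or the answer cites unsupported figures."""
    if confidence < threshold:
        return DECLINE
    if unsupported_numbers(answer, sources):
        return DECLINE
    return answer


# Illustrative data only.
sources = ["Revenue grew 12.5% in 2023 according to the annual report."]
print(finalize_answer("Revenue grew 12.5% in 2023.", sources, confidence=0.9))
print(finalize_answer("Revenue grew 15% in 2023.", sources, confidence=0.9))
```

A production validator would need to handle paraphrase, units, and number formatting; the sketch only shows where such a check sits in the flow.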
Model lifecycle management is the fifth, and most commonly overlooked, blocker. An LLM that is deployed to production will need to be updated as better models become available, as the enterprise's knowledge base changes, and as the model provider deprecates the version the organisation is using. Without a defined process for testing updates against a regression suite, managing prompt version control, and communicating changes to users, model updates become a source of unpredictable behaviour change in production.
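A regression gate for model updates could be sketched as below. It assumes the candidate model is wrapped in a plain function from question to answer, and that each regression case lists substrings the answer must contain; both are simplifications chosen for illustration, and `candidate_model` is a stub standing in for a real model call.

```python
from typing import Callable

# (question, substrings that must appear in the answer)
RegressionCase = tuple[str, list[str]]


def run_regression(generate: Callable[[str], str],
                   cases: list[RegressionCase]) -> list[str]:
    """Run every case and return a description of each failure."""
    failures = []
    for question, required in cases:
        answer = generate(question).lower()
        missing = [r for r in required if r.lower() not in answer]
        if missing:
            failures.append(f"{question!r}: missing {missing}")
    return failures


def safe_to_promote(generate: Callable[[str], str],
                    cases: list[RegressionCase]) -> bool:
    """A candidate is promoted only if the whole suite passes."""
    return not run_regression(generate, cases)


def candidate_model(question: str) -> str:
    # Stub standing in for a call to the new model version.
    answers = {"What is the notice period?": "The notice period is 30 days."}
    return answers.get(question, "I don't know.")


cases = [
    ("What is the notice period?", ["30 days"]),
    ("Who approves expenses?", ["line manager"]),
]

failures = run_regression(candidate_model, cases)
print(failures)
print(safe_to_promote(candidate_model, cases))
```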
The architecture question
The architecture question — whether to use a hosted API, a model deployed in your own cloud tenancy, or a self-hosted open-source model — is downstream of the governance requirements. For most enterprises, a hosted API with appropriate data processing agreements and regional data residency is the right starting point. Self-hosted models are appropriate when data residency requirements cannot be met by any provider, or when the inference volume makes self-hosting more economical than API pricing.
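The volume argument for self-hosting comes down to simple break-even arithmetic, sketched below. All figures are hypothetical, and the model assumes self-hosting is a fixed monthly cost with no per-token cost, which understates real operating expenses.

```python
def monthly_api_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Cost of serving a given monthly token volume through a hosted API."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(fixed_monthly_cost: float, price_per_million_tokens: float) -> float:
    """Monthly token volume at which API spend equals the fixed self-hosting cost."""
    return fixed_monthly_cost / price_per_million_tokens * 1_000_000


# Hypothetical figures: 20,000 per month for self-hosting, 10 per million API tokens.
threshold = break_even_tokens(fixed_monthly_cost=20_000, price_per_million_tokens=10)
print(f"Break-even volume: {threshold:,.0f} tokens per month")
```

Below the break-even volume the API is cheaper under these assumptions; above it, self-hosting starts to pay for itself.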
The retrieval layer is where most of the engineering complexity lives. It consists of a vector database containing enterprise documents, chunked and embedded with appropriate access control metadata; a retrieval pipeline that filters results by the requesting user's permissions; and a reranking step that improves relevance. This is the infrastructure that determines whether the model gives accurate, relevant answers or confidently hallucinates.
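To make the indexing side concrete, the sketch below splits a document into chunks that each inherit the parent document's access control metadata, then reranks chunks with a toy word-overlap score. Embedding and the vector database are left out entirely; the names `Chunk`, `chunk_document`, and `rerank`, the fixed word-window chunking, and the overlap scoring are all illustrative assumptions, with the scoring standing in for a real reranking model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    position: int
    text: str
    allowed_groups: frozenset[str]  # inherited from the parent document


def chunk_document(doc_id: str, text: str, allowed_groups: set[str],
                   max_words: int = 50) -> list[Chunk]:
    """Split text into fixed-size word windows, copying the ACL to each chunk."""
    words = text.split()
    chunks = []
    for position, start in enumerate(range(0, len(words), max_words)):
        window = " ".join(words[start:start + max_words])
        chunks.append(Chunk(doc_id, position, window, frozenset(allowed_groups)))
    return chunks


def rerank(query: str, chunks: list[Chunk]) -> list[Chunk]:
    """Toy reranker: order chunks by how many query words they contain."""
    query_words = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(query_words & set(c.text.lower().split())),
        reverse=True,
    )


# Illustrative data only.
chunks = chunk_document("policy-7", "one two three four five", {"finance"}, max_words=2)
print([c.text for c in chunks])
print(rerank("four", chunks)[0].text)
```

Carrying the permissions on every chunk means the query-time filter can run without looking up the parent document.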
The prompt engineering layer — the system prompts, the output format instructions, the guardrails against out-of-scope queries — is where the model's behaviour is controlled. This layer needs version control, regression testing, and a change management process. It is software engineering, not machine learning research.
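One lightweight way to treat prompts as versioned software artefacts is to derive a version identifier from the prompt text itself, so that any edit produces a new identifier that can be logged and tested against. The sketch below assumes this content-hash approach and a hypothetical `PromptVersion` class; a team could equally rely on Git commit hashes or explicit version numbers.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str

    @property
    def version_id(self) -> str:
        """Short content hash: changes whenever the template text changes."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **values: str) -> str:
        """Fill the template's named placeholders."""
        return self.template.format(**values)


# Illustrative prompts only.
v1 = PromptVersion(
    "policy-qa",
    "Answer using only the sources below.\nSources: {sources}\nQuestion: {question}",
)
v2 = PromptVersion(
    "policy-qa",
    "Answer using only the sources below. Decline if unsure.\n"
    "Sources: {sources}\nQuestion: {question}",
)

print(v1.version_id, v2.version_id)
print(v1.render(sources="[doc-1] Notice period is 30 days.", question="What is the notice period?"))
```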
A realistic timeline
A realistic timeline for production deployment of an enterprise Gen AI system — from a validated use case to a production system with appropriate guardrails — is 10–16 weeks. Pilots that demonstrate the model's capability can be built in two weeks. The remaining 8–14 weeks are governance infrastructure: data processing agreements, access control integration, audit logging, hallucination mitigation, testing, and deployment pipelines. The teams that get to production in 12 weeks are the ones that start on the governance architecture at the same time as the model prototype.