AI Governance

AI System Description & Model Card

This document describes the AI system that powers The Insurance Professor's consumer-facing responses, the constraints under which it operates, and the failure modes its operators acknowledge openly. It is published as a companion to the Corpus Accuracy Report.

Q2 2026 Model Card · Published May 2, 2026 · Updated quarterly

System summary

The Insurance Professor is a consumer education platform for insurance policyholders. It is not a licensed insurance producer, agent, broker, adjuster, or attorney. It does not sell, recommend, or place insurance products. It does not issue legal opinions. It explains regulatory and policy concepts in plain language, grounded in public-domain state regulatory content.

A user submits a question through the chat interface. The system retrieves relevant content from a curated corpus of public-domain regulatory materials, passes the user's question and the retrieved content to a large language model running locally, and delivers the model's response back to the user after the response passes through the verification layer described in section 5 (Output constraints).

The platform operates from the principle that consumers benefit from being able to ask plain-language questions about insurance and receive plain-language explanations grounded in real regulatory text — but only if those explanations cite real text, do not direct the consumer toward specific products or actions that constitute the practice of insurance or law, and openly disclose what they cannot do.

Model and infrastructure

Base model

Llama 3.3 70B Instruct (Meta, released December 2024), used in its 4-bit quantized variant (mlx-community/Llama-3.3-70B-Instruct-4bit). Open-weights model used under Meta's Llama 3.3 Community License. The 4-bit quantization reduces memory footprint at a measurable cost in raw model accuracy compared to the full-precision base; the platform's evaluation harness measures performance on the quantized variant directly so that internal benchmarks reflect what users actually receive.
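As a rough, weights-only illustration of the footprint reduction described above, the arithmetic can be sketched as follows. The parameter count (about 70.6 billion) and the effective 4.5 bits per weight (4-bit values plus per-group quantization metadata) are assumptions made for illustration, not figures taken from this document. KV cache and activation memory are excluded.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70.6e9        # assumed parameter count for a 70B-class model
BITS_16 = 16.0           # 16-bit baseline (bf16 / fp16)
BITS_4_EFFECTIVE = 4.5   # assumed: 4-bit weights plus scale/bias overhead

full_gb = weight_memory_gb(N_PARAMS, BITS_16)
quant_gb = weight_memory_gb(N_PARAMS, BITS_4_EFFECTIVE)

print(f"16-bit weights: ~{full_gb:.1f} GB")
print(f"4-bit weights:  ~{quant_gb:.1f} GB")
```

The exact overhead depends on the quantization group size and metadata format, so the 4-bit figure should be read as an estimate.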

Fine-tune

The base model has been adapted with a LoRA fine-tune trained on 80 layers (approximately 103.5M trainable parameters, 0.147% of the base model's parameter count). The LoRA adapter is fused (merged) into the base weights to produce the deployment artifact. The fine-tune is trained on a curated dataset of insurance-domain question-and-answer examples assembled from public regulatory materials and the platform's own educational content. The current production fine-tune is identified internally as fused-v9, deployed 2026-04-29. Training cycles are iterative; successive fine-tune versions replace earlier ones in production when they pass an internal evaluation harness covering accuracy, action correctness, sequential reasoning, perspective-taking, and voice compliance dimensions.
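The trainable-parameter fraction quoted above can be checked with simple arithmetic. In this sketch the base parameter count (about 70.6 billion) is an assumption, since the document gives only the adapter size and the percentage.

```python
TRAINABLE_PARAMS = 103.5e6    # from the figure quoted above
ASSUMED_BASE_PARAMS = 70.6e9  # assumption: not stated in this document

# Fraction of the base model's parameters that the LoRA adapter trains.
fraction_pct = TRAINABLE_PARAMS / ASSUMED_BASE_PARAMS * 100

# Working backwards: the base size implied by the quoted 0.147%.
implied_base = TRAINABLE_PARAMS / (0.147 / 100)

print(f"trainable fraction: {fraction_pct:.3f}%")
print(f"implied base size:  {implied_base / 1e9:.2f}B parameters")
```

Both directions agree with the quoted figures to within rounding.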

Inference framework

MLX-LM (the Apple Silicon-native inference framework). The fine-tuned model is served through an MLX server running on the same host as the rest of the platform.

Embedding models — disclosure of a current mismatch

The platform's corpus chunks were originally indexed using BGE-large-en-v1.5 (Beijing Academy of Artificial Intelligence), a 1024-dimensional embedding model. Runtime user queries are currently embedded using mxbai-embed-large (Mixedbread AI), also a 1024-dimensional embedding model, served via Ollama on the same host. Both models produce vectors of the same dimensionality, which permits the retrieval system to operate; however, they are different models, and the platform's operators acknowledge that retrieval relevance is degraded compared to a deployment in which the same model is used on both sides. Reconciling the index-side and query-side models is on the operational backlog and will be addressed in a future quarterly cycle.
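Why matching dimensionality is not enough can be illustrated with a toy example. The two "models" below are hypothetical stand-ins built from salted hashes, not the real embedding models. They deliberately exaggerate the effect to show the mechanism: a vector only has meaning in the coordinate system of the model that produced it.

```python
import hashlib
import math

DIM = 16  # toy dimensionality; the real models use 1024


def toy_embed(text: str, model_name: str) -> list[float]:
    """Hypothetical embedder: hashes each token into one signed coordinate.

    The hash is salted with the model name, so each "model" assigns tokens
    to its own, unrelated coordinates.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(f"{model_name}:{token}".encode()).digest()
        index = digest[0] % DIM
        sign = 1.0 if digest[1] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


text = "is flood damage covered here"
indexed = toy_embed(text, "index-model")           # how the corpus was indexed
same_model_query = toy_embed(text, "index-model")  # reconciled configuration
other_model_query = toy_embed(text, "query-model") # mismatched configuration

# Identical text, same model: similarity is 1.0 up to floating-point error.
print(cosine(indexed, same_model_query))
# Identical text, different model: the score no longer reflects meaning.
print(cosine(indexed, other_model_query))
```

Both query vectors have the same length as the indexed vector, so a similarity search runs without error in either configuration. Only the first score is a meaningful measure of relevance.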

Retrieval system

ChromaDB, an open-source vector database. The platform's production deployment as of this document holds 10,956 corpus chunks across 27 states. All chunks carry provenance metadata, although 62 of them currently lack a scrape timestamp (see Staleness under Known failure modes). The Corpus Accuracy Report describes the corpus contents in detail.

Inference hardware

Apple Mac Studio with M3 Ultra chip: 28-core CPU, 60-core GPU, 32-core Neural Engine, 256 GB unified memory, 4 TB SSD storage. The deployed model (the 4-bit quantized 70B-parameter base model with the LoRA fine-tune fused into it) runs on this single machine, which is in the founder's possession. Inference does not pass through any third-party AI API. No user question, no model response, no portion of any user's conversation leaves the machine that runs the model.

Hosting

The frontend, backend, vector database, Postgres, Redis, observability layer, and reverse proxy operate as Docker containers on the same machine that runs the model. The platform is reachable on the public internet via a Cloudflare Tunnel — a persistent outbound connection from the platform's machine to Cloudflare's edge network. No inbound port is exposed on the host. SSL termination and rate limiting occur at the platform's nginx container; Cloudflare's role is to route incoming public requests through the tunnel to nginx.

What the model is and is not trained on

The fine-tune dataset includes

Public-domain state insurance statutes, administrative rules, and consumer guides; the platform's own educational explainer content; and question-answer pairs constructed by the founder from publicly available reference materials drawn from the insurance industry.

All source materials used to construct training pairs were publicly available at the time they were accessed. The source materials served as input from which question-answer training pairs were constructed in the founder's own words; the original source materials are not present in the fine-tune dataset in their original form. The deployed model is designed not to reproduce source materials, and the platform's voice and citation verification layers (section 5, Output constraints) further constrain outputs. The platform's operators intend to test empirically whether the deployed model reproduces source material and will document the results in a future quarterly cycle.

Legal basis for use of publicly available reference materials

The platform relies on the transformative-use doctrine of fair use as that doctrine has been applied to AI training in two recent federal decisions: Bartz v. Anthropic PBC (N.D. Cal. June 23, 2025, Judge Alsup) and Kadrey v. Meta Platforms, Inc. (N.D. Cal. June 25, 2025, Judge Chhabria). Both courts held, on the facts before them, that training large language models on copyrighted reference materials was “highly transformative” and constituted fair use, with Judge Alsup describing the use as “quintessentially transformative.”

The platform's use of publicly available reference materials in fine-tuning is consistent with the factual posture that produced fair-use findings in those cases:

The materials are used to train the model to perform a function (insurance education) different in character from the purpose of the source materials.
The model is designed not to reproduce any meaningful portion of the source materials in outputs.
The platform does not flood any market with substitutable secondary works — the platform produces educational responses to consumer questions, not derivative versions of the source materials.

The platform also does not maintain a permanent “shadow library” of source materials beyond what is required for the training process itself.

The law in this area is unsettled.

Both Bartz and Kadrey are summary judgment decisions in a single federal district; both are likely to be appealed; the Supreme Court has not addressed AI training under fair use; and Judge Chhabria's opinion specifically noted that “in most cases” plaintiffs presenting evidence of infringing outputs or market dilution may prevail. The platform's operators are in the process of discussing the legal posture of the fine-tuning process with counsel, and will adapt practices as the legal landscape clarifies and as counsel advises.

The fine-tune dataset does not include

Customer messages, customer-side conversation transcripts, customer-uploaded documents, internal company communications, attorney-client communications, paywalled content (case law databases, agent-licensing exam materials, proprietary industry research), trade-secret content, or content that would be unethical or unlawful to use.

The base model's training data

The base model's training data is whatever Meta used to train Llama 3.3 70B. Meta has published partial information about that training. The platform inherits whatever knowledge and whatever biases the base model carries from that training. The fine-tune narrows and corrects that knowledge for the platform's domain but does not replace it.

Customer interactions are not used to train any model

Conversations are stored in the platform's database for the purpose of providing chat history to authenticated users; they are not exported, transmitted to third parties, or used to update any model. If a customer requests deletion, their conversation history is deleted from the database.

No fine-tune iteration of the model is or will be trained on customer conversations without explicit, opt-in consent from the customers whose conversations would be used and a clear public disclosure of that policy change.

What the system does, in operational detail

A consumer submits a question. The platform's flow:

1. The user's question is embedded using mxbai-embed-large (via Ollama on the local host), producing a 1024-dimensional query vector.
2. The query vector is passed to ChromaDB along with a metadata filter for the user's state. ChromaDB returns the 10 most similar corpus chunks for that state along with their full metadata: state, content category, citation reference, source URL, scrape date, full original text.
3. A prompt is constructed for the language model containing the user's question, the retrieved chunks, and a system instruction that defines the platform's voice constraints (no UPI/UPL-violating phrasing, three-tier confidence framing, citation requirement).
4. The fine-tuned language model produces a response.
5. The response passes through the verification layer (section 5, Output constraints).
6. The verified response is delivered to the user along with its confidence tier and any citation footers.
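The retrieval and prompt-construction steps of this flow can be sketched in plain Python. This is a minimal in-memory stand-in, not the platform's implementation: the `Chunk` fields, the cosine metric, the two-dimensional toy vectors, the prompt layout, and the section numbers are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    state: str
    citation: str            # hypothetical section number
    embedding: list[float]   # toy 2-d vector; the real vectors are 1024-d


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], chunks: list[Chunk],
             state: str, k: int = 10) -> list[Chunk]:
    """Apply the state metadata filter, then return the k most similar chunks."""
    candidates = [c for c in chunks if c.state == state]
    candidates.sort(key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return candidates[:k]


def build_prompt(question: str, retrieved: list[Chunk],
                 system_instruction: str) -> str:
    """Combine the system instruction, retrieved context, and user question."""
    context = "\n\n".join(f"[{c.citation}] {c.text}" for c in retrieved)
    return f"{system_instruction}\n\nContext:\n{context}\n\nQuestion: {question}"


corpus = [
    Chunk("Example text about claim deadlines.", "TX", "Sec. 1.01", [1.0, 0.0]),
    Chunk("Example text about cancellations.", "TX", "Sec. 2.02", [0.0, 1.0]),
    Chunk("Example text from another state.", "CA", "Sec. 9.99", [1.0, 0.0]),
]

top = retrieve([0.9, 0.1], corpus, state="TX", k=10)
prompt = build_prompt(
    "How long do I have to file a claim?",
    top,
    "Answer only from the context and cite the section used.",
)
print(prompt)
```

The state filter runs before ranking, so a chunk from another state is never placed in the prompt, however similar its vector is to the query.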

This flow is referred to internally as RAG (retrieval-augmented generation). It exists specifically to ground the model's outputs in real regulatory text, rather than relying on the model's own potentially stale or fabricated knowledge of state insurance law.

Output constraints

The platform applies three independent constraints to model outputs before delivery. These mechanisms are described in detail in the Corpus Accuracy Report; this section summarizes them.

Citation verification

Every regulatory section number cited in a response is verified against the citation metadata of the corpus chunks retrieved during generation. If the section number cannot be matched, or if the matched chunk's scrape date is older than 365 days, the citation is removed, the response's confidence tier is downgraded, and a footer directing the user to their state DOI is appended. This catches model fabrication of citation numbers and the use of stale corpus content. It does not catch citations that reference real, current statutes that the response interprets incorrectly.
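The rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the platform's code: the "§ number" citation format, the footer wording, the one-step downgrade, and the section numbers in the example are all hypothetical. A chunk with no scrape date is treated as stale.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

MAX_AGE = timedelta(days=365)
TIERS = ("HIGH", "MEDIUM", "LOW")
# Hypothetical footer wording.
DOI_FOOTER = "For authoritative information, contact your state Department of Insurance."
# Assumed citation format: "§" followed by a dotted or hyphenated number.
SECTION_PATTERN = re.compile(r"§\s*(\d+(?:[.\-]\d+)*)")


@dataclass
class RetrievedChunk:
    citation: str                 # section number from chunk metadata
    scrape_date: Optional[date]   # None when the timestamp is missing


def downgrade(tier: str) -> str:
    """Move one step down the tier list (assumed step size), stopping at LOW."""
    return TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]


def verify_citations(response: str, chunks: list[RetrievedChunk],
                     tier: str, today: date) -> tuple[str, str]:
    by_section = {c.citation: c for c in chunks}
    removed = False

    def check(match: re.Match) -> str:
        nonlocal removed
        chunk = by_section.get(match.group(1))
        fresh = (chunk is not None and chunk.scrape_date is not None
                 and today - chunk.scrape_date <= MAX_AGE)
        if fresh:
            return match.group(0)   # keep a verified, current citation
        removed = True
        return ""                   # drop unmatched or stale citations

    cleaned = SECTION_PATTERN.sub(check, response)
    if removed:
        tier = downgrade(tier)
        cleaned = cleaned.rstrip() + "\n\n" + DOI_FOOTER
    return cleaned, tier


today = date(2026, 5, 2)
retrieved = [
    RetrievedChunk("12.345", date(2026, 1, 15)),  # matched and current
    RetrievedChunk("67.890", date(2024, 1, 1)),   # matched but stale
]
draft = "Rule A applies (§ 12.345). Rule B applies (§ 67.890). Rule C applies (§ 99.999)."
text, new_tier = verify_citations(draft, retrieved, "HIGH", today)
print(new_tier)
print(text)
```

As the section notes, a check of this kind only confirms that a cited section exists in the retrieved metadata and is current. It cannot tell whether the response interprets that section correctly.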

Voice compliance

Every response is scanned against a set of forbidden phrases that constitute risk under unauthorized practice of insurance (UPI) or unauthorized practice of law (UPL) statutes. Examples include directives to buy, accept, or reject specific policies; characterizations of a legal position as strong or weak; guarantees of outcomes. Responses containing absolute-forbidden phrases are blocked before delivery and the user receives a controlled fallback response.
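A phrase scan of this kind can be sketched in a few lines. The phrase list and the fallback wording below are hypothetical examples modeled on the categories named above. The platform's actual list is not reproduced in this document.

```python
# Hypothetical examples of absolute-forbidden phrases.
FORBIDDEN_PHRASES = (
    "you should buy",
    "you should accept",
    "you should reject",
    "you have a strong case",
    "you have a weak case",
    "guaranteed to",
)

# Hypothetical fallback wording.
FALLBACK_RESPONSE = (
    "I can explain how this works in general terms, but I cannot tell you "
    "what to do. Please contact your state Department of Insurance."
)


def apply_voice_gate(response: str) -> tuple[str, bool]:
    """Return (text_to_deliver, was_blocked)."""
    # Normalize case and whitespace so line breaks do not hide a phrase.
    normalized = " ".join(response.lower().split())
    blocked = any(phrase in normalized for phrase in FORBIDDEN_PHRASES)
    return (FALLBACK_RESPONSE, True) if blocked else (response, False)


ok_text, ok_blocked = apply_voice_gate(
    "A deductible is the amount you pay before coverage applies."
)
bad_text, bad_blocked = apply_voice_gate(
    "You should  accept\nthe settlement offer."
)
print(ok_blocked, bad_blocked)
```

Substring matching is simple and predictable, but it only catches the phrasings that were anticipated. A paraphrase outside the list passes through unchanged.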

Three-tier confidence display

Every response carries one of three confidence indicators tied to the strength of corpus retrieval: HIGH (regulatory source verified), MEDIUM (based on general state guidance, claims qualified), or LOW (limited regulatory data, user directed to state DOI). The LOW display is the most consequential element of the three. It exists so that the platform never substitutes generic model inference for missing corpus content.
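One way to picture the three tiers is as a small decision function. The tier labels and descriptions come from the paragraph above. The decision inputs and rules in `assign_tier` are assumptions made for illustration, since this document does not specify how retrieval strength is measured.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "regulatory source verified"
    MEDIUM = "based on general state guidance, claims qualified"
    LOW = "limited regulatory data, user directed to state DOI"


def assign_tier(chunks_retrieved: int, verified_citations: int) -> Confidence:
    """Hypothetical rule: no corpus content means LOW, never a model-only answer."""
    if chunks_retrieved == 0:
        return Confidence.LOW
    if verified_citations > 0:
        return Confidence.HIGH
    return Confidence.MEDIUM


print(assign_tier(chunks_retrieved=10, verified_citations=2).name)
print(assign_tier(chunks_retrieved=10, verified_citations=0).name)
print(assign_tier(chunks_retrieved=0, verified_citations=0).name)
```

Checking for missing corpus content first reflects the stated purpose of the LOW tier: where there is nothing to retrieve, the response is labeled LOW instead of presenting model inference as verified.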

Known failure modes

The platform's operators acknowledge the following failure modes openly:

Misinterpretation

The verification layer catches fabricated citations and stale citations. It does not catch citations where the section number is real and current but the response's interpretation of the statute is wrong. This failure mode is mitigated by the corpus quality work, by the voice compliance system, and by consistent direction of users to their state DOI for authoritative interpretation. It is not eliminated.

Embedding model mismatch

As disclosed in section 2 (Model and infrastructure), the corpus is indexed with one embedding model and queries are embedded with another. Both produce 1024-dimensional vectors, and the retrieval system returns results, but relevance is degraded compared to using a single consistent model on both sides. Responses still pass through the same verification mechanisms, but the model may be working from less relevant context than it would under a reconciled embedding configuration. This is an active item on the operational backlog.

Coverage gaps

The corpus does not yet contain comprehensive coverage for all states. Of fifty states plus territories, 27 currently have active scrapers. Responses for states without active scrapers fall back to the LOW confidence tier, which directs the user to their state DOI. The platform does not refuse to engage with such questions; it provides what it can with appropriate hedging. This is a policy choice and a known limitation.

Staleness

Some corpus chunks lack a scrape timestamp. As of this document, 62 chunks (0.6% of the corpus) have no provenance timestamp. The verification layer treats these chunks as stale and removes any citation that depends on them. Backfilling these timestamps is on the operational backlog.

Quantization tradeoff

The platform serves the 4-bit quantized variant of Llama 3.3 70B, which reduces the model's memory footprint at a measurable cost in raw accuracy compared to the full-precision base. The accuracy impact of low-bit quantization on large language models has been studied directly in the academic literature: Dettmers, Lewis, Belkada, and Zettlemoyer (NeurIPS 2022, “LLM.int8()”, arXiv:2208.07339) demonstrated that 8-bit quantization of LLMs at 175B-parameter scale could be performed without measurable performance degradation; Frantar, Ashkboos, Hoefler, and Alistarh (ICLR 2023, “GPTQ”, arXiv:2210.17323) showed that GPT-class models can be quantized to 3-4 bits per weight with what the authors describe as “negligible accuracy degradation relative to the uncompressed baseline.” Subsequent work has converged on 4-bit quantization as a practical accuracy-efficiency sweet spot.

The platform's operators chose 4-bit quantization rather than full-precision deployment because doing so allows a 70-billion-parameter base model to run inference on a single 256GB Mac Studio without external API dependency. The platform's evaluation harness scores the deployed fused model directly; internal benchmarks reflect the model variant users actually receive.

Model behavior outside the prompted domain

The base model is a general-purpose language model. It can be coaxed off-topic by sufficiently determined users. The platform mitigates this through prompt engineering that strongly constrains the model to insurance-education contexts, but a sufficiently long or adversarial conversation could elicit content outside the platform's intended scope.

Bias inherited from the base model

Llama 3.3 70B carries whatever biases Meta's training process introduced or failed to address. The platform's fine-tune focuses on insurance-domain accuracy rather than bias remediation. Where regulatory content would clearly produce different outputs for different demographic groups, the platform's responses are constrained by the actual regulatory content and the voice gate; they do not invoke the base model's general dispositions on group differences. This constraint is not absolute, and the operators acknowledge that the system can still produce biased outputs.

Robots.txt and access constraints

Some state DOI websites block automated access. Where alternative-source content is insufficient, the relevant state's coverage is shallower than it would otherwise be. This is documented per state in the Corpus Accuracy Report.

Operational practices

Local inference.

The model runs on a single machine in the founder's possession. No third-party AI API receives user inputs or model outputs.

Quarterly publication.

The Corpus Accuracy Report and this companion document are published quarterly. Each quarterly cycle includes a content-change events section documenting corrections, deletions, and infrastructure changes that affected the platform's substantive integrity during the period.

Open methodology.

The credibility of an AI-mediated educational service depends on the willingness of its operators to disclose what the system actually is. This document is part of that disclosure. Anyone — consumer, regulator, attorney, journalist — can read what the platform is built on, what it constrains, and where it fails.

Direct engagement.

State DOI staff, attorneys, and consumers who identify any concern or inaccuracy are invited to contact [email protected]. The platform's operator commits to responding to and, where appropriate, publicly acknowledging substantive concerns.

The Insurance Professor is an educational service. Not licensed insurance advice. Not legal advice. Not financial advice. Users who require advice tailored to their specific facts and circumstances should consult their state Department of Insurance, a licensed insurance producer in their state, or a licensed attorney.
