AI & Agents Reference
A practical cheat sheet for building production agents — local LLMs, cloud APIs, RAG, agentic patterns, and the surrounding ecosystem. Skim it, bookmark it, or work through a learning path below.
Learning paths
Reference map
The mental model behind every agent — what they are, how they loop, and where memory lives.
Self-hosted inference with Ollama — privacy, zero per-token cost, fully offline-capable.
The three frontier providers — current pricing, tradeoffs, and minimal quickstart code.
Reusable shapes for organising an agent's reasoning — pick the one that matches the task.
Retrieve from your own corpus and feed as context — the antidote to hallucination on private data.
The major frameworks — when to reach for each, and where they overlap.
When should I use what?
| If you need… | Reach for | Why |
|---|---|---|
| Cheapest at scale | Gemini 2.5 Flash · GPT-4.1 nano | ~$0.10–0.30 per 1M input. Solid for classification, extraction, simple chat. |
| Best reasoning | o3 · Claude Opus 4.7 | Multi-step logic, math, code generation that needs to actually run. |
| Longest context | Gemini 2.5 Pro (1M) · GPT-5.5 (1M) | Whole books, long PDFs, video transcripts. |
| Privacy / offline | LLaMA 3.1 via Ollama | No data leaves the box. Zero per-token cost. Needs 8GB+ RAM (8B) or 48GB+ VRAM (70B). |
| Coding agents | Claude Sonnet 4.6 · GPT-5.4 | Sonnet 4.6 edges out on long codebases; GPT-5.4 has tighter tool-use ergonomics. |
| Structured JSON output | OpenAI · Gemini | Native schema-mode is the most mature. Anthropic is closing the gap. |
| High-volume tool use | GPT-4.1 · Claude Haiku 4.5 | Best tool-use cost/quality ratio for production agents at scale. |
This reference reflects what I'm using in production agents today — not every framework, just the ones worth knowing. Each page below is short on theory and heavy on what actually matters when you ship.
What is an AI Agent?
An AI agent is a system that perceives its environment, reasons about a goal, takes actions using tools, and iterates — autonomously completing tasks that once required hand-crafted pipelines.
The core idea
A traditional program follows a fixed sequence of steps you define. An AI agent, by contrast, decides at runtime which steps to take. It uses a language model as its "brain" — the LLM reasons over the current context, decides whether to call a tool, and reacts to the tool's output before deciding what to do next.
Think of an agent as: an LLM + a loop + access to tools. The loop runs until the agent believes the task is done (or a stop condition is hit).
Agent vs. LLM call
Single LLM call:
- One prompt → one response
- No memory across calls
- No tool access
- Fixed, linear logic
- Use for: classification, generation, summarisation
Agent:
- Goal → many LLM calls in a loop
- Maintains context window across steps
- Calls tools, APIs, databases
- Adaptive, branching logic
- Use for: research, coding, task automation
Four components of every agent
Like a surgeon who decides which instrument to pick up next based on what they see — agents make moment-to-moment decisions grounded in the latest context, not a pre-written script.
Minimal agent in Python
# Minimal agent loop (concept, not framework-specific)
# Assumes tool_schemas and call_tool(name, args) are defined elsewhere in your code.
import json
from openai import OpenAI

client = OpenAI()
tools = [search_tool, calculator_tool, email_tool]  # your functions

def run_agent(goal: str) -> str:
    messages = [{"role": "user", "content": goal}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tool_schemas,  # JSON schemas for each tool
            tool_choice="auto",
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # agent decided it's done
        messages.append(msg)  # keep the assistant turn that requested the tools
        for tc in msg.tool_calls:
            # arguments arrive as a JSON string, so decode before calling
            result = call_tool(tc.function.name, json.loads(tc.function.arguments))
            # each tool result must reference the call it answers
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": str(result)})
The Agent Loop
Every agentic system, from simple scripts to complex multi-agent pipelines, runs on the same fundamental observe → think → act cycle.
Observe → Think → Act
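As a rough illustration of the cycle, here is a minimal sketch in plain Python. The LLM is replaced by a scripted stand-in (`scripted_policy`), and the `lookup` tool and all other names are hypothetical, chosen only so the loop can run without an API key.

```python
from typing import Callable

# Hypothetical tool registry: the name and behaviour are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda query: f"3 results for {query!r}",
}

def scripted_policy(observations: list[str]) -> dict:
    """Stand-in for the LLM 'think' step: act once, then finish."""
    if len(observations) == 1:  # only the goal has been observed so far
        return {"action": "lookup", "input": observations[0]}
    return {"action": "finish", "input": observations[-1]}

def run_loop(goal: str, policy: Callable[[list[str]], dict], max_iterations: int = 5) -> str:
    observations = [goal]                       # observe: start from the goal
    for _ in range(max_iterations):             # stop condition 1: iteration cap
        decision = policy(observations)         # think: choose the next action
        if decision["action"] == "finish":      # stop condition 2: explicit finish
            return decision["input"]
        tool = TOOLS[decision["action"]]
        observations.append(tool(decision["input"]))  # act, then observe the result
    return "stopped: iteration limit reached"

print(run_loop("DHL schedule for March 15", scripted_policy))
```

Swapping `scripted_policy` for a real model call leaves the loop shape unchanged: only the think step differs.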
Stop conditions
Agents need clear termination logic. Common patterns:
- Iteration cap: a hard limit on loop steps, e.g. max_iterations=20 in LangChain agents.
- Explicit finish tool: the agent calls a finish() tool with its final structured output, ensuring clean termination and parseable results.
Always set a max_iterations limit. Without it, a confused agent can spin forever, burning API credits. LangChain's default is 15 iterations.
Tool Use
Tools extend what an LLM can do beyond text generation — enabling it to read files, query databases, run code, and interact with the real world.
How tool calling works
You provide the LLM with a list of tool schemas (JSON). When the model wants to call a tool, it outputs a structured JSON object with the tool name and arguments — you intercept this, run the real function, and feed the result back.
{
  "type": "function",
  "function": {
    "name": "search_schedules",
    "description": "Search logistics schedules by date range and carrier",
    "parameters": {
      "type": "object",
      "properties": {
        "start_date": { "type": "string", "description": "ISO 8601 date" },
        "end_date": { "type": "string" },
        "carrier": { "type": "string", "enum": ["FedEx","UPS","DHL"] }
      },
      "required": ["start_date"]
    }
  }
}
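The "intercept, run, feed back" step can be sketched as a small dispatcher. This is a simplified illustration, not any provider's API: the tool-call shape (`{"name": ..., "arguments": {...}}`), the `REGISTRY` dict, and the body of `search_schedules` are assumptions, and validation covers only required keys, unknown keys, and enum values.

```python
import json

# Schema fragment mirroring the search_schedules example.
SCHEMA = {
    "name": "search_schedules",
    "parameters": {
        "properties": {
            "start_date": {"type": "string"},
            "end_date": {"type": "string"},
            "carrier": {"type": "string", "enum": ["FedEx", "UPS", "DHL"]},
        },
        "required": ["start_date"],
    },
}

def search_schedules(start_date, end_date=None, carrier=None):
    """Hypothetical implementation; a real one would query your schedule store."""
    return {"start_date": start_date, "end_date": end_date, "carrier": carrier, "matches": []}

REGISTRY = {"search_schedules": search_schedules}

def dispatch(tool_call_json: str) -> str:
    """Validate a model-emitted tool call, run it, and return a JSON string for the model."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool {name}"})
    params = SCHEMA["parameters"]
    missing = [key for key in params["required"] if key not in args]
    unknown = [key for key in args if key not in params["properties"]]
    if missing or unknown:
        return json.dumps({"error": f"missing={missing} unknown={unknown}"})
    for key, value in args.items():
        allowed = params["properties"][key].get("enum")
        if allowed and value not in allowed:
            return json.dumps({"error": f"{key} must be one of {allowed}"})
    return json.dumps(REGISTRY[name](**args))

print(dispatch('{"name": "search_schedules", "arguments": {"start_date": "2025-03-15", "carrier": "DHL"}}'))
```

Returning errors as JSON rather than raising gives the model a chance to correct its arguments on the next turn.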
Common tool categories
Memory Types
An agent without memory starts from scratch every run. Good memory design separates one-shot experiments from production-grade systems.
Four memory tiers
The best agents combine all four — structured DB for facts, vector store for fuzzy recall, episodic summaries for continuity, and the context window for active reasoning.
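One way to picture the four tiers side by side is a single container object. The sketch below is a toy under stated assumptions: the class and method names are invented for illustration, SQLite stands in for the structured DB, and word overlap stands in for vector similarity.

```python
import sqlite3
from collections import deque

class AgentMemory:
    """Toy container for the four tiers; all names here are illustrative."""

    def __init__(self, window_size: int = 6):
        self.context = deque(maxlen=window_size)  # context window: recent turns only
        self.episodes: list[str] = []             # episodic: one summary per past run
        self.documents: list[str] = []            # stand-in for a vector store
        self.db = sqlite3.connect(":memory:")     # structured facts
        self.db.execute("CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT)")

    def remember_fact(self, key: str, value: str) -> None:
        self.db.execute("INSERT OR REPLACE INTO facts VALUES (?, ?)", (key, value))

    def fact(self, key: str):
        row = self.db.execute("SELECT value FROM facts WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def recall(self, query: str, k: int = 2) -> list[str]:
        """Fuzzy recall by word overlap; a real system would compare embeddings."""
        words = set(query.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc: len(words & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:k]

memory = AgentMemory(window_size=2)
memory.remember_fact("preferred_carrier", "DHL")
memory.documents += ["DHL pickup delayed", "invoice overdue"]
print(memory.fact("preferred_carrier"), memory.recall("DHL delayed", k=1))
```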
Ollama Setup
Ollama lets you run open-source LLMs locally on your Mac, Linux, or Windows machine, with no API key, no internet connection once a model is downloaded, and full privacy.
Installation
# macOS (homebrew)
brew install ollama
# Or download the app from ollama.ai
# Then start the server:
ollama serve
# Pull a model (downloads ~4-8 GB)
ollama pull llama3.1 # Meta's LLaMA 3.1 8B
ollama pull mistral # Mistral 7B — fast & capable
ollama pull nomic-embed-text # For embeddings (768-dim)
ollama pull qwen2.5-coder # Qwen for code tasks
# Test in terminal
ollama run llama3.1 "Explain RAG in one paragraph"
Call from Python
import requests
def ollama_chat(prompt: str, model="llama3.1") -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False}
    )
    return r.json()["response"]
# Or via LangChain (much easier for agents)
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.1", temperature=0)
response = llm.invoke("Summarise this email: ...")
Run ollama list to see downloaded models.
Full Local Stack
A complete agentic pipeline that runs entirely on your machine — no external APIs, no data leaving your network.
Stack overview
Full setup
# 1. Install dependencies
pip install langchain langchain-ollama langchain-community
pip install chromadb fastapi uvicorn python-dotenv
# 2. Start Ollama
ollama serve &
ollama pull llama3.1
ollama pull nomic-embed-text
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain.tools.retriever import create_retriever_tool
# LLM + embeddings — both local via Ollama
llm = ChatOllama(model="llama3.1", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Vector store — persisted to disk
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)
# Wrap retriever as a tool
rag_tool = create_retriever_tool(
    retriever,
    name="search_schedules",
    description="Searches logistics schedules and past emails"
)
# Build agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a logistics scheduling assistant."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])
agent = create_tool_calling_agent(llm, [rag_tool], prompt)
executor = AgentExecutor(agent=agent, tools=[rag_tool], verbose=True)
result = executor.invoke({"input": "Any conflicts on March 15?"})
Model Cost Comparison
Pricing for the most commonly used frontier models. Costs are per million tokens (input / output), captured June 2026 from each provider's official pricing page.
| Model | Provider | Input (per 1M) | Output (per 1M) | Context | Best for |
|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | 200k | Flagship reasoning, hardest tasks |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200k | Coding, long-context analysis |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200k | Fast classification, triage |
| GPT-5.5 | OpenAI | $5.00 | $30.00 | 1M | Frontier general-purpose, agents |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1M | Production workhorse, balanced |
| GPT-4.1 | OpenAI | $2.00 | $8.00 | 1M | Tool use, structured output |
| GPT-4.1 nano | OpenAI | $0.10 | $0.40 | 1M | Cheapest, simple tasks at scale |
| o3 | OpenAI | $2.00 | $8.00 | 200k | Reasoning — math, code, logic |
| o4-mini | OpenAI | $0.55 | $2.20 | 200k | Cheap reasoning, high volume |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | 1M | Massive docs, multimodal |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | 1M | High-volume multimodal, cheap |
| LLaMA 3.1 8B | Local (Ollama) | Free | Free | 128k | Privacy, edge, zero per-token cost |
| LLaMA 3.1 70B | Local (Ollama) | Free | Free | 128k | Local frontier, needs ≥48GB VRAM |
| Mistral 7B | Local (Ollama) | Free | Free | 32k | Lightweight instruction following |
For a production agent handling 10,000 emails/month at ~1,000 input and ~1,000 output tokens each: Claude Sonnet 4.6 costs ~$180/mo vs. GPT-4.1 nano at ~$5/mo vs. local Ollama at $0 (hardware aside). With prompt caching on a fixed system prompt + RAG context, input costs typically drop 70–90%, which narrows the gap between frontier and smaller models on input-heavy workloads.
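The monthly figures can be reproduced with a few lines of arithmetic. The sketch assumes each email uses about 1,000 input and 1,000 output tokens (the split that yields $180 and $5), and it models prompt caching as a flat discount on all input tokens, which is a simplification.

```python
# USD per 1M tokens (input, output), copied from the pricing table.
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-4.1 nano": (0.10, 0.40),
}

def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int,
                 input_discount: float = 0.0) -> float:
    """Estimated monthly spend; input_discount approximates caching savings (0.0 to 1.0)."""
    input_price, output_price = PRICES[model]
    input_millions = requests * input_tokens / 1_000_000
    output_millions = requests * output_tokens / 1_000_000
    cost = input_millions * input_price * (1 - input_discount) + output_millions * output_price
    return round(cost, 2)

for name in PRICES:
    print(name, monthly_cost(name, 10_000, 1_000, 1_000))
# Same workload with an assumed 80% saving on input tokens:
print(monthly_cost("Claude Sonnet 4.6", 10_000, 1_000, 1_000, input_discount=0.8))
```

Because output tokens are billed at the full rate in this model, the saving from caching grows with the ratio of input to output tokens.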
OpenAI GPT-4o
The most widely used model for agentic applications. Excellent tool-calling, strong reasoning, and a massive ecosystem of integrations.
Quick start
pip install openai
from openai import OpenAI
client = OpenAI(api_key="sk-...") # or use OPENAI_API_KEY env var
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Summarise this email..."}]
)
print(response.choices[0].message.content)
Anthropic Claude
Known for long-context mastery (200k tokens), nuanced instruction-following, and strong safety properties. Built for complex enterprise tasks.
Quick start
pip install anthropic
import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[{"role": "user", "content": "Analyse this contract..."}]
)
print(message.content[0].text)
Google Gemini
Google's frontier model family: multimodal-first (text, images, audio, video), with a context window of up to 1M tokens and very competitive pricing on the Flash tier.
Quick start
pip install google-generativeai
import google.generativeai as genai
genai.configure(api_key="AIza...")
model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content("Explain vector embeddings")
print(response.text)
The ReAct Pattern
Reasoning + Acting. The LLM alternates between thinking out loud (Thought) and calling a tool (Action) — making its reasoning transparent and debuggable.
Thought → Action → Observation cycle
Goal: "Is there a scheduling conflict on March 15th for DHL?"
Thought: I need to search the schedule database for March 15 DHL entries.
Action: search_schedules({"start_date": "2025-03-15", "carrier": "DHL"})
Observation: Found 3 entries. Two overlap: 14:00 pickup & 14:30 delivery.
Thought: There's a 30-minute conflict. I should flag this for review.
Action: create_conflict_alert({"time": "14:00", "type": "overlap"})
Observation: Alert #4821 created, assigned to scheduler queue.
Thought: Task complete. I have the answer and have taken action.
Final Answer: Yes, conflict found at 14:00-14:30 on March 15 (DHL).
Alert #4821 created for review.
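If the model emits a text trace like this, the agent code has to pull the tool call out of each Action line. The parser below is a minimal sketch that assumes exactly this line format (`Action: name({json})`); real traces vary, and many frameworks use structured tool calls instead.

```python
import json
import re

# Matches lines such as: Action: search_schedules({"carrier": "DHL"})
ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$")

def parse_step(line: str):
    """Classify one line of a ReAct trace and extract the tool call, if any."""
    line = line.strip()
    match = ACTION_RE.match(line)
    if match:
        name, raw_args = match.groups()
        return ("action", name, json.loads(raw_args) if raw_args else {})
    if line.startswith("Final Answer:"):
        return ("final", line[len("Final Answer:"):].strip(), None)
    if line.startswith("Thought:"):
        return ("thought", line[len("Thought:"):].strip(), None)
    return ("other", line, None)

print(parse_step('Action: search_schedules({"start_date": "2025-03-15", "carrier": "DHL"})'))
```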
Multi-Agent Systems
Divide complex tasks across specialised agents — each one focused on a subtask — orchestrated by a supervisor agent.
Orchestrator + Worker pattern
When to use multi-agent
A single agent handles most tasks up to moderate complexity. Add multiple agents when: tasks are parallelisable (research while coding), specialisation improves quality (a dedicated critic agent reviewing a writer agent's output), or context limits are hit (each agent handles a slice of a large document).
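The parallel and critic cases can be sketched with plain functions standing in for agents. Everything here is hypothetical scaffolding: `research_worker` and `critic_worker` would each wrap their own LLM call in a real system.

```python
from concurrent.futures import ThreadPoolExecutor

def research_worker(task: str) -> str:
    """Stand-in for a specialised worker agent."""
    return f"[research] notes on {task}"

def critic_worker(draft: str) -> str:
    """Stand-in for a critic agent that reviews another agent's output."""
    return f"[critic] reviewed: {draft}"

def orchestrate(subtasks: list[str]) -> list[str]:
    """Fan subtasks out to workers in parallel, then pass each draft to the critic."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        drafts = list(pool.map(research_worker, subtasks))  # map preserves input order
    return [critic_worker(draft) for draft in drafts]

print(orchestrate(["carrier delays", "customs rules"]))
```

Threads suit this shape because agent calls are typically I/O-bound, spending most of their time waiting on API responses.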
Human-in-the-Loop
For high-stakes actions, the agent pauses and routes to a human review queue before proceeding. Essential for email sending, financial operations, and customer-facing outputs.
Confidence threshold routing
CONFIDENCE_THRESHOLD = 0.88

async def route_decision(decision: Decision) -> ActionResult:
    if decision.confidence >= CONFIDENCE_THRESHOLD:
        # High confidence: auto-execute
        return await execute_action(decision)
    else:
        # Low confidence: queue for human review
        review_id = await queue_for_review({
            "decision": decision,
            "confidence": decision.confidence,
            "reasoning": decision.reasoning,
            "suggested_action": decision.action
        })
        return ActionResult(status="pending_review", review_id=review_id)
Plan & Execute
A two-phase pattern: first generate a complete plan as a structured list of steps, then execute each step in sequence with a dedicated executor agent.
Why separate planning from execution
ReAct agents interleave planning and execution — they can go down wrong paths and recover, but may be inefficient. Plan & Execute forces the LLM to think through the full approach before acting. This is better for tasks where the steps are known upfront and backtracking is expensive (e.g. writing a long document, running a multi-step analysis).
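The two phases can be sketched with stub functions in place of the planner and executor models. The function names are assumptions for illustration, and the fixed plan simply echoes the kind of step list a planner might return.

```python
def plan(goal: str) -> list[dict]:
    """Stand-in for the planner LLM call; returns a fixed, hypothetical plan."""
    return [{"step": "Search for Q1 data"}, {"step": "Calculate averages"}]

def execute_step(step: dict, previous: list[str]) -> str:
    """Stand-in for the executor agent; it can see results of earlier steps."""
    return f"done: {step['step']} (after {len(previous)} earlier steps)"

def plan_and_execute(goal: str, planner=plan, executor=execute_step) -> list[str]:
    steps = planner(goal)        # phase 1: produce the full plan up front
    results: list[str] = []
    for step in steps:           # phase 2: execute strictly in plan order
        results.append(executor(step, results))
    return results

print(plan_and_execute("Report on Q1 performance"))
```

A common extension is to re-plan when a step fails, which this sketch leaves out.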
[{step: "Search for Q1 data"}, {step: "Calculate averages"}, …]
How RAG Works
Retrieval-Augmented Generation (RAG) gives an LLM access to a private knowledge base at query time — without fine-tuning or putting everything in the context window.
The two pipelines
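RAG is commonly described as an ingestion pipeline (embed and store chunks once) and a query pipeline (embed the question, retrieve, then prompt). The sketch below shows both end to end using a toy bag-of-words "embedding" so it runs without a model; every function name is illustrative, and a real system would call an embedding model and a vector store.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ingest(documents: list[str]) -> list[tuple[str, Counter]]:
    """Ingestion pipeline: embed each chunk once and store it."""
    return [(doc, embed(doc)) for doc in documents]

def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Query pipeline, part 1: embed the question and rank stored chunks."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question: str, context: list[str]) -> str:
    """Query pipeline, part 2: place retrieved chunks in the prompt sent to the LLM."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer based on this context:\n{joined}\nQuestion: {question}"

index = ingest(["DHL pickup at 14:00 on Friday", "Invoice 88 is overdue", "UPS delivery moved to Monday"])
print(build_prompt("DHL pickup time", retrieve("DHL pickup time", index, k=1)))
```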
RAG vs. fine-tuning
RAG:
- Knowledge updated instantly (re-ingest)
- No GPU or training cost
- Cites sources (transparent)
- Works with any LLM
- Limited by context window at retrieval
Fine-tuning:
- Knowledge baked into weights
- Expensive GPU training required
- Better for style/format changes
- Faster at inference (no retrieval step)
- Stale: requires re-training to update
Embeddings Deep Dive
Embeddings transform text into dense numerical vectors where semantic similarity equals geometric proximity. The foundation of all modern semantic search and RAG.
What is an embedding?
An embedding model converts a string of text into a list of floating-point numbers — a vector in high-dimensional space. nomic-embed-text produces 768 numbers. OpenAI's text-embedding-3-large produces 3,072.
The magic: semantically similar texts produce vectors that are geometrically close. "DHL shipment delayed" and "FedEx delivery postponed" have high cosine similarity even though they share no words.
Cosine similarity
import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 1.0 = identical meaning, 0.0 = unrelated, -1.0 = opposite
# Typical RAG threshold: 0.70–0.80 (tune for your domain)
# With OllamaEmbeddings:
from langchain_ollama import OllamaEmbeddings
emb = OllamaEmbeddings(model="nomic-embed-text")
v1 = emb.embed_query("DHL shipment delayed")      # 768 floats
v2 = emb.embed_query("FedEx delivery postponed")  # 768 floats
print(cosine_similarity(v1, v2))  # → ~0.87
Embedding model comparison
| Model | Dims | Cost | Best for |
|---|---|---|---|
| nomic-embed-text | 768 | Free (local) | General purpose, good quality/speed |
| mxbai-embed-large | 1024 | Free (local) | High quality, slower |
| text-embedding-3-small | 1536 | $0.02/1M tokens | Best OpenAI value |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Highest accuracy |
| text-embedding-004 (Google) | 768 | Free tier | Gemini ecosystem |
Chunking Strategies
How you split documents before embedding determines retrieval quality more than almost anything else. Chunk too large: noisy. Chunk too small: missing context.
Four main strategies
Semantic: available as SemanticChunker in LangChain.
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,  # best default choice
    SentenceTransformersTokenTextSplitter,
)
from langchain_experimental.text_splitter import SemanticChunker

# Recursive: respects paragraphs > sentences > words
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(document_text)

# Semantic: groups similar sentences (best quality)
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"  # or "standard_deviation"
)
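To make `chunk_size` and `chunk_overlap` concrete, here is a plain-Python sketch of fixed-size chunking with overlap. It is deliberately simpler than the recursive splitter: it counts characters and ignores paragraph and sentence boundaries, and the function name is an assumption.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size character chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this chunk reached the end of the text
            break
        start += step
    return chunks

print(chunk_text("0123456789", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, provided it is shorter than the overlap.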
Vector Databases
Specialised datastores built for billion-scale nearest-neighbour search on high-dimensional embedding vectors.
Comparison
| Database | Type | Self-hosted | Best for |
|---|---|---|---|
| ChromaDB | Local / embedded | Yes | Local dev, small datasets, privacy |
| Pinecone | Managed cloud | No | Production at scale, no ops overhead |
| Weaviate | Open-source / cloud | Yes | Hybrid search (BM25 + semantic) |
| Qdrant | Open-source / cloud | Yes | Rust-based, high performance |
| pgvector | Postgres extension | Yes | Already using Postgres, small-medium scale |
| FAISS | In-memory library | Yes | Research, large batches, no persistence needed |
For getting started: ChromaDB requires zero infrastructure and zero config. When you're ready for production scale, migrate to Pinecone or Qdrant with the same LangChain abstraction layer.
LangChain
The most widely adopted framework for building LLM applications — a composable set of abstractions for chains, agents, retrievers, memory, and tool use.
Core abstractions
prompt | llm | parser. LangChain Expression Language.
Simple RAG chain (LCEL)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
prompt = ChatPromptTemplate.from_template("""
Answer based on this context:
{context}
Question: {question}
""")
# Pipe syntax: each step feeds the next
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
answer = rag_chain.invoke("What deliveries are due on Friday?")
LangGraph
Build stateful, multi-step agent workflows as explicit graphs — nodes are processing steps, edges define flow, state persists across the entire run.
State graph pattern
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    confidence: float
    needs_review: bool

def analyse_email(state: AgentState) -> AgentState:
    # Call LLM, update state
    ...

def check_confidence(state: AgentState) -> str:
    return "human_review" if state["confidence"] < 0.88 else "send_reply"

builder = StateGraph(AgentState)
builder.add_node("analyse", analyse_email)
builder.add_node("send_reply", send_email)
builder.add_node("human_review", queue_for_review)
builder.set_entry_point("analyse")
builder.add_conditional_edges("analyse", check_confidence)
builder.add_edge("send_reply", END)
builder.add_edge("human_review", END)
graph = builder.compile()
LlamaIndex
Purpose-built for document indexing and RAG. While LangChain is general-purpose, LlamaIndex excels specifically at ingesting, indexing, and querying large document collections.
Key differences from LangChain
LangChain:
- General-purpose orchestration
- Agents, chains, tools, memory
- Larger ecosystem
- More boilerplate for RAG
LlamaIndex:
- Document-first RAG specialist
- Automatic ingestion pipelines
- Advanced indexing (RAPTOR, etc.)
- Less code for pure RAG use cases
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Load all docs from a folder
documents = SimpleDirectoryReader("./data").load_data()
# Index (embed + store) automatically
index = VectorStoreIndex.from_documents(documents)
# Query with natural language
query_engine = index.as_query_engine()
response = query_engine.query("Summarise all DHL-related emails")
CrewAI
Define multi-agent workflows using a role-based metaphor — each agent has a role, goal, backstory, and set of tools. A Crew orchestrates them through Tasks.
Agents, Tasks, Crews
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Logistics Analyst",
    goal="Find schedule conflicts in incoming emails",
    backstory="Expert in supply chain scheduling with 10 years experience",
    tools=[search_schedules, check_calendar],
    llm=llm
)
writer = Agent(
    role="Email Composer",
    goal="Draft professional responses to logistics partners",
    backstory="Specialist in B2B communications",
    llm=llm
)
analyse_task = Task(
    description="Review the email and identify any conflicts",
    agent=researcher, expected_output="Conflict report"
)
reply_task = Task(
    description="Draft a reply addressing identified conflicts",
    agent=writer, expected_output="Email draft"
)
crew = Crew(agents=[researcher, writer], tasks=[analyse_task, reply_task])
result = crew.kickoff()
Choosing a Local Model
Not all models are equal for agent tasks. Smaller models are faster and cheaper, but need to reliably follow tool-calling JSON schemas.
Model recommendations by task
| Task | Recommended | Why |
|---|---|---|
| General agent reasoning | llama3.1:8b | Best small model for tool use and instruction following |
| Coding | qwen2.5-coder:7b | Fine-tuned specifically for code generation |
| Fast classification | mistral:7b | Very fast, good enough for binary/classification tasks |
| Embeddings | nomic-embed-text | High quality 768-dim, fast inference |
| Long context (local) | llama3.1:70b | Needs 64GB+ RAM — best local quality at any context |
| Vision + text | llava:13b | Multimodal, process images + text locally |
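Since reliable tool-calling JSON is the deciding factor, one way to compare candidate models is to collect raw outputs for a fixed set of prompts and measure how many parse as valid calls. The sketch below assumes a simple `{"name": ..., "arguments": {...}}` shape and hypothetical sample outputs; it does not call any model.

```python
import json

def is_valid_tool_call(raw: str, required: list[str]) -> bool:
    """True if raw is a JSON object with a string name and all required arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    args = call.get("arguments")
    return (
        isinstance(call.get("name"), str)
        and isinstance(args, dict)
        and all(key in args for key in required)
    )

def validity_rate(outputs: list[str], required: list[str]) -> float:
    """Fraction of model outputs that are usable tool calls."""
    if not outputs:
        return 0.0
    return sum(is_valid_tool_call(out, required) for out in outputs) / len(outputs)

# Hypothetical raw outputs collected from a local model:
samples = [
    '{"name": "search_schedules", "arguments": {"start_date": "2025-03-15"}}',
    "Sure! Here is the call: search_schedules(...)",
    '{"name": "search_schedules", "arguments": {}}',
    '{"name": "search_schedules", "arguments": {',
]
print(validity_rate(samples, required=["start_date"]))
```

Running the same prompts through each candidate and comparing rates gives a rough, task-specific signal to set beside the table.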