AI Resources — Leon Nicolas

🗺️

How to use this

Six sections in the sidebar, each with 3–4 short pages (~3 min read). Pick a learning path below, jump to a section card, or search topics from the sidebar.

Learning paths

For beginners

Build your first agent in a day

What is an Agent? → The Loop → Tool Use → First API call

RAG-first

Ground agents in your own data

How RAG works → Chunking → Embeddings → Vector DBs

Self-hosted

Air-gap, zero per-token cost

Ollama → Pick a model → Full local stack → ReAct loop

Reference map

Foundations

The mental model behind every agent — what they are, how they loop, and where memory lives.

🤖 What is an AI Agent? · 🔁 The Agent Loop · 🔧 Tool Use · 🧠 Memory Types

Running Locally

Self-hosted inference with Ollama — privacy, zero per-token cost, fully offline-capable.

🦙 Ollama Setup · 🗂️ Full Local Stack · 📦 Choosing a Model

Cloud APIs

The three frontier providers — current pricing, tradeoffs, and minimal quickstart code.

⚖️ Cost Comparison · ⚡ OpenAI · 🎭 Anthropic Claude · ✨ Google Gemini

Agentic Patterns

Reusable shapes for organising an agent's reasoning — pick the one that matches the task.

⚗️ ReAct (Reason + Act) · 📋 Plan & Execute · 🕸️ Multi-Agent · 👤 Human-in-the-Loop

RAG & Embeddings

Retrieve from your own corpus and feed as context — the antidote to hallucination on private data.

📚 How RAG Works · 🔢 Embeddings Deep Dive · ✂️ Chunking Strategies · 🗃️ Vector Databases

Ecosystem

The major frameworks — when to reach for each, and where they overlap.

⛓️ LangChain · 🌐 LangGraph · 🦙 LlamaIndex · 🚢 CrewAI

When should I use what?

If you need…	Reach for	Why
Cheapest at scale	Gemini 2.5 Flash · GPT-4.1 nano	~$0.10–0.30 per 1M input tokens. Solid for classification, extraction, simple chat.
Best reasoning	o3 · Claude Opus 4.7	Multi-step logic, math, code generation that needs to actually run.
Longest context	Gemini 2.5 Pro (1M) · GPT-5.5 (1M)	Whole books, long PDFs, video transcripts.
Privacy / offline	LLaMA 3.1 via Ollama	No data leaves the box. Zero per-token cost. Needs 8GB+ RAM (8B) or 48GB+ VRAM (70B).
Coding agents	Claude Sonnet 4.6 · GPT-5.4	Sonnet 4.6 edges out on long codebases; GPT-5.4 has tighter tool-use ergonomics.
Structured JSON output	OpenAI · Gemini	Native schema-mode is the most mature. Anthropic is closing the gap.
High-volume tool use	GPT-4.1 · Claude Haiku 4.5	Best tool-use cost/quality ratio for production agents at scale.

This reference reflects what I'm using in production agents today — not every framework, just the ones worth knowing. Each page below is short on theory and heavy on what actually matters when you ship.

The core idea

A traditional program follows a fixed sequence of steps you define. An AI agent, by contrast, decides at runtime which steps to take. It uses a language model as its "brain" — the LLM reasons over the current context, decides whether to call a tool, and reacts to the tool's output before deciding what to do next.

Think of an agent as: an LLM + a loop + access to tools. The loop runs until the agent believes the task is done (or a stop condition is hit).

💡

Key insight

The fundamental shift: you define the goal, not the procedure. The agent figures out the steps itself by reasoning at each iteration.

Agent vs. LLM call

🗣️ Single LLM Call

One prompt → one response
No memory across calls
No tool access
Fixed, linear logic
Use for: classification, generation, summarisation

🤖 AI Agent

Goal → many LLM calls in a loop
Maintains context window across steps
Calls tools, APIs, databases
Adaptive, branching logic
Use for: research, coding, task automation

Four components of every agent

🧠

LLM Brain

The reasoning engine. Reads context, decides next action, interprets tool outputs. Usually GPT-4o, Claude 3.5, Gemini 2.0 or a local LLaMA model.

🔧

Tools

Functions the LLM can call — web search, code execution, database queries, email sending, API calls. Each has a JSON schema the LLM uses to call it.

🧩

Memory

Short-term (context window), long-term (vector DB), episodic (past run summaries). Determines what the agent "knows" about the current task.

🔁

Loop / Orchestration

The controller that runs: observe → think → act → observe again. Can be a simple while-loop or a complex graph (LangGraph, Prefect).

Like a surgeon who decides which instrument to pick up next based on what they see — agents make moment-to-moment decisions grounded in the latest context, not a pre-written script.

Minimal agent in Python

Python

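# Minimal agent loop (concept, not framework-specific)
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholders: your own functions and one JSON schema per function
tools = {"search": search_tool, "calculator": calculator_tool, "email": email_tool}
tool_schemas = [search_schema, calculator_schema, email_schema]

def call_tool(name: str, arguments: str) -> str:
    # The model sends arguments as a JSON string, so decode before calling
    return str(tools[name](**json.loads(arguments)))

def run_agent(goal: str, max_iterations: int = 20) -> str:
    messages = [{"role": "user", "content": goal}]

    for _ in range(max_iterations):    # hard cap instead of an unbounded while-loop
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tool_schemas,        # JSON schemas for each tool
            tool_choice="auto"
        )
        msg = response.choices[0].message

        if not msg.tool_calls:
            return msg.content         # agent decided it's done

        messages.append(msg)           # keep the assistant's tool-call turn in context
        for tc in msg.tool_calls:
            result = call_tool(tc.function.name, tc.function.arguments)
            # tool_call_id ties each result to the call that requested it
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

    return "Stopped: max_iterations reached"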

Observe → Think → Act

Core Agent Loop

👁️

Observe

Gather context

→

🧠

Think

LLM reasons

→

⚡

Act

Call tool or respond

→

📥

Update

Add result to context

↩ Loop back to Observe until goal achieved or stop condition met

Stop conditions

Agents need clear termination logic. Common patterns:

1

LLM decides it's done

The model returns a final answer (no tool call). This is the most common pattern in ReAct-style agents.

2

Max iterations reached

Hard cap on the number of tool calls. Essential safety net — set via max_iterations=20 in LangChain agents.

3

Structured output produced

Agent is instructed to call a special finish() tool with its final structured output, ensuring clean termination and parseable results.

4

External signal / HITL approval

Agent pauses and waits for human confirmation before proceeding. Used in high-stakes pipelines (finance, emails to clients).

⚠️

Infinite loop risk

Always set a max_iterations limit. Without it, a confused agent can spin forever, burning API credits. LangChain's default is 15 iterations.
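
The four stop conditions above can be sketched in one loop. This is a minimal illustration that runs without an LLM: the `Step` dataclass, the `policy` callable standing in for the model, and the `ask_human` tool name are all hypothetical names chosen for this example, not part of any framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Step:
    """One model decision: either a tool call or a final answer."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None

def run_loop(policy: Callable[[list], Step], tools: dict, max_iterations: int = 15) -> dict:
    context: list = []
    for i in range(max_iterations):
        step = policy(context)
        if step.answer is not None:        # 1. model returned a final answer
            return {"status": "done", "output": step.answer, "iterations": i + 1}
        if step.tool == "finish":          # 3. structured output via finish()
            return {"status": "finished", "output": step.args, "iterations": i + 1}
        if step.tool == "ask_human":       # 4. pause and wait for approval
            return {"status": "pending_review", "output": step.args, "iterations": i + 1}
        result = tools[step.tool](**step.args)
        context.append((step.tool, result))
    # 2. hard cap reached: stop instead of spinning forever
    return {"status": "max_iterations", "output": None, "iterations": max_iterations}

def scripted_policy(context: list) -> Step:
    # Stand-in for the LLM: call one tool, then finish with structured output
    if not context:
        return Step(tool="add", args={"a": 2, "b": 3})
    return Step(tool="finish", args={"total": context[-1][1]})

def confused_policy(context: list) -> Step:
    # Stand-in for a confused agent that never decides it is done
    return Step(tool="add", args={"a": 1, "b": 1})

tools = {"add": lambda a, b: a + b}
ok = run_loop(scripted_policy, tools)
capped = run_loop(confused_policy, tools, max_iterations=5)
print(ok["status"], capped["status"])
```

Returning a status alongside the output lets the caller tell a clean finish apart from a capped run, which matters when the result feeds another system.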

How tool calling works

You provide the LLM with a list of tool schemas (JSON). When the model wants to call a tool, it outputs a structured JSON object with the tool name and arguments — you intercept this, run the real function, and feed the result back.

JSON — Tool Schema

{
  "type": "function",
  "function": {
    "name": "search_schedules",
    "description": "Search logistics schedules by date range and carrier",
    "parameters": {
      "type": "object",
      "properties": {
        "start_date": { "type": "string", "description": "ISO 8601 date" },
        "end_date":   { "type": "string" },
        "carrier":    { "type": "string", "enum": ["FedEx","UPS","DHL"] }
      },
      "required": ["start_date"]
    }
  }
}
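
The "intercept, run, feed back" step can be sketched as a small dispatcher. This is a simplified illustration under stated assumptions: the incoming call is taken to be a JSON object of the form {"name": ..., "arguments": {...}} (real providers wrap this differently), the `search_schedules` function is a hypothetical stand-in, and `validate` only checks required keys, unknown keys, and enums rather than the full JSON Schema.

```python
import json

SCHEMA = {
    "type": "function",
    "function": {
        "name": "search_schedules",
        "description": "Search logistics schedules by date range and carrier",
        "parameters": {
            "type": "object",
            "properties": {
                "start_date": {"type": "string", "description": "ISO 8601 date"},
                "end_date": {"type": "string"},
                "carrier": {"type": "string", "enum": ["FedEx", "UPS", "DHL"]},
            },
            "required": ["start_date"],
        },
    },
}

def search_schedules(start_date, end_date=None, carrier=None):
    # Hypothetical stand-in for a real schedule lookup
    return {"start_date": start_date, "end_date": end_date, "carrier": carrier, "entries": []}

REGISTRY = {"search_schedules": search_schedules}

def validate(args: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    params = schema["function"]["parameters"]
    errors = [f"missing required argument: {k}" for k in params["required"] if k not in args]
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            errors.append(f"unknown argument: {key}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key} must be one of {spec['enum']}")
    return errors

def dispatch(tool_call_json: str) -> str:
    """Parse the model's tool call, validate it, run the function, return JSON text."""
    call = json.loads(tool_call_json)
    errors = validate(call["arguments"], SCHEMA)
    if errors:
        # Feed the error back to the model so it can correct its call
        return json.dumps({"error": errors})
    return json.dumps(REGISTRY[call["name"]](**call["arguments"]))

good = dispatch('{"name": "search_schedules", "arguments": {"start_date": "2025-03-15", "carrier": "DHL"}}')
bad = dispatch('{"name": "search_schedules", "arguments": {"carrier": "USPS"}}')
print(good)
print(bad)
```

Returning validation errors as a tool result, rather than raising, gives the model a chance to retry with corrected arguments on the next iteration.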

Common tool categories

🌐

Web Search

Tavily, SerpAPI, or Bing Search. Gives the agent access to up-to-date information beyond its training cut-off.

💻

Code Execution

Run Python in a sandbox (E2B, Docker). Lets agents do maths, data analysis, chart generation programmatically.

🗃️

Database

Read/write SQLite, Postgres, ChromaDB. Essential for agents that need structured data or persistent memory.

📧

Email / Comms

SMTP, Gmail API, Slack SDK. Used in autonomous pipelines where the agent must respond to or send communications.

📁

File System

Read/write local files. Useful for agents that process documents, generate reports, or work with codebases.

🔗

External APIs

Any REST/GraphQL API the agent needs. Weather, stocks, maps, CRMs, ERPs — just wrap the HTTP call.

Four memory tiers

💬

In-Context (Working)

The active conversation window. Fast and needs no extra storage, but bounded: 128k tokens ≈ 100k words. Lost when the session ends.

🗃️

Semantic (Vector)

Past knowledge stored as embeddings in ChromaDB, Pinecone, Weaviate. Retrieved by cosine similarity. Survives across runs.

📋

Episodic (Summaries)

Compressed summaries of past sessions injected into new context windows. Lets the agent "remember" without storing raw transcripts.

🔢

Structured (DB)

Hard facts in SQL / key-value stores. Use for: user preferences, completed tasks, schedules, confirmed bookings.

The best agents combine all four — structured DB for facts, vector store for fuzzy recall, episodic summaries for continuity, and the context window for active reasoning.
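
One way to picture the combination is a function that assembles the prompt from all four tiers. The sketch below is illustrative only: the table name, the sample notes, and the `recall` helper are hypothetical, and word overlap stands in for real embedding similarity so the example runs with the standard library alone.

```python
import sqlite3

# Structured tier: hard facts in SQL
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prefs (user TEXT, key TEXT, value TEXT)")
db.execute("INSERT INTO prefs VALUES ('ana', 'carrier', 'DHL')")

# Episodic tier: compressed summaries of past sessions
episodes = ["Last session: rescheduled the March 14 pickup to 09:00."]

# Semantic tier: stand-in for a vector store
notes = [
    "DHL pickups need 24h notice",
    "Warehouse B closes at 18:00",
    "UPS invoices are paid monthly",
]

def recall(query: str, k: int = 1) -> list:
    # Toy similarity: count shared words (a real system would compare embeddings)
    q = set(query.lower().split())
    ranked = sorted(notes, key=lambda n: len(q & set(n.lower().split())), reverse=True)
    return ranked[:k]

def build_context(user: str, query: str, history: list, max_turns: int = 4) -> str:
    facts = db.execute("SELECT key, value FROM prefs WHERE user = ?", (user,)).fetchall()
    parts = [
        "FACTS: " + "; ".join(f"{k}={v}" for k, v in facts),
        "PREVIOUSLY: " + " ".join(episodes),
        "RECALLED: " + " | ".join(recall(query)),
        # Working tier: only the most recent turns fit in the window
        "CONVERSATION: " + " / ".join(history[-max_turns:]),
        "USER: " + query,
    ]
    return "\n".join(parts)

context = build_context("ana", "When do DHL pickups need notice?", ["hi", "hello, how can I help?"])
print(context)
```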

Installation

Shell

# macOS (homebrew)
brew install ollama

# Or download the app from ollama.ai
# Then start the server:
ollama serve

# Pull a model (downloads ~4-8 GB)
ollama pull llama3.1          # Meta's LLaMA 3.1 8B
ollama pull mistral           # Mistral 7B — fast & capable
ollama pull nomic-embed-text  # For embeddings (768-dim)
ollama pull qwen2.5-coder     # Qwen for code tasks

# Test in terminal
ollama run llama3.1 "Explain RAG in one paragraph"

Call from Python

Python

import requests

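def ollama_chat(prompt: str, model: str = "llama3.1") -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # the first call can be slow while the model loads
    )
    r.raise_for_status()  # surface HTTP errors, e.g. a model that was never pulled
    return r.json()["response"]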

# Or via LangChain (much easier for agents)
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)
response = llm.invoke("Summarise this email: ...")

✅

Hardware tip

8B models run comfortably on a Mac with 16 GB RAM (Apple Silicon). For 70B models you'll need 64 GB+ RAM or a GPU with 48 GB+ VRAM. Run ollama list to see downloaded models.

Stack overview

🦙

Ollama

LLM inference server. Serves llama3.1 + nomic-embed-text via REST.

⛓️

LangChain

Agent orchestration — handles the loop, tool routing, memory injection.

🎨

ChromaDB

Local vector database. Stores and retrieves embeddings by cosine similarity.

🗄️

SQLite

Structured data — schedules, task history, confirmed decisions.

🚀

FastAPI

Optional: HTTP API layer for the agent so it can be triggered by webhooks.

Full setup

Shell

# 1. Install dependencies
pip install langchain langchain-ollama langchain-community
pip install chromadb fastapi uvicorn python-dotenv

# 2. Start Ollama
ollama serve &
ollama pull llama3.1
ollama pull nomic-embed-text

Python — Local RAG Agent

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain.tools.retriever import create_retriever_tool

# LLM + embeddings — both local via Ollama
llm = ChatOllama(model="llama3.1", temperature=0)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Vector store — persisted to disk
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Wrap retriever as a tool
rag_tool = create_retriever_tool(
    retriever,
    name="search_schedules",
    description="Searches logistics schedules and past emails"
)

# Build agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a logistics scheduling assistant."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, [rag_tool], prompt)
executor = AgentExecutor(agent=agent, tools=[rag_tool], verbose=True)

result = executor.invoke({"input": "Any conflicts on March 15?"})

ℹ️

Pricing note

API pricing changes frequently. Always verify at the provider's official pricing page before budgeting a production system. Batch API tiers typically cut costs by ~50%; prompt caching can cut input costs by up to 90% on repeated context.

Model	Provider	Input (per 1M)	Output (per 1M)	Context	Best for
Claude Opus 4.7	Anthropic	$5.00	$25.00	200k	Flagship reasoning, hardest tasks
Claude Sonnet 4.6	Anthropic	$3.00	$15.00	200k	Coding, long-context analysis
Claude Haiku 4.5	Anthropic	$1.00	$5.00	200k	Fast classification, triage
GPT-5.5	OpenAI	$5.00	$30.00	1M	Frontier general-purpose, agents
GPT-5.4	OpenAI	$2.50	$15.00	1M	Production workhorse, balanced
GPT-4.1	OpenAI	$2.00	$8.00	128k	Tool use, structured output
GPT-4.1 nano	OpenAI	$0.10	$0.40	128k	Cheapest, simple tasks at scale
o3	OpenAI	$2.00	$8.00	200k	Reasoning — math, code, logic
o4-mini	OpenAI	$0.55	$2.20	200k	Cheap reasoning, high volume
Gemini 2.5 Pro	Google	$1.25	$10.00	1M	Massive docs, multimodal
Gemini 2.5 Flash	Google	$0.30	$2.50	1M	High-volume multimodal, cheap
LLaMA 3.1 8B	Local (Ollama)	Free	Free	128k	Privacy, edge, zero per-token cost
LLaMA 3.1 70B	Local (Ollama)	Free	Free	128k	Local frontier, needs ≥48GB VRAM
Mistral 7B	Local (Ollama)	Free	Free	32k	Lightweight instruction following

For a production agent handling 10,000 emails/month at ~1,000 input and ~1,000 output tokens each: Claude Sonnet 4.6 costs ~$180/mo vs. GPT-4.1 nano at ~$5/mo vs. local Ollama at $0 (hardware aside). With prompt caching on a fixed system prompt + RAG context, input costs typically drop 70–90%, which narrows the gap between frontier and smaller models when input tokens dominate the bill.
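
The arithmetic behind those figures can be checked with a few lines. The sketch assumes each email uses about 1,000 input and 1,000 output tokens, which is the reading that reproduces the ~$180 and ~$5 numbers from the table prices; the `monthly_cost` helper and its `cached_input_discount` parameter are illustrative names, and the discount is applied to all input tokens as a simplification.

```python
# USD per 1M tokens (input, output), copied from the comparison table
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-4.1 nano": (0.10, 0.40),
}

def monthly_cost(model, n_requests, in_tokens, out_tokens, cached_input_discount=0.0):
    """Estimate monthly spend; the discount models prompt caching on input only."""
    price_in, price_out = PRICES[model]
    input_cost = n_requests * in_tokens / 1_000_000 * price_in * (1 - cached_input_discount)
    output_cost = n_requests * out_tokens / 1_000_000 * price_out
    return round(input_cost + output_cost, 2)

sonnet = monthly_cost("Claude Sonnet 4.6", 10_000, 1_000, 1_000)
nano = monthly_cost("GPT-4.1 nano", 10_000, 1_000, 1_000)
sonnet_cached = monthly_cost("Claude Sonnet 4.6", 10_000, 1_000, 1_000, cached_input_discount=0.9)
print(sonnet, nano, sonnet_cached)
```

In this example output tokens account for $150 of the $180, so caching input only brings the total to $153. Caching pays off most on workloads with long, repeated prompts and short responses.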

Quick start

Python

# Shell: pip install openai

from openai import OpenAI

client = OpenAI(api_key="sk-...")   # or use OPENAI_API_KEY env var

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this email..."}]
)
print(response.choices[0].message.content)

Quick start

Python

# Shell: pip install anthropic

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyse this contract..."}]
)
print(message.content[0].text)

Quick start

Python

# Shell: pip install google-generativeai

import google.generativeai as genai

genai.configure(api_key="AIza...")

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content("Explain vector embeddings")
print(response.text)

Thought → Action → Observation cycle

Agent Trace — ReAct

Goal: "Is there a scheduling conflict on March 15th for DHL?"

Thought: I need to search the schedule database for March 15 DHL entries.
Action: search_schedules({"start_date": "2025-03-15", "carrier": "DHL"})
Observation: Found 3 entries. Two overlap: 14:00 pickup & 14:30 delivery.

Thought: There's a 30-minute conflict. I should flag this for review.
Action: create_conflict_alert({"time": "14:00", "type": "overlap"})
Observation: Alert #4821 created, assigned to scheduler queue.

Thought: Task complete. I have the answer and have taken action.
Final Answer: Yes, conflict found at 14:00-14:30 on March 15 (DHL).
             Alert #4821 created for review.

🔬

Why ReAct works

By externalising reasoning as text, the LLM "thinks before it acts." This tends to reduce hallucinated actions and makes debugging much easier: you can read the full thought trace.
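
Because the trace is plain text, it is also easy to parse for logging or replay. The sketch below assumes the exact line format shown in the trace above (`Thought:`, `Action: name({json})`, `Observation:`, `Final Answer:`); real model output is often messier, so treat the regular expression as a starting point rather than a robust parser.

```python
import json
import re

TRACE = """\
Thought: I need to search the schedule database for March 15 DHL entries.
Action: search_schedules({"start_date": "2025-03-15", "carrier": "DHL"})
Observation: Found 3 entries. Two overlap: 14:00 pickup & 14:30 delivery.
Thought: Task complete. I have the answer and have taken action.
Final Answer: Yes, conflict found at 14:00-14:30 on March 15 (DHL).
"""

# Matches: Action: tool_name({...json arguments...})
ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$")

def parse_trace(trace: str) -> list:
    steps = []
    for line in trace.splitlines():
        match = ACTION_RE.match(line)
        if match:
            # Tool name plus decoded JSON arguments
            steps.append(("action", match.group(1), json.loads(match.group(2))))
        elif line.startswith("Final Answer:"):
            steps.append(("final", line[len("Final Answer:"):].strip()))
        elif line.startswith(("Thought:", "Observation:")):
            kind, _, text = line.partition(":")
            steps.append((kind.lower(), text.strip()))
    return steps

steps = parse_trace(TRACE)
for step in steps:
    print(step[0])
```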

Orchestrator + Worker pattern

Multi-Agent Topology

👑

Orchestrator Agent

Routes tasks, aggregates results

▼

🔍

Research Agent

Web search + docs

▼

💻

Coder Agent

Generate + run code

▼

📝

Writer Agent

Compose output

When to use multi-agent

A single agent handles most tasks up to moderate complexity. Add multiple agents when: tasks are parallelisable (research while coding), specialisation improves quality (a dedicated critic agent reviewing a writer agent's output), or context limits are hit (each agent handles a slice of a large document).
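
The parallelisable case can be sketched with the standard library. The three agent functions below are hypothetical stubs that return strings; in a real system each would wrap its own LLM loop. The point of the sketch is the shape: independent workers run concurrently, and the dependent worker runs after both finish.

```python
from concurrent.futures import ThreadPoolExecutor

def research_agent(task: str) -> str:
    # Stub: a real worker would run web search and document lookup
    return f"[research] notes on {task}"

def coder_agent(task: str) -> str:
    # Stub: a real worker would generate and run code
    return f"[code] script for {task}"

def writer_agent(task: str, inputs: list) -> str:
    # Stub: a real worker would compose the final output from its inputs
    return f"[report] {task}: " + " + ".join(inputs)

def orchestrate(task: str) -> str:
    # Research and coding do not depend on each other, so run them in parallel
    with ThreadPoolExecutor(max_workers=2) as pool:
        research = pool.submit(research_agent, task)
        code = pool.submit(coder_agent, task)
        results = [research.result(), code.result()]
    # The writer needs both results, so it runs afterwards
    return writer_agent(task, results)

report = orchestrate("Q1 delays")
print(report)
```

Threads suit this case because LLM and API calls are I/O-bound; an asyncio version would follow the same structure.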

Confidence threshold routing

Python — HITL Router

CONFIDENCE_THRESHOLD = 0.88

async def route_decision(decision: Decision) -> ActionResult:
    if decision.confidence >= CONFIDENCE_THRESHOLD:
        # High confidence — auto-execute
        return await execute_action(decision)
    else:
        # Low confidence — queue for human review
        review_id = await queue_for_review({
            "decision": decision,
            "confidence": decision.confidence,
            "reasoning": decision.reasoning,
            "suggested_action": decision.action
        })
        return ActionResult(status="pending_review", review_id=review_id)

🔑

HITL is a feature, not a limitation

The best production agents aren't fully autonomous — they're autonomy-calibrated. Low confidence + high stakes = always route to human. High confidence + low stakes = auto-execute. Tune thresholds empirically against your domain.
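
Calibrating on both confidence and stakes can be sketched as a lookup of per-stakes thresholds. Only the 0.88 value comes from the router above; the `stakes` field, the other two thresholds, and the rule that high-stakes actions never auto-execute are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    confidence: float  # 0.0 to 1.0, however your system estimates it
    stakes: str        # "low", "medium" or "high" (hypothetical labels)

# Minimum confidence required to auto-execute at each stakes level.
# A threshold above 1.0 can never be met, so high stakes always go to a human.
THRESHOLDS = {"low": 0.70, "medium": 0.88, "high": 1.01}

def route(decision: Decision) -> str:
    if decision.confidence >= THRESHOLDS[decision.stakes]:
        return "auto_execute"
    return "human_review"

print(route(Decision("tag email", 0.75, "low")))
print(route(Decision("reply to partner", 0.80, "medium")))
print(route(Decision("wire transfer", 0.99, "high")))
```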

Why separate planning from execution

ReAct agents interleave planning and execution — they can go down wrong paths and recover, but may be inefficient. Plan & Execute forces the LLM to think through the full approach before acting. This is better for tasks where the steps are known upfront and backtracking is expensive (e.g. writing a long document, running a multi-step analysis).

1

Planner LLM call

Given the goal, generate a JSON array of steps: [{"step": "Search for Q1 data"}, {"step": "Calculate averages"}, …]

2

Executor loop

For each step, call a ReAct sub-agent that executes just that step and returns a result.

3

Re-plan if needed

After each step, optionally let the planner revise remaining steps based on what was discovered.
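
The three stages above can be sketched as a short control loop. Both `planner` and `execute_step` are hypothetical stand-ins for LLM calls (the planner returns a fixed plan, and the third step is invented for the example), so the sketch shows only the control flow, not real planning.

```python
import json

def planner(goal: str, done: list) -> list:
    """Stand-in for the planner LLM call: returns the steps still to do."""
    plan = json.loads(
        '[{"step": "Search for Q1 data"}, {"step": "Calculate averages"}, {"step": "Write summary"}]'
    )
    return plan[len(done):]

def execute_step(step: str, previous: list) -> str:
    """Stand-in for a ReAct sub-agent that executes a single step."""
    return f"done: {step}"

def plan_and_execute(goal: str, replan: bool = True, max_steps: int = 10) -> list:
    results = []
    remaining = planner(goal, results)           # 1. plan upfront
    while remaining and len(results) < max_steps:
        current = remaining.pop(0)
        results.append(execute_step(current["step"], results))  # 2. execute one step
        if replan:
            remaining = planner(goal, results)   # 3. revise the remaining steps
    return results

results = plan_and_execute("Report on Q1 averages")
print(results)
```

The `max_steps` cap plays the same role as max_iterations in a ReAct loop: a planner that keeps adding steps cannot run forever.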

The two pipelines

RAG Architecture — Ingestion + Retrieval

INGESTION (offline)

📄 Raw Documents

PDFs, emails, markdown, web pages, SQL tables

→

✂️ Chunker

Split into ~500-token overlapping chunks

→

🔢 Embed

nomic-embed-text → 768-dim float vector per chunk

→

🗃️ ChromaDB

Store (vector, metadata, raw text)

─────────────────────────────────

RETRIEVAL (at query time)

❓ User Query

"Any DHL conflicts on March 15?"

→

🔢 Embed Query

Same model → 768-dim query vector

→

🔍 Similarity Search

cosine(q, docs) — top-k above threshold 0.72

→

🧠 LLM + Context

Retrieved chunks injected into prompt → grounded answer
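
Both pipelines fit in a short, dependency-free sketch. To keep it runnable without a model, word counts stand in for real embeddings, and the sample chunks are invented. The 0.3 threshold is chosen for this toy embedding; the 0.72 figure above applies to real embedding models and would need tuning either way.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# INGESTION (offline): embed each chunk and store (vector, raw text)
chunks = [
    "DHL pickup on March 15 at 14:00",
    "DHL delivery on March 15 at 14:30",
    "UPS invoice for February is paid",
]
index = [(embed(c), c) for c in chunks]

# RETRIEVAL (query time): embed the query, rank by similarity, keep top-k above threshold
def retrieve(query: str, k: int = 2, threshold: float = 0.3) -> list:
    q = embed(query)
    scored = sorted(((cosine(q, vec), text) for vec, text in index), reverse=True)
    return [text for score, text in scored[:k] if score >= threshold]

def build_prompt(query: str) -> str:
    # Inject the retrieved chunks into the prompt that goes to the LLM
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

hits = retrieve("Any DHL conflicts on March 15?")
print(build_prompt("Any DHL conflicts on March 15?"))
```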

RAG vs. fine-tuning

📚 RAG

Knowledge updated instantly (re-ingest)
No GPU or training cost
Cites sources (transparent)
Works with any LLM
Limited by context window at retrieval

🎯 Fine-tuning

Knowledge baked into weights
Expensive GPU training required
Better for style/format changes
Faster at inference (no retrieval step)
Stale: requires re-training to update

✅

Rule of thumb

Use RAG when you need the LLM to reason over your data. Use fine-tuning when you need the LLM to behave differently (different tone, format, domain-specific reasoning patterns).

What is an embedding?

An embedding model converts a string of text into a list of floating-point numbers — a vector in high-dimensional space. nomic-embed-text produces 768 numbers. OpenAI's text-embedding-3-large produces 3,072.

The magic: semantically similar texts produce vectors that are geometrically close. "DHL shipment delayed" and "FedEx delivery postponed" have high cosine similarity even though they share no words.

Cosine similarity

Python

import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 1.0 = identical meaning, 0.0 = unrelated, -1.0 = opposite
# Typical RAG threshold: 0.70–0.80 (tune for your domain)

# With OllamaEmbeddings:
from langchain_ollama import OllamaEmbeddings
emb = OllamaEmbeddings(model="nomic-embed-text")

v1 = emb.embed_query("DHL shipment delayed")       # 768 floats
v2 = emb.embed_query("FedEx delivery postponed")    # 768 floats
print(cosine_similarity(v1, v2))  # → ~0.87

Embedding model comparison

Model	Dims	Cost	Best for
`nomic-embed-text`	768	Free (local)	General purpose, good quality/speed
`mxbai-embed-large`	1024	Free (local)	High quality, slower
`text-embedding-3-small`	1536	$0.02/1M tokens	Best OpenAI value
`text-embedding-3-large`	3072	$0.13/1M tokens	Highest accuracy
`text-embedding-004` (Google)	768	Free tier	Gemini ecosystem

Four main strategies

📏

Fixed Size

Split every N tokens regardless of content. Simple, fast. Add 10–20% overlap so that context cut at a chunk boundary still appears whole in a neighbouring chunk. Default: 512 tokens, 50 overlap.

🔤

Sentence / Paragraph

Split on natural boundaries (sentences, paragraphs). Better semantic coherence, variable chunk size. Best for prose documents.

📑

Semantic

Embed sentences, then group consecutive sentences with similar embeddings. More expensive but best quality. Use SemanticChunker in LangChain.

🌲

Hierarchical (RAPTOR)

Build a tree: embed chunks → cluster → summarise clusters → embed summaries. Enables both detailed and high-level retrieval.

Python — LangChain Chunkers

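from langchain_text_splitters import RecursiveCharacterTextSplitter  # best default choice
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings

document_text = "..."  # placeholder: the raw text you want to split
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Recursive: tries paragraphs, then lines, then sentences, then words
# Note: chunk_size and chunk_overlap count characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(document_text)

# Semantic: groups similar sentences (higher quality, slower)
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"  # or "standard_deviation"
)
semantic_chunks = semantic_splitter.split_text(document_text)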
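
For comparison, the fixed-size strategy needs no library at all. The sketch below works on a list of tokens and uses whitespace splitting as a stand-in tokeniser, which is an assumption for illustration; a real pipeline would count tokens with the tokeniser of its embedding model.

```python
def chunk_fixed(tokens: list, size: int = 512, overlap: int = 50) -> list:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

tokens = "the quick brown fox jumps over the lazy dog again".split()  # 10 toy tokens
chunks = chunk_fixed(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is the overlap at work: text near a boundary appears in both neighbours.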

Comparison

Database	Type	Self-hosted	Best for
ChromaDB	Local / embedded	Yes	Local dev, small datasets, privacy
Pinecone	Managed cloud	No	Production at scale, no ops overhead
Weaviate	Open-source / cloud	Yes	Hybrid search (BM25 + semantic)
Qdrant	Open-source / cloud	Yes	Rust-based, high performance
pgvector	Postgres extension	Yes	Already using Postgres, small-medium scale
FAISS	In-memory library	Yes	Research, large batches, no persistence needed

For getting started: ChromaDB requires zero infrastructure and zero config. When you're ready for production scale, migrate to Pinecone or Qdrant with the same LangChain abstraction layer.

Core abstractions

🔗

Chain (LCEL)

Compose LLM calls, prompts, parsers, and tools with pipe syntax: prompt | llm | parser. LangChain Expression Language.

🤖

AgentExecutor

Wraps an agent + tool list into a runnable loop. Handles tool dispatch, error recovery, and max iteration limits.

📚

Retriever

Standard interface for vector stores, BM25, web search. Swap ChromaDB for Pinecone with minimal code changes.

🧠

Memory

Conversation buffer, summary memory, vector memory. Manages what gets injected into the context window each turn.

Simple RAG chain (LCEL)

Python — LangChain LCEL

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template("""
Answer based on this context:
{context}

Question: {question}
""")

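def format_docs(docs):
    # Join the retrieved Document objects into one context string
    return "\n\n".join(doc.page_content for doc in docs)

# Pipe syntax: each step feeds the next
# (retriever and llm are the objects built in the local stack example)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What deliveries are due on Friday?")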

🌐

LangGraph vs. LangChain AgentExecutor

LangChain's AgentExecutor is great for simple ReAct loops. LangGraph is for complex workflows: conditional branching, parallel nodes, human-in-the-loop checkpoints, and persistent state that survives across API calls.

State graph pattern

Python — LangGraph

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    confidence: float
    needs_review: bool

def analyse_email(state: AgentState) -> AgentState:
    # Call LLM, update state
    ...

def check_confidence(state: AgentState) -> str:
    return "human_review" if state["confidence"] < 0.88 else "send_reply"

builder = StateGraph(AgentState)
builder.add_node("analyse", analyse_email)
builder.add_node("send_reply", send_email)
builder.add_node("human_review", queue_for_review)

builder.set_entry_point("analyse")
builder.add_conditional_edges("analyse", check_confidence)
builder.add_edge("send_reply", END)
builder.add_edge("human_review", END)

graph = builder.compile()

Key differences from LangChain

⛓️ LangChain

General-purpose orchestration
Agents, chains, tools, memory
Larger ecosystem
More boilerplate for RAG

🦙 LlamaIndex

Document-first RAG specialist
Automatic ingestion pipelines
Advanced indexing (RAPTOR, etc.)
Less code for pure RAG use cases

Python — LlamaIndex Quick Start

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load all docs from a folder
documents = SimpleDirectoryReader("./data").load_data()

# Index (embed + store) automatically
index = VectorStoreIndex.from_documents(documents)

# Query with natural language
query_engine = index.as_query_engine()
response = query_engine.query("Summarise all DHL-related emails")

Agents, Tasks, Crews

Python — CrewAI

from crewai import Agent, Task, Crew

researcher = Agent(
    role="Logistics Analyst",
    goal="Find schedule conflicts in incoming emails",
    backstory="Expert in supply chain scheduling with 10 years experience",
    tools=[search_schedules, check_calendar],
    llm=llm
)

writer = Agent(
    role="Email Composer",
    goal="Draft professional responses to logistics partners",
    backstory="Specialist in B2B communications",
    llm=llm
)

analyse_task = Task(
    description="Review the email and identify any conflicts",
    agent=researcher, expected_output="Conflict report"
)

reply_task = Task(
    description="Draft a reply addressing identified conflicts",
    agent=writer, expected_output="Email draft"
)

crew = Crew(agents=[researcher, writer], tasks=[analyse_task, reply_task])
result = crew.kickoff()

🚢

When to use CrewAI

CrewAI shines when your problem maps naturally to roles: researcher, writer, critic, coder. The role-based framing makes it easy to onboard non-engineers who can reason about "who does what" without understanding agent internals.

Model recommendations by task

Task	Recommended	Why
General agent reasoning	`llama3.1:8b`	Best small model for tool use and instruction following
Coding	`qwen2.5-coder:7b`	Fine-tuned specifically for code generation
Fast classification	`mistral:7b`	Very fast, good enough for binary/classification tasks
Embeddings	`nomic-embed-text`	High quality 768-dim, fast inference
Long context (local)	`llama3.1:70b`	Needs 64GB+ RAM — best local quality at any context
Vision + text	`llava:13b`	Multimodal, process images + text locally
