Goose supports local inference using llama.cpp, enabling offline usage, data privacy, and cost savings. This guide covers setup, configuration, and optimization.
Overview
Local inference allows you to:
- Run offline: No internet required once models are downloaded
- Preserve privacy: Data never leaves your machine
- Eliminate API costs: No per-token charges
- Customize models: Use fine-tuned or specialized models
- Control resources: Manage GPU/CPU usage precisely
Quick Start
1. Install Prerequisites
macOS/Linux:
# llama.cpp is bundled with Goose
# Just ensure you have enough disk space for models
GPU Acceleration (Optional):
# CUDA (NVIDIA)
sudo apt install nvidia-cuda-toolkit
# ROCm (AMD)
sudo apt install rocm-hip-runtime
# Metal (macOS)
# Built-in, no installation needed
2. Download a Model
Goose supports GGUF format models from Hugging Face:
# Download a recommended model (Qwen 2.5 Coder)
goose download-model qwen2.5-coder-7b-instruct
# Or specify a Hugging Face repo
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
Models are stored in ~/.cache/goose/models/.
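To confirm a download completed, list the cache directory:
ls ~/.cache/goose/models/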
3. Configure Goose
# Set provider to local
goose configure set GOOSE_PROVIDER local
# Specify model
goose configure set GOOSE_MODEL qwen2.5-coder-7b-instruct
4. Run Goose
Goose will automatically:
- Load the model into memory
- Allocate GPU/CPU resources
- Initialize the inference engine
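For example, starting an interactive session with the local provider configured will trigger this setup:
goose session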
Supported Models
Goose works with any GGUF model, but these are recommended:
Coding Models
| Model | Size | Context | Best For |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 32K | General coding, fast |
| Qwen 2.5 Coder 14B | 14B | 32K | Complex tasks, accurate |
| DeepSeek Coder V2 16B | 16B | 128K | Long context, architecture |
| CodeLlama 13B | 13B | 16K | Python, legacy |
General Purpose
| Model | Size | Context | Best For |
|---|---|---|---|
| Llama 3.1 8B | 8B | 128K | Balanced performance |
| Mistral 7B | 7B | 32K | Fast, efficient |
| Phi-3 Mini | 3.8B | 128K | Low memory |
Model sizes are approximate. Quantized versions (Q4, Q5, Q6) reduce memory usage at the cost of slight accuracy loss.
Quantization Levels
GGUF models come in various quantization levels:
| Quantization | Size | Quality | Speed | Recommended For |
|---|---|---|---|---|
| Q8_0 | 100% | Best | Slow | High-VRAM setups |
| Q6_K | 75% | Excellent | Medium | Balanced quality/size |
| Q5_K_M | 65% | Very Good | Fast | Most users (default recommendation) |
| Q4_K_M | 50% | Good | Faster | Low-VRAM setups |
| Q3_K_M | 40% | Acceptable | Fastest | Memory-constrained systems |
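As a rough rule of thumb, Q8_0 stores about one byte per parameter, so a 7B model is on the order of 7 GB at Q8_0 and roughly 4.5-5 GB at Q5_K_M (about 65% of that), which lines up with the 7B row in the memory table below.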
For example, to download a specific quantization:
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --quant Q5_K_M
Configuration
Model Settings
Configure in ~/.config/goose/config.yaml:
GOOSE_PROVIDER: local
GOOSE_MODEL: qwen2.5-coder-7b-instruct
# Local inference settings
local_inference:
# Context size (tokens)
context_size: 8192
# Generation settings
temperature: 0.7
top_p: 0.95
top_k: 40
repeat_penalty: 1.1
# Performance
n_batch: 512
n_threads: 8
flash_attention: true
# GPU offload (0 = CPU only, -1 = all layers)
n_gpu_layers: -1
Environment Variables
# Force CPU inference
export GOOSE_LOCAL_CPU_ONLY=1
# Set GPU layers manually
export GOOSE_N_GPU_LAYERS=32
# Model cache directory
export GOOSE_MODEL_CACHE=~/.cache/goose/models
Memory Management
Goose automatically estimates memory requirements and adjusts context size.
Memory Estimation
From crates/goose/src/providers/local_inference/inference_engine.rs:
pub fn estimate_max_context_for_memory(
model: &LlamaModel,
runtime: &InferenceRuntime,
) -> Option<usize> {
let available = available_inference_memory_bytes(runtime);
// Reserve 50% for computation buffers
let usable = (available as f64 * 0.5) as u64;
// Calculate KV cache size per token
let bytes_per_token = (k_per_head + v_per_head) * n_head_kv * n_layer * 2;
Some((usable / bytes_per_token) as usize)
}
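As a rough worked example (reading k_per_head and v_per_head as per-head element counts taken from the model's metadata, with the final factor of 2 accounting for f16 bytes per element): a model with 32 layers, 8 KV heads, and a head dimension of 128, similar to Llama 3.1 8B, needs (128 + 128) × 8 × 32 × 2 ≈ 128 KB of KV cache per token, so an 8K context reserves roughly 1 GB. Models with fewer KV heads or layers need proportionally less.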
Memory Requirements
Typical requirements for Q5_K_M quantization:
| Model Size | Model RAM | Context (8K) | Context (32K) | Total |
|---|---|---|---|---|
| 3B | 2.5 GB | 0.5 GB | 2 GB | 3-5 GB |
| 7B | 5 GB | 1 GB | 4 GB | 6-9 GB |
| 13B | 9 GB | 2 GB | 8 GB | 11-17 GB |
| 34B | 22 GB | 5 GB | 20 GB | 27-42 GB |
If your prompt exceeds available memory, Goose will return an error: “Prompt exceeds estimated memory capacity”. Reduce context size or use a smaller model.
GPU Acceleration
Check GPU usage:
# NVIDIA
nvidia-smi
# AMD
rocm-smi
# macOS
sudo powermetrics --samplers gpu_power
Optimize GPU layers:
# Start with all layers
n_gpu_layers: -1
# If OOM, reduce gradually
n_gpu_layers: 32 # Offload 32 layers to GPU
CPU Optimization
# Use all CPU cores
n_threads: -1 # Auto-detect
# Or specify manually
n_threads: 16
# Batch size (larger = faster, more memory)
n_batch: 512
Flash Attention
Enable for 2-3x faster inference on supported hardware:
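# In ~/.config/goose/config.yaml
local_inference:
  flash_attention: true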
Requires one of:
- CUDA compute capability ≥ 7.0 (RTX 20 series+)
- Metal (macOS M1+)
- ROCm 5.0+
Tool Calling
Local models support tool calling through two modes: native and emulated.
Models trained with native tool calling (e.g., Qwen 2.5):
// Automatically detected from model metadata
if model.supports_native_tools() {
// Use built-in tool calling
}
For models without native support, Goose emulates tool calling:
// Inject tool definitions into system prompt
let prompt_with_tools = format!(
"{}\n\nAvailable tools:\n{}",
system_prompt,
tool_definitions
);
Native tool calling is more reliable. Choose models like Qwen 2.5, Mistral, or Llama 3.1 for best results.
Sampling Strategies
Temperature Sampling (Default)
Balanced creativity and coherence:
sampling:
type: temperature
temperature: 0.7 # 0 = deterministic, 1 = creative
top_p: 0.95 # Nucleus sampling
top_k: 40 # Top-k sampling
min_p: 0.05 # Minimum probability
Greedy Sampling
Always selects the most likely token (deterministic):
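A minimal sketch, assuming greedy selection follows the same configuration schema as the other strategies (the exact key name is not shown in this guide):
sampling:
  type: greedy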
Mirostat v2
Adaptive sampling for consistent perplexity:
sampling:
type: mirostat_v2
tau: 5.0 # Target perplexity
eta: 0.1 # Learning rate
Troubleshooting
Model won’t load
# Verify model file exists
ls ~/.cache/goose/models/
# Check format (must be .gguf)
file ~/.cache/goose/models/qwen2.5-coder-7b-instruct.gguf
# Re-download if corrupted
goose download-model qwen2.5-coder-7b-instruct --force
Out of Memory (OOM)
# Reduce context size
context_size: 4096 # Instead of 32768
# Use smaller quantization
# Download Q4_K_M instead of Q5_K_M
# Reduce GPU layers
n_gpu_layers: 24 # Instead of -1
Slow generation
# Increase batch size
n_batch: 1024 # Instead of 512
# Enable flash attention
flash_attention: true
# Use all CPU cores
n_threads: -1
# Offload more to GPU
n_gpu_layers: -1
Poor quality responses
# Use a higher-precision quantization (Q6 or Q8)
# Increase temperature
temperature: 0.8
# Adjust penalties
repeat_penalty: 1.15
frequency_penalty: 0.1
Advanced: Custom Model Registry
Define custom models in ~/.config/goose/local_models.yaml:
models:
my-custom-model:
path: /path/to/model.gguf
context_size: 16384
temperature: 0.7
n_gpu_layers: -1
description: "My fine-tuned model"
Use it:
goose configure set GOOSE_MODEL my-custom-model
Implementation Details
Source Code
- Engine:
crates/goose/src/providers/local_inference/inference_engine.rs
- Model registry:
crates/goose/src/providers/local_inference/local_model_registry.rs
- Native tools:
crates/goose/src/providers/local_inference/inference_native_tools.rs
- Emulated tools:
crates/goose/src/providers/local_inference/inference_emulated_tools.rs
- Hugging Face models:
crates/goose/src/providers/local_inference/hf_models.rs
llama.cpp Integration
Goose uses the llama-cpp-2 Rust bindings:
use llama_cpp_2::{
context::LlamaContext,
model::LlamaModel,
sampling::LlamaSampler,
};
// Load the model from a GGUF file
let model = LlamaModel::load_from_file(&backend, path, &params)?;
// Create an inference context
let ctx = model.new_context(&backend, ctx_params)?;
// Build a sampler chain and sample the next token
let mut sampler = LlamaSampler::chain_simple(samplers);
let token = sampler.sample(&ctx, -1);
Resources