Goose supports local inference using llama.cpp, enabling offline usage, data privacy, and cost savings. This guide covers setup, configuration, and optimization.
Overview
Local inference allows you to:
- Run offline: No internet required once models are downloaded
- Preserve privacy: Data never leaves your machine
- Eliminate API costs: No per-token charges
- Customize models: Use fine-tuned or specialized models
- Control resources: Manage GPU/CPU usage precisely
Quick Start
1. Install Prerequisites
macOS/Linux:
# llama.cpp is bundled with Goose
# Just ensure you have enough disk space for models
GPU Acceleration (Optional):
# CUDA (NVIDIA)
sudo apt install nvidia-cuda-toolkit
# ROCm (AMD)
sudo apt install rocm-hip-runtime
# Metal (macOS)
# Built-in, no installation needed
2. Download a Model
Goose supports GGUF format models from Hugging Face:
# Download a recommended model (Qwen 2.5 Coder)
goose download-model qwen2.5-coder-7b-instruct
# Or specify a Hugging Face repo
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
Models are stored in ~/.cache/goose/models/.
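To confirm a download completed, list the cache directory:
ls ~/.cache/goose/models/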
3. Configure Goose
# Set provider to local
goose configure set GOOSE_PROVIDER local
# Specify model
goose configure set GOOSE_MODEL qwen2.5-coder-7b-instruct
4. Run Goose
Goose will automatically:
- Load the model into memory
- Allocate GPU/CPU resources
- Initialize the inference engine
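For example, starting an interactive session with the local provider configured will trigger this setup:
goose session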
Supported Models
Goose works with any GGUF model, but these are recommended:
Coding Models
| Model | Size | Context | Best For |
|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 32K | General coding, fast |
| Qwen 2.5 Coder 14B | 14B | 32K | Complex tasks, accurate |
| DeepSeek Coder V2 16B | 16B | 128K | Long context, architecture |
| CodeLlama 13B | 13B | 16K | Python, legacy |
General Purpose
| Model | Size | Context | Best For |
|---|---|---|---|
| Llama 3.1 8B | 8B | 128K | Balanced performance |
| Mistral 7B | 7B | 32K | Fast, efficient |
| Phi-3 Mini | 3.8B | 128K | Low memory |
Model sizes are approximate. Quantized versions (Q4, Q5, Q6) reduce memory usage at the cost of slight accuracy loss.
Quantization Levels
GGUF models come in various quantization levels:
| Quantization | Size | Quality | Speed | Recommended For |
|---|---|---|---|---|
| Q8_0 | 100% | Best | Slow | High-VRAM setups |
| Q6_K | 75% | Excellent | Medium | Balanced quality/size |
| Q5_K_M | 65% | Very Good | Fast | Most users (default recommendation) |
| Q4_K_M | 50% | Good | Faster | Low-VRAM setups |
| Q3_K_M | 40% | Acceptable | Fastest | Memory-constrained systems |
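As a rough rule of thumb, Q8_0 stores about one byte per parameter, so a 7B model is on the order of 7 GB at Q8_0 and roughly 4.5-5 GB at Q5_K_M (about 65% of that), which lines up with the 7B row in the memory table below.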
For example, to download a specific quantization:
goose download-model Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --quant Q5_K_M
Configuration
Model Settings
Configure in ~/.config/goose/config.yaml:
GOOSE_PROVIDER: local
GOOSE_MODEL: qwen2.5-coder-7b-instruct
# Local inference settings
local_inference:
# Context size (tokens)
context_size: 8192
# Generation settings
temperature: 0.7
top_p: 0.95
top_k: 40
repeat_penalty: 1.1
# Performance
n_batch: 512
n_threads: 8
flash_attention: true
# GPU offload (0 = CPU only, -1 = all layers)
n_gpu_layers: -1
Environment Variables
# Force CPU inference
export GOOSE_LOCAL_CPU_ONLY=1
# Set GPU layers manually
export GOOSE_N_GPU_LAYERS=32
# Model cache directory
export GOOSE_MODEL_CACHE=~/.cache/goose/models
Memory Management
Goose automatically estimates memory requirements and adjusts context size.
Memory Estimation
From crates/goose/src/providers/local_inference/inference_engine.rs:
pub fn estimate_max_context_for_memory(
model: &LlamaModel,
runtime: &InferenceRuntime,
) -> Option<usize> {
let available = available_inference_memory_bytes(runtime);
// Reserve 50% for computation buffers
let usable = (available as f64 * 0.5) as u64;
// Calculate KV cache size per token
let bytes_per_token = (k_per_head + v_per_head) * n_head_kv * n_layer * 2;
Some((usable / bytes_per_token) as usize)
}
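As a rough worked example (reading k_per_head and v_per_head as per-head element counts taken from the model's metadata, with the final factor of 2 accounting for f16 bytes per element): a model with 32 layers, 8 KV heads, and a head dimension of 128, similar to Llama 3.1 8B, needs (128 + 128) × 8 × 32 × 2 ≈ 128 KB of KV cache per token, so an 8K context reserves roughly 1 GB. Models with fewer KV heads or layers need proportionally less.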
Memory Requirements
Typical requirements for Q5_K_M quantization:
| Model Size | Model RAM | Context (8K) | Context (32K) | Total |
|---|---|---|---|---|
| 3B | 2.5 GB | 0.5 GB | 2 GB | 3-5 GB |
| 7B | 5 GB | 1 GB | 4 GB | 6-9 GB |
| 13B | 9 GB | 2 GB | 8 GB | 11-17 GB |
| 34B | 22 GB | 5 GB | 20 GB | 27-42 GB |
If your prompt exceeds available memory, Goose will return an error: “Prompt exceeds estimated memory capacity”. Reduce context size or use a smaller model.
GPU Acceleration
Check GPU usage:
# NVIDIA
nvidia-smi
# AMD
rocm-smi
# macOS
sudo powermetrics --samplers gpu_power
Optimize GPU layers:
# Start with all layers
n_gpu_layers: -1
# If OOM, reduce gradually
n_gpu_layers: 32 # Offload 32 layers to GPU
CPU Optimization
# Use all CPU cores
n_threads: -1 # Auto-detect
# Or specify manually
n_threads: 16
# Batch size (larger = faster, more memory)
n_batch: 512
Flash Attention
Enable for 2-3x faster inference on supported hardware:
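# In ~/.config/goose/config.yaml
local_inference:
  flash_attention: true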
Requires one of:
- CUDA compute capability ≥ 7.0 (RTX 20 series+)
- Metal (macOS M1+)
- ROCm 5.0+
Tool Calling
Local models support tool calling through two modes: native and emulated.
Models trained with native tool calling (e.g., Qwen 2.5):
// Automatically detected from model metadata
if model.supports_native_tools() {
// Use built-in tool calling
}
For models without native support, Goose emulates tool calling:
// Inject tool definitions into system prompt
let prompt_with_tools = format!(
"{}\n\nAvailable tools:\n{}",
system_prompt,
tool_definitions
);
Native tool calling is more reliable. Choose models like Qwen 2.5, Mistral, or Llama 3.1 for best results.
Sampling Strategies
Temperature Sampling (Default)
Balanced creativity and coherence:
sampling:
type: temperature
temperature: 0.7 # 0 = deterministic, 1 = creative
top_p: 0.95 # Nucleus sampling
top_k: 40 # Top-k sampling
min_p: 0.05 # Minimum probability
Greedy Sampling
Always selects the most likely token (deterministic):
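A minimal sketch, assuming greedy selection follows the same configuration schema as the other strategies (the exact key name is not shown in this guide):
sampling:
  type: greedy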
Mirostat v2
Adaptive sampling for consistent perplexity:
sampling:
type: mirostat_v2
tau: 5.0 # Target perplexity
eta: 0.1 # Learning rate
Troubleshooting
Model won’t load
# Verify model file exists
ls ~/.cache/goose/models/
# Check format (must be .gguf)
file ~/.cache/goose/models/qwen2.5-coder-7b-instruct.gguf
# Re-download if corrupted
goose download-model qwen2.5-coder-7b-instruct --force
Out of Memory (OOM)
# Reduce context size
context_size: 4096 # Instead of 32768
# Use smaller quantization
# Download Q4_K_M instead of Q5_K_M
# Reduce GPU layers
n_gpu_layers: 24 # Instead of -1
Slow generation
# Increase batch size
n_batch: 1024 # Instead of 512
# Enable flash attention
flash_attention: true
# Use all CPU cores
n_threads: -1
# Offload more to GPU
n_gpu_layers: -1
Poor quality responses
# Use a higher-precision quantization (Q6 or Q8)
# Increase temperature
temperature: 0.8
# Adjust penalties
repeat_penalty: 1.15
frequency_penalty: 0.1
Advanced: Custom Model Registry
Define custom models in ~/.config/goose/local_models.yaml:
models:
my-custom-model:
path: /path/to/model.gguf
context_size: 16384
temperature: 0.7
n_gpu_layers: -1
description: "My fine-tuned model"
Use it:
goose configure set GOOSE_MODEL my-custom-model
Implementation Details
Source Code
- Engine:
crates/goose/src/providers/local_inference/inference_engine.rs
- Model registry:
crates/goose/src/providers/local_inference/local_model_registry.rs
- Native tools:
crates/goose/src/providers/local_inference/inference_native_tools.rs
- Emulated tools:
crates/goose/src/providers/local_inference/inference_emulated_tools.rs
- Hugging Face models:
crates/goose/src/providers/local_inference/hf_models.rs
llama.cpp Integration
Goose uses the llama-cpp-2 Rust bindings:
use llama_cpp_2::{
context::LlamaContext,
model::LlamaModel,
sampling::LlamaSampler,
};
// Load the model from a GGUF file
let model = LlamaModel::load_from_file(&backend, path, &params)?;
// Create an inference context
let ctx = model.new_context(&backend, ctx_params)?;
// Build a sampler chain and sample the next token
let mut sampler = LlamaSampler::chain_simple(samplers);
let token = sampler.sample(&ctx, -1);
Resources