Models & Providers

ScalyClaw routes LLM calls through a configurable provider stack. You can register multiple models from different providers and control how they are picked at call time through priority (selection tier) and weight (load-balancing share). All model configuration lives in Redis — no config files on disk — and changes take effect immediately without a restart.

Model Configuration

Models are added and managed in the dashboard under Settings → Models. Each entry in the model list describes a single provider/model combination. You can have as many entries as you need; the orchestrator selects the right one at call time based on priority, weight, and the capabilities required by that particular request.

Model Properties

| Property | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier for this model entry within ScalyClaw. Used internally and in logs — choose something readable like gpt4o-primary or claude-opus-main. |
| name | string | The model name as the provider expects it — e.g. gpt-4o, claude-sonnet-4-20250514, gemini-1.5-pro. For local/Ollama, this is the tag pulled in Ollama. |
| provider | string | Key into the providers map — e.g. openai, anthropic, google. Determines which configured provider credentials and base URL are used. |
| enabled | boolean | Toggle this model on or off without removing it from the config. Disabled models are never selected by the orchestrator. |
| priority | integer | Lower number = higher priority. Only the lowest-numbered enabled group is used for selection; higher-numbered groups act as disabled unless you remove or disable every model in a lower-numbered one. Within the selected group, models are distributed by weight. There is no automatic cross-priority fallback on failure — see the Selection Algorithm and "Retry Behavior" below. |
| weight | integer | Relative weight (0–100) for load balancing among models sharing the same priority. A model with weight: 3 receives three times as many requests as one with weight: 1. Useful for spreading load across multiple keys for the same provider. |
| temperature | number | Sampling temperature for this model. Range 0.0–2.0 depending on provider; 0.7 is a reasonable default for conversational use. |
| maxTokens | integer | Maximum number of tokens the model may generate in a single response. |
| contextWindow | integer | Total context window size for this model in tokens, including both input and output. Used to guard against prompts that would exceed the model's limit. |
| toolEnabled | boolean | Whether this model supports tool/function calling. The orchestrator only selects this model for tool-enabled requests when this is true. |
| imageEnabled | boolean | Whether this model can process image inputs. |
| audioEnabled | boolean | Whether this model can process audio inputs. |
| videoEnabled | boolean | Whether this model can process video inputs. |
| documentEnabled | boolean | Whether this model can process document inputs (e.g. PDFs). |
| reasoningEnabled | boolean | Whether this model supports extended thinking / reasoning mode. |
| inputPricePerMillion | number | Cost in USD per one million input tokens. Used for budget tracking and spend estimates. |
| outputPricePerMillion | number | Cost in USD per one million output tokens. Used for budget tracking and spend estimates. |
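In TypeScript terms, the table above corresponds to a shape like the following. This is a sketch reconstructed from the documented fields, not the actual source type:

```typescript
// A model entry as documented above (field names from the table; illustrative sketch).
interface ModelConfig {
  id: string;
  name: string;          // provider-facing model name
  provider: string;      // key into the providers map
  enabled: boolean;
  priority: number;      // lower number = higher priority
  weight: number;        // 0-100, share within a priority group
  temperature: number;
  maxTokens: number;
  contextWindow: number;
  toolEnabled: boolean;
  imageEnabled: boolean;
  audioEnabled: boolean;
  videoEnabled: boolean;
  documentEnabled: boolean;
  reasoningEnabled: boolean;
  inputPricePerMillion: number;
  outputPricePerMillion: number;
}
```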

Model Selection Algorithm

ScalyClaw picks one model per request using a four-step algorithm:

  1. Filter out disabled — only models with enabled: true are considered. Capability flags (toolEnabled, imageEnabled, etc.) are not enforced at this step — see the callout below.
  2. Top priority group — candidates are sorted by priority (lower number = higher priority). Only the lowest-numbered group is considered. Higher-numbered groups are effectively dormant unless every model in a lower-numbered group is disabled or removed.
  3. Weighted random pick — within that priority group, one model is selected probabilistically. Each model's chance equals weight / totalWeight. A model with weight: 75 is picked three times as often as one with weight: 25 in the same group.
  4. Retry on transient failure — if the selected provider/model call fails with a retryable error, the same provider/model is retried with exponential backoff (default 3 attempts, 500ms base delay; see shared/src/const/constants.ts). Retries never swap to a different model or priority group. If all retries fail, the error propagates to the caller.
Weight is probabilistic, not round-robin

Weights control the probability of selection, not a strict rotation. Each request independently rolls the dice. Over many requests the distribution converges to the weight ratio, but short runs may show variance. A weight of 0 effectively disables a model without removing it from the config.
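The filter, priority, and weighted-pick steps above can be sketched like this. It is a minimal illustration, not the production selector; the field names follow the Model Properties table, and the function name and injectable `rand` parameter are assumptions for testability:

```typescript
interface ModelEntry {
  id: string;
  enabled: boolean;
  priority: number; // lower number = higher priority
  weight: number;   // relative share within the priority group
}

// Sketch of the selector: filter disabled/zero-weight models, keep only the
// top (lowest-numbered) priority group, then make one weighted random pick.
function pickModel(
  models: ModelEntry[],
  rand: () => number = Math.random
): ModelEntry | undefined {
  const enabled = models.filter((m) => m.enabled && m.weight > 0);
  if (enabled.length === 0) return undefined;
  const top = Math.min(...enabled.map((m) => m.priority));
  const group = enabled.filter((m) => m.priority === top);
  const total = group.reduce((sum, m) => sum + m.weight, 0);
  let roll = rand() * total; // a point in [0, totalWeight)
  for (const m of group) {
    roll -= m.weight;
    if (roll < 0) return m;
  }
  return group[group.length - 1]; // floating-point edge guard
}
```

Note that each call rolls independently, which is exactly why short runs can deviate from the configured ratio.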

No automatic cross-provider fallback

If you need provider-level high availability, do not rely on priority tiers — put all the candidates you want to load-balance across into the same priority group with non-zero weights. Today the selector picks one model at the start of a turn; a transient provider failure will exhaust the retry budget on that same model and then surface the error rather than trying the next group. Building true cross-priority fallback is on the roadmap.
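The per-model retry described in step 4 can be sketched as follows. The 3-attempt / 500 ms defaults mirror the constants mentioned above; the function name is illustrative, not the real API:

```typescript
// Retry the same provider/model call with exponential backoff.
// Never switches models: on exhaustion the last error propagates to the caller.
async function withRetry<T>(
  call: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // waits 500 ms, then 1000 ms, ... between attempts
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```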

Selection Hierarchy

Different components in ScalyClaw can define their own model lists. Selection follows a fallback chain from most specific to most general:

| Component | Primary source | Fallback |
| --- | --- | --- |
| Orchestrator | orchestrator.models[] | Global models.models[] (enabled only) |
| Agent | Agent-specific models[] | Orchestrator models, then global |
| Guards | Guard-specific model field | Global models.models[] |
| Proactive | proactive.model | Orchestrator models, then global |
| Memory extraction | Global models.models[] | — |
| Embeddings | memory.embeddingModel | models.embeddingModels[] via weighted selection |

This means the orchestrator and each agent can run on different models. For example, the orchestrator can use Claude Opus while a lightweight research agent uses GPT-4o-mini — all configurable without code changes.
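The fall-through chain amounts to a first-non-empty lookup, sketched below. The function name and flat string pools are simplifications for illustration:

```typescript
// Return the first non-empty model pool, most specific to most general.
function resolveModelPool(
  agentModels: string[],
  orchestratorModels: string[],
  globalModels: string[]
): string[] {
  for (const pool of [agentModels, orchestratorModels, globalModels]) {
    if (pool.length > 0) return pool;
  }
  return []; // nothing configured anywhere
}
```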

"auto" Model Selection

Several config fields accept the special value "auto", which means "use weighted-random selection from enabled models instead of a fixed model." This is the recommended default: it enables load balancing and keeps the fall-through chain described above intact.

| Field | Behavior when "auto" |
| --- | --- |
| memory.embeddingModel | Select from models.embeddingModels[] using priority + weight. Default is "auto". |
| Agent models[].model | Entries with model: "auto" are skipped, causing the agent to fall through to the orchestrator's model list, then to the global pool. |
| proactive.model | When empty or "", falls through to orchestrator models, then global. |
| Guard model fields | When empty or "", falls through to global model selection. |

Supported Providers

Providers are registered under config.models.providers as a map from a provider key to an object with an optional apiKey and optional baseUrl. Each model entry's provider field references one of these keys. ScalyClaw ships with 13 built-in provider keys. anthropic uses the native Anthropic SDK; every other key routes through a single OpenAI-compatible factory with the provider's own base URL.

| Provider key | Default baseUrl | API key | Notes |
| --- | --- | --- | --- |
| openai | https://api.openai.com/v1 | required | GPT-4 / GPT-4.1 / GPT-4o / GPT-5 / o3 / o4 families. Reasoning models auto-detected by name. |
| anthropic | https://api.anthropic.com | required | Claude Opus / Sonnet / Haiku. Extended and adaptive thinking wired up for Claude 4.x. |
| google | https://generativelanguage.googleapis.com/v1beta/openai | required | Gemini via Google's OpenAI-compatible endpoint. Gemini 2.5 / 3 / 3.1 families. |
| openrouter | https://openrouter.ai/api/v1 | required | Any model from OpenRouter's catalog. Pricing on their site. |
| mistral | https://api.mistral.ai/v1 | required | Mistral Large / Medium / Small / Codestral. |
| groq | https://api.groq.com/openai/v1 | required | Llama 3.3 / Mixtral / Gemma 2 on Groq's fast-inference stack. |
| xai | https://api.x.ai/v1 | required | Grok 2 / Grok 3 / Grok 3 mini. |
| deepseek | https://api.deepseek.com/v1 | required | DeepSeek Chat and Reasoner (R1). Reasoning traces are captured and discarded (echoing back violates the DeepSeek API contract). |
| cohere | https://api.cohere.com/compatibility/v1 | required | Command R / R+ and Cohere embedding models. |
| minimax | https://api.minimax.io/v1 | required | MiniMax M2.5 family. |
| ollama | http://localhost:11434/v1 | not required | Local Ollama over the OpenAI-compatible endpoint. Tool calling works on models that support it (Qwen, Llama 3.3, etc.); Gemma replies in plain text. |
| lmstudio | http://localhost:1234/v1 | not required | Local LM Studio. Same tool-calling caveats as Ollama. |
| custom | (you supply one) | optional | Arbitrary OpenAI-compatible endpoint. Use for vLLM, llama.cpp server, proxies, or any self-hosted gateway. |

A provider is registered at boot and on every config reload. If a provider is marked "key required" above and you haven't supplied apiKey, registration is skipped with a warning log — the provider won't be usable until you add one. There is no azure key in the current build; point a custom entry at your Azure OpenAI endpoint instead.
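The skip-on-missing-key behavior can be pictured like this. The KEY_REQUIRED set is derived from the table above; the function name and return shape are a sketch, not the actual registration code:

```typescript
interface ProviderConfig {
  apiKey?: string;
  baseUrl?: string;
}

// Providers that are skipped (with a warning) when no apiKey is supplied.
const KEY_REQUIRED = new Set([
  "openai", "anthropic", "google", "openrouter", "mistral",
  "groq", "xai", "deepseek", "cohere", "minimax",
]);

// Return the provider keys that would actually be registered.
function registerProviders(providers: Record<string, ProviderConfig>): string[] {
  const registered: string[] = [];
  for (const [key, cfg] of Object.entries(providers)) {
    if (KEY_REQUIRED.has(key) && !cfg.apiKey) {
      console.warn(`provider ${key} skipped: apiKey required but not set`);
      continue;
    }
    registered.push(key);
  }
  return registered;
}
```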

Example Configuration

A realistic multi-provider setup: Claude Opus 4.7 as the primary, GPT-5.4 as a same-priority peer with lower weight, and a local Qwen model via Ollama for offline resilience. Because there is no automatic cross-priority fallback, put the two cloud models in the same priority tier to share load; the local model sits in a higher-numbered tier and is effectively dormant until you disable the cloud ones.

json
{
  "models": {
    "providers": {
      "anthropic": { "apiKey": "sk-ant-..." },
      "openai":    { "apiKey": "sk-..." },
      "ollama":    { "baseUrl": "http://localhost:11434/v1" }
    },
    "models": [
      {
        "id": "anthropic:claude-opus-4-7",
        "name": "anthropic:claude-opus-4-7",
        "provider": "anthropic",
        "enabled": true,
        "priority": 1,
        "weight": 75,
        "temperature": 0.7,
        "maxTokens": 8192,
        "contextWindow": 1000000,
        "toolEnabled": true,
        "imageEnabled": true,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": true,
        "reasoningEnabled": true,
        "inputPricePerMillion": 5.00,
        "outputPricePerMillion": 25.00
      },
      {
        "id": "openai:gpt-5.4",
        "name": "openai:gpt-5.4",
        "provider": "openai",
        "enabled": true,
        "priority": 1,
        "weight": 25,
        "temperature": 0.7,
        "maxTokens": 8192,
        "contextWindow": 1000000,
        "toolEnabled": true,
        "imageEnabled": true,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": false,
        "reasoningEnabled": true,
        "inputPricePerMillion": 2.50,
        "outputPricePerMillion": 15.00
      },
      {
        "id": "ollama:qwen3",
        "name": "ollama:qwen3",
        "provider": "ollama",
        "enabled": true,
        "priority": 2,
        "weight": 100,
        "temperature": 0.7,
        "maxTokens": 2048,
        "contextWindow": 32768,
        "toolEnabled": true,
        "imageEnabled": false,
        "audioEnabled": false,
        "videoEnabled": false,
        "documentEnabled": false,
        "reasoningEnabled": true,
        "inputPricePerMillion": 0,
        "outputPricePerMillion": 0
      }
    ],
    "embeddingModels": []
  }
}

With this config, 75% of priority-1 requests go to Claude and 25% go to GPT-5.4. The local Qwen entry at priority 2 is not consulted while any priority-1 model is enabled — it only comes into play once you disable the cloud models (manually or from the dashboard).

Capability flags are advisory

toolEnabled, imageEnabled, audioEnabled, videoEnabled, and documentEnabled are stored for UI hints and future enforcement but are not consulted during model selection today. If you configure a model that lacks a capability your code path needs, the upstream provider will either reject the request or produce a degraded response — the selector will not route around it. Only reasoningEnabled is actually consumed at call time (see the Reasoning section below).

Reasoning Flag

When a model has reasoningEnabled: true, the orchestrator threads the flag through to the provider call. What happens next is provider-specific:

  • OpenAI-compatible providers (openai, openrouter, google, groq, xai, deepseek, cohere, mistral, minimax, ollama, lmstudio, custom) translate it to reasoning_effort: "medium" on the request body. Providers that don't implement that field ignore it silently.
  • OpenAI reasoning models (o1/o3/o4/gpt-5.x) are auto-detected by name. For these, the provider also switches max_tokens → max_completion_tokens and omits temperature — both required by the API.
  • DeepSeek-R1 (deepseek-reasoner) always thinks; the provider captures any reasoning_content in the response for observability but never echoes it back on subsequent turns (echoing it produces a 400).
  • Anthropic wraps the flag with a model-name regex check for Claude 4.x. When matched, it sets thinking: { type: "enabled", budget_tokens: … } — Anthropic's extended / adaptive thinking API. Requests against non-4.x models with reasoningEnabled set are not rejected; the flag is simply not forwarded.
  • Local models (Qwen, Gemma, DeepSeek-R1 via Ollama/LM Studio) emit <think>…</think> blocks or ChatML-style channel delimiters, which the provider strips before returning content to the orchestrator. Those traces are not persisted anywhere.
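Stripping those local reasoning traces might look like the following regex sketch, assuming well-formed tags; the real provider code may differ:

```typescript
// Remove <think>...</think> reasoning blocks before content reaches the orchestrator.
function stripThinking(content: string): string {
  return content.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}
```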

Embedding Models

ScalyClaw's memory system stores every saved memory entry alongside a high-dimensional vector embedding. When the orchestrator retrieves relevant context before an LLM call, it runs a cosine-similarity search using sqlite-vec against those stored vectors. The accuracy of that search depends entirely on the quality of the embedding model you choose.

How Embeddings Are Generated

When a memory entry is saved — either automatically by the orchestrator or explicitly via the save_memory tool — ScalyClaw calls the configured embedding model to convert the text into a float32 vector. That vector is stored in the SQLite database alongside the entry. At retrieval time, the query text is embedded on the fly using the same model, and sqlite-vec finds the nearest stored vectors by cosine distance.
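The similarity metric itself is plain cosine similarity between two vectors; a self-contained reference implementation (sqlite-vec computes this natively, so this is for intuition only):

```typescript
// Cosine similarity of two embedding vectors: dot(a, b) / (|a| * |b|).
// 1 = identical direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The dimension check at the top mirrors why mixing models with different output sizes fails: the two vectors are simply incomparable.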

Switching embedding models is not seamless

The memory_vec virtual table is created with the dimension count of the embedding model enabled at first boot. Switching later to a model with the same dimension count (e.g. OpenAI text-embedding-ada-002 ↔ text-embedding-3-small, both 1536 dimensions) will work, but semantic search quality may degrade because old vectors come from a different model. Switching to a model with a different dimension count will fail the runtime dimension guard inside generateEmbedding. A built-in bulk re-embed tool is not yet implemented; the only clean recovery today is to wipe the SQLite database file and let ScalyClaw rebuild it on next boot (memories are lost unless you've exported them first).

Recommended Models

| Model | Provider | Dimensions | Recommended for |
| --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1536 | Best cost-to-quality ratio for most deployments. Good multilingual support. Default recommendation. |
| text-embedding-3-large | OpenAI | 3072 | Higher accuracy for large, diverse memory stores. Higher cost and storage per entry. |
| text-embedding-ada-002 | OpenAI | 1536 | Legacy model. Use text-embedding-3-small instead for new deployments. |
| nomic-embed-text | Local / Ollama | 768 | Fully local, no API cost. Good quality for English-primary content. Pull with ollama pull nomic-embed-text. |
| mxbai-embed-large | Local / Ollama | 1024 | Higher-quality local embedding. Slightly larger and slower than nomic-embed-text but better recall. |

Embedding Configuration

Embedding models live in config.models.embeddingModels, the same config section as chat models but in a separate array. You can use a different provider for embeddings than for chat — for example, use Anthropic for chat but OpenAI's cheaper embedding API for memory. Each entry shares the same providers map as chat models.

json
{
  "models": {
    "providers": {
      "openai": { "apiKey": "sk-..." }
    },
    "models": [ /* ... chat models ... */ ],
    "embeddingModels": [
      {
        "id": "openai-embed",
        "name": "text-embedding-3-small",
        "provider": "openai",
        "enabled": true,
        "priority": 1,
        "weight": 100,
        "dimensions": 1536,
        "inputPricePerMillion": 0.02,
        "outputPricePerMillion": 0
      }
    ]
  }
}

For a fully local setup with Ollama, add the local provider to the providers map and point the embedding model at it:

json
{
  "models": {
    "providers": {
      "ollama": { "baseUrl": "http://localhost:11434/v1" }
    },
    "models": [ /* ... chat models ... */ ],
    "embeddingModels": [
      {
        "id": "ollama:nomic-embed-text",
        "name": "ollama:nomic-embed-text",
        "provider": "ollama",
        "enabled": true,
        "priority": 1,
        "weight": 100,
        "dimensions": 768,
        "inputPricePerMillion": 0,
        "outputPricePerMillion": 0
      }
    ]
  }
}
Tip

The dimensions field must exactly match what the model actually produces. If you set it wrong, sqlite-vec will reject the insert. Check your model's documentation for the exact output dimension before setting this value.

Budget Control

LLM API calls cost money. ScalyClaw tracks token usage per model and per day, accumulates spending estimates based on the inputPricePerMillion and outputPricePerMillion values you configure on each model, and enforces configurable global daily and monthly limits. Budget is a single global config block — there are no per-model budget caps. You can choose between hard enforcement (block all calls when the limit is reached) or soft enforcement (warn but continue).

Enforcement Modes

| Mode | Behavior when limit is reached | Use when |
| --- | --- | --- |
| Hard stop | All LLM calls are blocked immediately. The system returns an error message to the channel explaining the budget limit has been reached. No calls go out until the limit resets (midnight UTC for daily, first of month for monthly). | Production deployments with strict cost controls, shared installations, or when you want to guarantee a monthly maximum spend. |
| Soft warn | LLM calls continue normally. A warning is emitted to the dashboard logs and, optionally, to a configured alert channel. The system does not stop; it only signals that the threshold has been crossed. | Personal deployments where uninterrupted service matters more than strict spend enforcement, or when you want visibility without interruption. |

Per-Model Cost Tracking

Every LLM call records the number of input tokens, output tokens, and the estimated cost in USD using the pricing table ScalyClaw maintains for each known model. Costs are stored in Redis and aggregated by day and by month. The dashboard usage page displays:

  • Daily and monthly spend broken down by model
  • Token usage histograms per model per day
  • Budget consumption as a percentage of configured limits
  • A list of the most expensive individual requests

For models with custom or unknown pricing (e.g., local models or new provider releases), set inputPricePerMillion and outputPricePerMillion directly on the model entry in config.models.models. ScalyClaw uses those figures for all cost tracking and budget accounting for that model. Set both to 0 for free local models.
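The cost arithmetic per call is straightforward; a sketch of the estimate, with an illustrative function name:

```typescript
// Estimated USD cost of one call from the per-million prices on the model entry.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  inputPricePerMillion: number,
  outputPricePerMillion: number
): number {
  return (
    (inputTokens / 1_000_000) * inputPricePerMillion +
    (outputTokens / 1_000_000) * outputPricePerMillion
  );
}
```

With the example pricing above (2.50 in / 10.00 out), a call using 200k input tokens and 50k output tokens costs roughly $0.50 + $0.50 = $1.00.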

Budget Configuration

json
{
  "budget": {
    "monthlyLimit": 150,
    "dailyLimit": 10,
    "hardLimit": true,
    "alertThresholds": [50, 80, 90]
  }
}

Budget is a single global block — there are no per-model caps. The fields are:

  • monthlyLimit — maximum USD spend per calendar month. Set to 0 for unlimited.
  • dailyLimit — maximum USD spend per day (resets at midnight UTC). Set to 0 for unlimited.
  • hardLimit — when true, all LLM calls are blocked once a limit is reached. When false, the system continues but emits warnings.
  • alertThresholds — array of percentage values (e.g. [50, 80, 90]). A warning is emitted to dashboard logs and any configured alert channels each time cumulative spend crosses one of these thresholds — giving you advance notice before a hard stop occurs.
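Putting the fields together, enforcement can be sketched as a simple decision function (the names and return values are hypothetical, not the real API):

```typescript
interface BudgetConfig {
  monthlyLimit: number; // USD per calendar month; 0 = unlimited
  dailyLimit: number;   // USD per day; 0 = unlimited
  hardLimit: boolean;   // true = block at the limit, false = warn only
}

type BudgetDecision = "ok" | "warn" | "block";

// Decide whether the next LLM call proceeds, proceeds with a warning, or is blocked.
function checkBudget(
  budget: BudgetConfig,
  dailySpendUsd: number,
  monthlySpendUsd: number
): BudgetDecision {
  const overLimit =
    (budget.dailyLimit > 0 && dailySpendUsd >= budget.dailyLimit) ||
    (budget.monthlyLimit > 0 && monthlySpendUsd >= budget.monthlyLimit);
  if (!overLimit) return "ok";
  return budget.hardLimit ? "block" : "warn";
}
```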

Custom Pricing

Set inputPricePerMillion and outputPricePerMillion directly on any model entry. ScalyClaw uses those values for all cost tracking for that model:

json
{
  "id": "custom:my-gateway-gpt4o",
  "name": "custom:my-gateway-gpt4o",
  "provider": "custom",
  "enabled": true,
  "priority": 1,
  "weight": 100,
  "temperature": 0.7,
  "maxTokens": 4096,
  "contextWindow": 128000,
  "toolEnabled": true,
  "imageEnabled": true,
  "audioEnabled": false,
  "videoEnabled": false,
  "documentEnabled": false,
  "reasoningEnabled": false,
  "inputPricePerMillion": 2.50,
  "outputPricePerMillion": 10.00
}
Tip

Set monthlyLimit and dailyLimit conservatively with hardLimit: true in production. The alertThresholds array lets you get warnings at e.g. 50%, 80%, and 90% of the limit so you can react before the system blocks calls entirely.