Model Aliases

Model Aliases in Bodhi App provide a streamlined way to manage and apply LLM configurations for inference.

There are two kinds of model aliases:

  1. User Defined Model Alias
  2. GGUF Model File Defined Alias

User Defined Model Alias

A User Defined Model Alias is essentially a YAML configuration file that contains default request parameters as well as server (context) parameters. This approach makes it easy to reuse and switch between specific model setups without having to reconfigure complex settings each time.

Sample Model Alias YAML File

All User Defined Model Aliases can be found in the $BODHI_HOME/aliases folder. A sample model alias file is shown below:

alias: llama3:instruct
repo: QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
filename: Meta-Llama-3-8B-Instruct.Q8_0.gguf
snapshot: 5007652f7a641fe7170e0bad4f63839419bd9213
context_params:
  - '--ctx-size 2048'
  - '--threads 4'
  - '--parallel 1'
  - '--n-predict 4096'
  - '--n-keep 24'
request_params:
  temperature: 0.7
  frequency_penalty: 0.8
  stop:
    - <|start_header_id|>
    - <|end_header_id|>
    - <|eot_id|>

A Model Alias YAML file includes the following keys:

  • alias: (required) A unique name for your model configuration. This is used to reference the model in chat settings and API calls.

  • repo: (required) The source repository (typically from HuggingFace).

  • filename: (required) The specific GGUF model file used.

  • snapshot: (optional) Controls which version/snapshot of a model to use. Leave blank for the latest version, or specify a commit hash for a specific snapshot. This allows you to pin to a specific model version for reproducibility.

  • context_params: (optional) Array of llama-server command-line arguments for inference configuration. Each argument should be a complete flag with its value.

    • Common arguments:
      • --ctx-size <n>: Maximum context size in tokens (e.g., --ctx-size 2048)
      • --threads <n>: Number of CPU threads to use (e.g., --threads 4)
      • --parallel <n>: Number of parallel requests (e.g., --parallel 1)
      • --n-predict <n>: Maximum tokens to generate (e.g., --n-predict 4096)
      • --n-keep <n>: Tokens to keep from initial prompt (e.g., --n-keep 24)

    Advanced Configuration: For a complete list of available llama-server arguments, see the llama.cpp server documentation. Additional server parameters can also be configured through the Settings dashboard; see the App Settings page.

  • request_params: Default request parameters, applied when the request does not specify them:

    • frequency_penalty: Reduces repetition.
    • max_tokens: Limits the length of responses.
    • presence_penalty: Encourages topic diversity.
    • seed: Ensures reproducible outputs.
    • stop: Up to four sequences that, when encountered, halt the response.
    • temperature: Adjusts the randomness of responses.
    • top_p: Nucleus sampling threshold; limits sampling to the most probable tokens whose cumulative probability reaches top_p.
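
Only alias, repo, and filename are required. A minimal alias that reuses the repository and file from the sample above could therefore look like the following (the alias name llama3:minimal is illustrative; the omitted optional keys are assumed to fall back to the app's defaults):

alias: llama3:minimal
repo: QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
filename: Meta-Llama-3-8B-Instruct.Q8_0.gguf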

GGUF Model File Defined Alias

A GGUF Model File Defined Alias leverages the metadata embedded in the GGUF file. In this case, all the default request and context parameters are used, and you cannot override them. This method is the quickest, most direct way to run a model within the app.

The model alias ID for this type is typically a combination of the model repository and the quantization level. For example, for the repo QuantFactory/Meta-Llama-3-8B-Instruct-GGUF and the filename Meta-Llama-3-8B-Instruct.Q8_0.gguf, the model alias ID would be:

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF:Q8_0

How Model Aliases Work

For a User Defined Model Alias, when you reference the alias ID in your chat settings or API calls (using the model parameter), Bodhi App will:

  1. Launch the LLM inference server (if not already running) with the context_params command-line arguments.
  2. Apply the request_params as the default settings on the request.
  3. Forward the request to the inference server.
  4. Stream back the response received from the inference server.
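
For step 2, the alias's request_params act as fallbacks: values supplied in the request take precedence, and the alias fills in whatever the request omits. A sketch of the effective request for the llama3:instruct alias above, assuming the caller only sets temperature:

# Incoming request sets temperature: 0.2; the alias supplies the rest.
temperature: 0.2          # from the request, overriding the alias default of 0.7
frequency_penalty: 0.8    # from the alias, since the request omitted it
stop:                     # from the alias
  - <|start_header_id|>
  - <|end_header_id|>
  - <|eot_id|>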

Similarly, for a GGUF Model File Defined Alias, the process is:

  1. Launch the LLM inference server, if not already running, with default server settings.
  2. Forward the request to the inference server.
  3. Stream back the response.

This approach offers several advantages:

  • Simplicity: Manage complex configuration details with a single, easy-to-reference alias.
  • Speed: Quickly start running inferences against a downloaded GGUF file without additional configuration.
  • Consistency: Ensure that the same parameters are applied across multiple chat sessions or API interactions.
  • Flexibility: Easily update your configurations via the UI or API, with the server restarting to apply new settings.

Models Page

The Models page provides a unified view of all available models in Bodhi App, displaying three types of models in a single interface:

Model Types Displayed:

  1. User Defined Model Aliases: Custom YAML configurations with tailored request and context parameters. These aliases allow you to save specific inference settings for easy reuse. See User Defined Model Alias section above.

  2. GGUF Model File Defined Aliases: Direct references to downloaded GGUF model files. These use embedded metadata for automatic configuration with default parameters. See GGUF Model File Defined Alias section above.

  3. API Models: Cloud-based models from providers like OpenAI, Anthropic, Groq, Together AI, and others. These models provide access to frontier AI capabilities without local hardware requirements. For details on configuring API models, see API Models.

Page Features:

  • Browse all models in a unified table view
  • Edit user-defined model aliases directly from the page
  • Start a chat with any model using the action buttons
  • Copy configuration details with hover-to-copy buttons on column values
  • View model source badges (local "model" badge vs. cloud "API" badge)


Model Alias Form

You can access the New Model Alias form directly from the Models page.

Model Alias Form showing context parameters as command-line arguments

Key Form Fields:

  • Alias: Unique identifier for your model configuration
  • Repo/Filename/Snapshot: Model source from HuggingFace
  • Context Parameters: llama-server CLI arguments (one per line, in the format --flag value)
  • Request Parameters: Default inference parameters (collapsible section)
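
As an example, the Context Parameters field for the sample llama3:instruct alias above would be entered one argument per line:

--ctx-size 2048
--threads 4
--parallel 1
--n-predict 4096
--n-keep 24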

Best Practices and Reference Configurations

Bodhi App's Model Alias system is designed to simplify advanced model configuration. By leveraging aliases, you can ensure that each chat session uses a clear, consistent setup tailored to your requirements.

Performance Considerations

When configuring model aliases, consider these key performance factors:

Memory Usage vs Thread Count

  • Higher thread counts (--threads) can improve inference speed
  • But each thread requires additional memory
  • Recommended: Start with --threads equal to half the number of CPU cores

Context Size Impact

  • Larger context sizes (--ctx-size) allow for longer conversations
  • But increase memory usage and initial load time
  • Recommended: Start with 2048 tokens and adjust based on needs

Quantization Effects

  • Lower bit models (Q4_K_M) use less memory but may reduce quality
  • Higher bit models (Q8_0) provide better quality but use more memory
  • Recommended: Test different quantization levels for your use case
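
Switching quantization levels is simply a matter of pointing the filename key at a different file in the same repository, for example (filenames are illustrative; check the repository for the exact names):

# Pick one filename per alias; lower-bit files trade quality for memory.
filename: Meta-Llama-3-8B-Instruct.Q4_K_M.gguf   # smaller, lower memory use
filename: Meta-Llama-3-8B-Instruct.Q8_0.gguf     # larger, higher quality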

Optimization Tips

  • Set --parallel based on your expected concurrent usage
  • Use --n-keep to maintain important context while reducing memory usage
  • Consider using stop sequences to prevent unnecessary token generation
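
Putting these recommendations together, a starting configuration for a 4-core machine serving one request at a time might look like this (all values are illustrative starting points, not tuned numbers):

context_params:
  - '--ctx-size 2048'   # larger contexts raise memory use and load time
  - '--threads 2'       # roughly half the CPU cores
  - '--parallel 1'      # match expected concurrent usage
  - '--n-keep 24'       # keep the important part of the initial prompt
request_params:
  stop:
    - <|eot_id|>        # stop sequences avoid unnecessary token generation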

Happy configuring!