LLM Servers
There are two serving runtimes for Chariot LLMs: Hugging Face Pipelines and vLLM. While Hugging Face Pipelines can run virtually every Text Generation and Conversational LLM, they are not the most efficient in terms of memory management and throughput. On the other hand, vLLM is a state-of-the-art LLM inference engine with many optimizations—such as PagedAttention, continuous batching, and optimized CUDA kernels—but it doesn't support every open-source LLM. To see whether your model is supported by vLLM, consult their documentation. Generally, vLLM supports most popular LLM architectures and adds new support frequently. The default inference engine for LLMs in Chariot is Hugging Face Pipelines.
Configuring vLLM
When creating a vLLM Inference Server for an LLM in Chariot, a few pre-defined vLLM configuration options are available.
Option name | Explanation | Possible values | Default value |
---|---|---|---|
bitsandbytes_4bit | Whether to load the model with in-flight 4-bit quantization | True or False | False |
max_model_length | The desired context length for the model | Integer > 0 | Determined by model config |
enable_prefix_caching | Whether to enable prefix caching | True or False | False |
seed | Initial seed to use for generation | Integer > 0 | 0 |
More details about these options can be found in the vLLM documentation.
To start a vLLM Inference Server and configure it with one or more of these settings, use the `vllm_config` keyword:
```python
from chariot.client import connect
from chariot.models import Model

connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)

vllm_config = {"max_model_length": 8000, "bitsandbytes_4bit": True}

model.start_inference_server(
    cpu="7",
    memory="57Gi",
    min_replicas=1,
    gpu_count=1,
    gpu_type="Tesla-V100-SXM2-16GB",
    inference_engine="vLLM",
    vllm_config=vllm_config,
)
```
Configuring Hugging Face Pipelines
When creating a Hugging Face Pipeline Inference Server for Hugging Face models within Chariot, additional keyword arguments—called "kwargs"—can be provided to customize the model loading process. These keyword arguments are passed directly to the from_pretrained method when the model gets loaded for the pipeline.
In the Create Inference Server modal, a list of Hugging Face kwargs is provided at the bottom of the modal, along with each kwarg's value. To add a new kwarg, click the + Add new button and provide the necessary information.
Added kwargs may be removed by clicking the trash can icon to the right of the kwarg's value. Previously added kwargs are retained, so when additional Inference Servers are created, they will be listed as available.
Keyword arguments supplied in this way are not validated, so be sure to type them correctly to prevent issues.
Once your Hugging Face kwargs have been added as intended, click the Submit button to finish creating your Inference Server.
Use the `huggingface_model_kwargs` argument when creating an Inference Server with the Chariot SDK to supply the kwargs to the model load function.
```python
import chariot.client
from chariot.models import Model

chariot.client.connect()

model = Model(
    name="<NAME OF MODEL>",
    # One of `project_id` or `project_name` is required.
    project_id="<PROJECT ID>",
    project_name="<PROJECT NAME>",
    start_server=False,
)

model.start_inference_server(
    huggingface_model_kwargs={"attention_dropout": 0.05}
)
```
By supplying your additional keyword arguments when creating your Inference Server, you can leverage the full flexibility of Hugging Face models in the Chariot platform and tailor them to your specific needs. For a comprehensive list of available kwargs, please refer to the from_pretrained Hugging Face docs for your specific model type.
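As another illustration, here is a hedged sketch passing a couple of common `from_pretrained` kwargs. `torch_dtype` and `attn_implementation` are standard Hugging Face loading options, but whether they apply (and which values are valid) depends on your specific model architecture, so treat this as an example rather than a recommendation:

```python
# Illustrative only: these kwargs are forwarded to `from_pretrained`; check your
# model type's documentation to confirm they are supported before using them.
model.start_inference_server(
    huggingface_model_kwargs={
        "torch_dtype": "float16",       # load weights in half precision to reduce GPU memory
        "attn_implementation": "sdpa",  # use PyTorch scaled-dot-product attention if available
    }
)
```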
In-Flight Quantization
As mentioned above, it is likely that your model won't fit on the GPUs you have access to in your Chariot instance, even for 3B models. To alleviate this, both of our serving runtimes support in-flight inference quantization via `bitsandbytes`. Quantizing a model's weights to lower precision allows the model to fit on a smaller GPU for inference.
Chariot does not yet support serving pre-quantized models, which are models whose weights were quantized before upload. Examples of pre-quantization strategies include GGUF, AWQ, and GPTQ. Pre-quantized models typically have the quantization strategy in their name (e.g., Llama-3.2-3B-Instruct-GGUF), so you'll know if your model is pre-quantized. Therefore, when you upload an LLM to Chariot, make sure it is an unquantized model. You'll have the option to quantize it in flight when you start the Inference Server, as shown below.
The following is an example of how to start an Inference Server with in-flight quantization in the SDK:
```python
from chariot.client import connect
from chariot.models import Model

connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)

model.start_inference_server(
    cpu="7",
    memory="57Gi",
    min_replicas=1,
    gpu_count=1,
    gpu_type="Tesla-V100-SXM2-16GB",
    quantization_bits=4,
    inference_engine="Huggingface",  # use inference_engine="vLLM" to use vLLM instead
)
```
Note that you could equivalently pass `huggingface_model_kwargs={"load_in_4bit": True}` or `vllm_config={"bitsandbytes_4bit": True}`, depending on which serving runtime (inference engine) you specify.
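For reference, a minimal sketch of those two equivalent forms is shown below; the resource arguments (cpu, memory, gpu_count, gpu_type) are omitted here for brevity and should be supplied as in the example above:

```python
# Two equivalent ways to request 4-bit in-flight quantization, depending on the engine.
# Use one or the other, not both; resource arguments are omitted for brevity.

# Hugging Face Pipelines engine:
model.start_inference_server(
    inference_engine="Huggingface",
    huggingface_model_kwargs={"load_in_4bit": True},
)

# vLLM engine:
model.start_inference_server(
    inference_engine="vLLM",
    vllm_config={"bitsandbytes_4bit": True},
)
```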
Our Hugging Face Pipeline runtime supports 8-bit and 4-bit quantization, while vLLM supports 4-bit quantization. As a rough benchmark, an 8-bit quantized 7B model needs about 8-10GB of VRAM, and a 4-bit quantized 7B model needs about 5-6GB of VRAM. In addition, about 25Gi of CPU memory is needed (sometimes more for larger models, such as 13B models).
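The following back-of-the-envelope sketch shows where those VRAM figures come from; the fixed overhead allowance is an assumption, and actual usage also depends on context length, batch size, and the engine:

```python
def estimate_vram_gb(num_params_billion: float, bits: int, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weight storage plus a fixed allowance for activations,
    KV cache, and framework overhead (the allowance is an assumption)."""
    weight_gb = num_params_billion * bits / 8  # 1B params at 8 bits is roughly 1GB of weights
    return weight_gb + overhead_gb

print(estimate_vram_gb(7, 8))  # ~8.5 GB, consistent with the 8-10GB figure above
print(estimate_vram_gb(7, 4))  # ~5.0 GB, consistent with the 5-6GB figure above
```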
Some LLMs are known to have quirks when you quantize them, which can result in errors. In our experience, 8-bit quantization can be more problematic than 4-bit quantization.
Performing Inferences
There are two task types for LLMs, and how you perform inference depends on which task type your model has.
Text Generation Models
For Text Generation models, use the `complete` inference method and simply pass the string you wish to complete:
```python
from chariot.client import connect
from chariot.models import Model

connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)

# Assume inference server is already started.
response = model.complete("The cat in the")
print(response.choices[0].text)  # "Hat"
```
Conversational Models
For Conversational models, use the `chat` inference method and pass a sequence of user/assistant messages:
```python
from chariot.client import connect
from chariot.models import Model

connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)

# Assume inference server is already started.
messages = [
    {"role": "user", "content": "Can you help me with my geography homework?"},
    {"role": "assistant", "content": "Sure! What do you need help with?"},
    {"role": "user", "content": "I need to identify the similarities and differences between tributaries and estuaries"},
]

response = model.chat(messages)
print(response.choices[-1].message.content)  # "Tributaries and estuaries are both important features ..."
```
For `chat`, the messages must alternate between `user` and `assistant` roles; otherwise, an error will be raised.
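For instance, here is a minimal sketch of carrying the conversation forward while keeping the required alternation: append the assistant's reply to the history before adding the next user message (the follow-up question is just an illustrative placeholder):

```python
# Continue the conversation: append the assistant's reply before the next user turn
# so the history keeps alternating between "user" and "assistant" roles.
reply = response.choices[-1].message.content
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Can you give one example of each?"})  # hypothetical follow-up

response = model.chat(messages)
print(response.choices[-1].message.content)
```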
In both cases, the response objects returned by `complete` and `chat` are compatible with the OpenAI API standard, meaning the response format is identical to what the OpenAI API (e.g., ChatGPT accessed programmatically) returns.
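As an illustration, the standard OpenAI-style fields should be accessible on the response object; this is a sketch assuming the server populates the usual fields such as `model`, `usage`, and `finish_reason`:

```python
# Inspect standard OpenAI-style response fields (assuming the server populates them).
print(response.model)                     # name of the model that served the request
print(response.usage.total_tokens)        # prompt + completion token counts
print(response.choices[0].finish_reason)  # e.g., "stop" or "length"
```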
Inference Parameters
Inference parameters are also supported. For example, if you wanted to use `temperature=0.2` for your inference call, pass it as a keyword argument:

```python
model.chat(messages, temperature=0.2)
```
The available inference parameters for both Text Generation and Conversational task types are described below. Note that some options are only supported in vLLM runtimes.
Parameter Name | Description | Supported With Hugging Face Pipelines | Supported With vLLM |
---|---|---|---|
max_completion_tokens | Maximum number of new tokens for the model to generate. If using the Hugging Face engine, the default value is 50. | ✅ | ✅ |
min_tokens | Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated. | ✅ | ✅ |
presence_penalty | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens while values < 0 encourage the model to repeat tokens. | ❌ | ✅ |
frequency_penalty | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens while values < 0 encourage the model to repeat tokens. | ❌ | ✅ |
repetition_penalty | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens while values < 1 encourage the model to repeat tokens. | ✅ | ✅ |
temperature | Float that controls the randomness of the sampling. Lower values make the model more deterministic while higher values make the model more random. Zero means greedy sampling. | ✅ | ✅ |
top_p | Float that controls the cumulative probability of the top tokens to consider. Must be in [0, 1]. Set to 1 to consider all tokens. | ✅ | ✅ |
top_k | Integer that controls the number of top tokens to consider. | ✅ | ✅ |
min_p | Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this. | ✅ | ✅ |
seed | Random seed to use for the generation. | ❌ | ✅ |
use_beam_search | Whether to use beam search. | ✅ | ✅ |
best_of | Number of output sequences that are generated from the prompt. From these best_of sequences, the best sequence is returned. best_of must be greater than or equal to the number of returned sequences. This is treated as the beam width when use_beam_search is True. By default, best_of is set to 1. | ✅ | ✅ |
length_penalty | Float that penalizes sequences based on their length. Used in beam search. length_penalty > 0.0 promotes longer sequences while length_penalty < 0.0 encourages shorter sequences. | ✅ | ✅ |
stop | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. | ❌ | ✅ |
stop_token_ids | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. | ✅ | ✅ |
include_stop_str_in_output | Whether to include the stop strings in output text. Defaults to False. | ❌ | ✅ |
ignore_eos | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | ❌ | ✅ |
logprobs | Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. | ❌ | ✅ |
top_logprobs | Number of log probabilities to return per prompt token. | ❌ | ✅ |
do_sample | Whether or not to use sampling. Only a valid option for Hugging Face engine. | ✅ | ❌ |
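For example, here is a hedged sketch combining several of the parameters above on a vLLM-backed Inference Server; the specific values are illustrative only:

```python
# Illustrative only: sampling parameters from the table above, several of which
# (seed, stop, logprobs) require the vLLM inference engine.
response = model.chat(
    messages,
    max_completion_tokens=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    seed=42,        # vLLM only
    stop=["\n\n"],  # vLLM only
    logprobs=5,     # vLLM only
)
print(response.choices[-1].message.content)
```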