LLM Servers
There are two serving runtimes for Chariot LLMs: Hugging Face Pipelines and vLLM. While Hugging Face Pipelines can run virtually every Text Generation and Conversational LLM, they are not the most efficient in terms of memory management and throughput. On the other hand, vLLM is a state-of-the-art LLM inference engine with many optimizations—such as PagedAttention, continuous batching, and optimized CUDA kernels—but it doesn't support every open source LLM. To see whether your model is supported by vLLM, consult their documentation. Generally, vLLM supports most popular LLM architectures and adds new support frequently. The default inference engine for LLMs in Chariot is Hugging Face Pipelines.
Configuring vLLM
When creating a vLLM Inference Server for an LLM in Chariot, several vLLM configuration options are available.
| Option name | Explanation | Possible values | Default value |
|---|---|---|---|
| bitsandbytes_4bit | Whether to load the model with in-flight 4-bit quantization | True or False | False |
| max_model_length | The desired context length for the model | Integer > 0 | Determined by model config |
| enable_prefix_caching | Whether to enable prefix caching | True or False | False |
| seed | Initial seed to use for generation | Integer ≥ 0 | 0 |
| enforce_eager | Enforce eager mode (don't use CUDA graphs) | True or False | False |
| enable_auto_tool_choice | Enable the model to generate its own tool calls when it deems appropriate | True or False | False |
| tool_call_parser | Parser for model tool calls (required for tool calling) | String | None |
| kv_cache_memory_bytes | Desired memory (in bytes) to be allocated toward the KV cache | String | None |
| quantization | Method used to quantize the weights | String | Determined by model config |
| load_format | The format of the model weights to load | String | "auto" |
| gpu_memory_utilization | The fraction of GPU memory to be used for the model executor (between 0 and 1) | Float | 0.9 |
More details about these options can be found in the vLLM documentation.
- The bitsandbytes_4bit option is not a vLLM engine argument. Instead, it is a proxy for passing quantization=bitsandbytes and load_format=bitsandbytes.
- In order to enable tool calling in vLLM, you must pass enable_auto_tool_choice=True and a valid value for tool_call_parser. The correct parser depends on the model you are serving. More details can be found here.
To start a vLLM Inference Server and configure it with one or more of these settings using the SDK, set the inference_engine field to vLLM and set the vllm_configuration field to your desired vLLM settings. For example:
from chariot.client import connect
from chariot.models import Model
connect()
model = Model(name="ModelName", project="ModelProject", start_server=False)
settings = {
"inference_engine": "vLLM",
"predictor_cpu": "3",
"predictor_ephemeral_storage": "20Gi",
"predictor_gpu": {
"count": 1,
"product": "Tesla-T4"
},
"predictor_memory": "14Gi",
"scale_down_delay_seconds": 3600,
"vllm_configuration": {
"bitsandbytes_4bit": True,
"enable_prefix_caching": True,
"enable_auto_tool_choice": True,
"tool_call_parser": "mistral",
"gpu_memory_utilization": 0.8
}
}
model.set_inference_server_settings(settings)
model.start_inference_server()
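After calling start_inference_server, the server can take a while to load the model and become ready. If you want to block until it is ready, you can reuse the model object from the example above with the wait_for_inference_server method (which also appears in the OpenAI example later in this section). A minimal sketch, assuming the timeout is in seconds:
model.wait_for_inference_server(timeout=600)  # wait up to 600 seconds (assumed units) for the server to report ready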
Configuring Hugging Face Pipelines
When creating a Hugging Face Pipeline Inference Server for Hugging Face models within Chariot, additional keyword arguments—called "kwargs"—can be provided to customize the model loading process. These keyword arguments are passed directly to the from_pretrained method when the model gets loaded for the pipeline.
UI
In the Create Inference Server modal, a list of Hugging Face kwargs and their values is provided at the bottom. To add a new kwarg, click the + Add new button and provide the necessary information.
Added kwargs can be removed by clicking the trash can icon to the right of the kwarg's value. Previously added kwargs are retained, so when additional Inference Servers are created, they will be listed as available.
Keyword arguments supplied in this way are not validated, so be sure to type them correctly to prevent issues.
Once your Hugging Face kwargs have been added as intended, click the Submit button to finish creating your Inference Server.

SDK
Use the huggingface_model_kwargs argument when creating the Inference Server settings with the Chariot SDK to supply the kwargs to the model load function.
from chariot.client import connect
from chariot.models import Model
connect()
model = Model(name="ModelName", project="ModelProject", start_server=False)
settings = {
"huggingface_model_kwargs": {"attention_dropout": 0.05}
}
model.set_inference_server_settings(settings)
model.start_inference_server()
By supplying your additional keyword arguments when creating your Inference Server, you can leverage the full flexibility of Hugging Face models in the Chariot platform and tailor them to your specific needs. For a comprehensive list of available kwargs, please refer to the from_pretrained Hugging Face docs for your specific model type.
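For instance, two kwargs that from_pretrained accepts for most model classes are torch_dtype and trust_remote_code. The values below are illustrative only, so confirm that each kwarg applies to your specific model before using it; this sketch reuses the model object from the SDK example above:
settings = {
    "huggingface_model_kwargs": {
        "torch_dtype": "float16",    # load the weights in half precision
        "trust_remote_code": True    # allow custom modeling code shipped with the model repository
    }
}
model.set_inference_server_settings(settings)
model.start_inference_server()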
In-Flight Quantization
As mentioned above, it is possible that your model won't fit on the GPUs that you have access to in your Chariot instance. To alleviate this, both of our serving runtimes support in-flight inference quantization via bitsandbytes. Quantizing a model's weights to lower precision will allow them to fit into a smaller GPU for inference.
In-flight quantization is different from serving pre-quantized models, which are models whose weights were quantized before upload. Examples of pre-quantization strategies include GGUF, AWQ, and GPTQ. Pre-quantized models typically have the quantization strategy in their name (e.g., Llama-3.2-3B-Instruct-GGUF), so you'll know whether your model is pre-quantized. If you'd like to serve a pre-quantized model, you must use vLLM and specify the correct load_format.
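As a rough sketch of what that looks like for a GGUF checkpoint (the load_format value below is an assumption; confirm the right load_format and quantization values for your model against the vLLM documentation):
settings = {
    "inference_engine": "vLLM",
    "vllm_configuration": {
        "load_format": "gguf"  # assumed value for a GGUF checkpoint; adjust to match your model's format
    }
}
model.set_inference_server_settings(settings)
model.start_inference_server()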
The following is an example of how to start a Hugging Face Pipelines Inference Server with in-flight quantization in the SDK:
from chariot.client import connect
from chariot.models import Model
connect()
model = Model(name="ModelName", project="ModelProject", start_server=False)
settings = {
"inference_engine": "Huggingface",
"predictor_cpu": "7",
"predictor_ephemeral_storage": "20Gi",
"predictor_gpu": {
"count": 1,
"product": "Tesla-V100-SXM2-16GB"
},
"predictor_memory": "57Gi",
"huggingface_model_kwargs": {"load_in_4bit": True} # Quantization
}
model.set_inference_server_settings(settings)
model.start_inference_server()
Note that you could equivalently pass vllm_configuration = {"bitsandbytes_4bit": True} if you wanted to use vLLM instead of Hugging Face.
Our Hugging Face Pipeline runtime supports 8-bit and 4-bit quantization, and vLLM supports 4-bit quantization. As a rough benchmark, an 8-bit quantized 7B model will need about 8-10GB of VRAM, and a 4-bit quantized 7B model will need about 5-6GB of VRAM. In addition, about 25Gi of CPU memory is needed (sometimes more, like with 13B models).
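Those figures line up with a back-of-the-envelope estimate of the weight size (parameter count times bits per parameter), plus a few GB of overhead for the KV cache, activations, and CUDA context. A rough sketch of that arithmetic:
def estimate_weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    # Size of the quantized weights alone, in GB (KV cache and activation overhead not included)
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(estimate_weight_vram_gb(7, 8))  # 7.0 -> roughly the 8-10GB figure once overhead is added
print(estimate_weight_vram_gb(7, 4))  # 3.5 -> roughly the 5-6GB figure once overhead is added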
Some LLMs are known to have quirks when you quantize them, which can result in errors. In our experience, 8-bit quantization can be more problematic than 4-bit quantization.
Performing Inferences
There are two task types for LLMs, so the way you should perform inference depends on the task type.
Text Generation Models
For Text Generation models, use the complete inference method, and simply pass the string you wish to complete:
from chariot.client import connect
from chariot.models import Model
connect()
model = Model(name="ModelName", project="ModelProject", start_server=False)
# Assume inference server is already started and ready
response = model.complete("The cat in the")
print(response.choices[0].text) # "Hat"
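Generation defaults can be overridden per call. For example, since the Hugging Face engine defaults to 50 new tokens (see the Inference Parameters table below), you might request a longer, less deterministic completion; a minimal sketch:
response = model.complete("The cat in the", max_completion_tokens=100, temperature=0.7)
print(response.choices[0].text)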
Conversational Models
For Conversational models, use the chat inference method, and pass a sequence of user/assistant messages:
from chariot.client import connect
from chariot.models import Model
connect()
model = Model(name="ModelName", project="ModelProject", start_server=False)
# Assume inference server is already started and ready
messages = [
{"role": "user", "content": "Can you help me with my geography homework?"},
{"role": "assistant", "content": "Sure! What do you need help with?"},
{"role": "user", "content": "I need to identify the similarities and differences between tributaries and estuaries"}
]
response = model.chat(messages)
print(response.choices[-1].message.content) # "Tributaries and estuaries are both important features ..."
In both cases, the response objects coming from complete and chat are compatible with the OpenAI API standard.
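Because the responses follow the OpenAI schema, the usual fields are available on them; which fields a given server populates is not guaranteed, so treat this as a sketch:
print(response.model)                     # name of the served model
print(response.choices[0].finish_reason)  # e.g. "stop" or "length"
print(response.usage)                     # token counts, if reported by the server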
Inference Parameters
Inference parameters are also supported in the SDK. For example, to use temperature = 0.2 for your inference call, pass it as a keyword argument:
model.chat(messages, temperature=0.2)
The available inference parameters for both Text Generation and Conversational task types are described below. Note that some options are only supported in vLLM runtimes.
| Parameter Name | Description | Supported With Hugging Face Pipelines | Supported With vLLM |
|---|---|---|---|
| max_completion_tokens | Maximum number of new tokens for the model to generate. If using the Hugging Face engine, the default value is 50. | ✅ | ✅ |
| min_tokens | Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated. | ✅ | ✅ |
| presence_penalty | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens while values < 0 encourage the model to repeat tokens. | ❌ | ✅ |
| frequency_penalty | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens while values < 0 encourage the model to repeat tokens. | ❌ | ✅ |
| repetition_penalty | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens while values < 1 encourage the model to repeat tokens. | ✅ | ✅ |
| temperature | Float that controls the randomness of the sampling. Lower values make the model more deterministic while higher values make the model more random. Zero means greedy sampling. | ✅ | ✅ |
| top_p | Float that controls the cumulative probability of the top tokens to consider. Must be in [0, 1]. Set to 1 to consider all tokens. | ✅ | ✅ |
| top_k | Integer that controls the number of top tokens to consider. | ✅ | ✅ |
| min_p | Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this. | ✅ | ✅ |
| seed | Random seed to use for the generation. | ❌ | ✅ |
| use_beam_search | Whether to use beam search. | ✅ | ✅ |
| best_of | Number of output sequences that are generated from the prompt. From these best_of sequences, the best sequence is returned. best_of must be greater than or equal to the number of returned sequences. This is treated as the beam width when use_beam_search is True. By default, best_of is set to 1. | ✅ | ✅ |
| length_penalty | Float that penalizes sequences based on their length. Used in beam search. length_penalty > 0.0 promotes longer sequences while length_penalty < 0.0 encourages shorter sequences. | ✅ | ✅ |
| stop | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. | ❌ | ✅ |
| stop_token_ids | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. | ✅ | ✅ |
| include_stop_str_in_output | Whether to include the stop strings in output text. Defaults to False. | ❌ | ✅ |
| ignore_eos | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | ❌ | ✅ |
| logprobs | Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. | ❌ | ✅ |
| top_logprobs | Number of log probabilities to return per prompt token. | ❌ | ✅ |
| do_sample | Whether or not to use sampling. Only a valid option for Hugging Face engine. | ✅ | ❌ |
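A combined sketch using several of these parameters with the messages list from the Conversational example above (note that stop is listed as vLLM-only, so drop it when using the Hugging Face engine):
response = model.chat(
    messages,
    temperature=0.0,              # greedy sampling
    max_completion_tokens=256,
    top_p=0.95,
    repetition_penalty=1.1,
    stop=["\n\n"]                 # vLLM-only, per the table above
)
print(response.choices[-1].message.content)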
Using the OpenAI Python SDK
Chariot's vLLM Inference Servers are compatible with OpenAI's API specification, meaning that you can swap any OpenAI model with a Chariot vLLM Inference Server by setting the inference URL to be the Chariot model's external URL.
Below is an example of using the openai Python SDK to make a call to a Chariot vLLM server, including tools.
import openai
import json
from chariot.client import connect
from chariot.models import Model
from chariot.config import get_bearer_token
connect()
# Get Chariot model inference URL
model = Model(name="ModelName", project="ModelProject", start_server=False)
model.wait_for_inference_server(timeout=100) # Assume the server is already started and ready
url = model._external_base_url
model_name = f"m-{model.id.lower()}"
# Initialize the OpenAI client
client = openai.OpenAI(base_url=url, api_key=get_bearer_token())
# Define a get_weather tool
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a given location in the US.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city"
},
"state": {
"type": "string",
"description": "The state"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use"
}
},
"required": ["city", "state"]
}
}
}
]
# Make an inference call using the OpenAI client
messages = [
{"role": "user", "content": "What is the weather in Austin, TX?"}
]
response = client.chat.completions.create(
model=model_name,
messages=messages,
tools=tools,
tool_choice="auto",
)
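If the model decided to call the tool, the response carries the call in OpenAI's tool_calls format, with the arguments serialized as a JSON string (which is why json is imported above). A sketch of reading it back out:
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)                   # e.g. "get_weather"
print(json.loads(tool_call.function.arguments))  # e.g. {"city": "Austin", "state": "TX"}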
If your application doesn't directly instantiate an OpenAI client but still makes OpenAI calls, you can instead set the environment variables:
export OPENAI_API_KEY=$(chariot token)
export OPENAI_API_BASE_URL="https://your-chariot-url.com/api/serve/.../openai/v1"
where ... should be filled in so that the full value matches your Chariot Inference Server's external base URL (the same value as model._external_base_url above).
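If you are unsure of the exact value, you can print your Inference Server's external base URL with the same SDK lookup used earlier:
from chariot.client import connect
from chariot.models import Model
connect()
model = Model(name="ModelName", project="ModelProject", start_server=False)
print(model._external_base_url)  # use this value for OPENAI_API_BASE_URL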