
LLM Servers

There are two serving runtimes for Chariot LLMs: Hugging Face Pipelines and vLLM. Hugging Face Pipelines can run virtually every Text Generation and Conversational LLM, but they are not the most efficient in terms of memory management and throughput. vLLM, on the other hand, is a state-of-the-art LLM inference engine with many optimizations, such as PagedAttention, continuous batching, and optimized CUDA kernels, but it does not support every open-source LLM. To see whether your model is supported by vLLM, consult the vLLM documentation; in general, vLLM supports most popular LLM architectures and adds support for new ones frequently. The default inference engine for LLMs in Chariot is Hugging Face Pipelines.

Configuring vLLM

When creating a vLLM Inference Server for an LLM in Chariot, there are a few pre-defined vLLM configuration options that are available.

| Option name | Explanation | Possible values | Default value |
| --- | --- | --- | --- |
| bitsandbytes_4bit | Whether to load the model with in-flight 4-bit quantization | True or False | False |
| max_model_length | The desired context length for the model | Integer > 0 | Determined by model config |
| enable_prefix_caching | Whether to enable prefix caching | True or False | False |
| seed | Initial seed to use for generation | Integer > 0 | 0 |

More details about these options can be found in the vLLM documentation.

To start a vLLM Inference Server and configure it with one or more of these settings, use the vllm_config keyword:

from chariot.client import connect
from chariot.models import Model
connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)
vllm_config = {"max_model_length": 8000, "bitsandbytes_4bit": True}
model.start_inference_server(
    cpu="7",
    memory="57Gi",
    min_replicas=1,
    gpu_count=1,
    gpu_type="Tesla-V100-SXM2-16GB",
    inference_engine="vLLM",
    vllm_config=vllm_config
)

Configuring Hugging Face Pipelines

When creating a Hugging Face Pipeline Inference Server for Hugging Face models within Chariot, additional keyword arguments—called "kwargs"—can be provided to customize the model loading process. These keyword arguments are passed directly to the from_pretrained method when the model gets loaded for the pipeline.

In the Create Inference Server modal, a list of Hugging Face kwargs is provided at the bottom of the modal, along with each kwarg's value. To add a new kwarg, click the + Add new button and provide the necessary information.

Added kwargs may be removed by clicking the trash can icon to the right of the kwarg's value. Kwargs are retained once added, so when additional Inference Servers are created, previously added kwargs will be listed as available.

note

Keyword arguments supplied in this way are not validated, so be sure to type them correctly to prevent issues.

Once your Hugging Face kwargs have been added as intended, click the Submit button to finish creating your Inference Server.


By supplying your additional keyword arguments when creating your Inference Server, you can leverage the full flexibility of Hugging Face models in the Chariot platform and tailor them to your specific needs. For a comprehensive list of available kwargs, please refer to the from_pretrained Hugging Face docs for your specific model type.
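
The same kwargs can also be supplied through the SDK. The following is a minimal sketch, reusing the huggingface_model_kwargs keyword that the in-flight quantization section below also mentions; the torch_dtype and trust_remote_code entries are only illustrative from_pretrained kwargs, not required settings:

from chariot.client import connect
from chariot.models import Model
connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)
# These kwargs are forwarded to from_pretrained when the pipeline loads the model.
huggingface_model_kwargs = {"torch_dtype": "float16", "trust_remote_code": True}
model.start_inference_server(
    cpu="7",
    memory="57Gi",
    min_replicas=1,
    gpu_count=1,
    gpu_type="Tesla-V100-SXM2-16GB",
    inference_engine="Huggingface",
    huggingface_model_kwargs=huggingface_model_kwargs
)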

In-Flight Quantization

Depending on the GPUs available in your Chariot instance, your model may not fit in GPU memory at full precision, even for models as small as 3B parameters. To alleviate this, both of our serving runtimes support in-flight inference quantization via bitsandbytes. Quantizing a model's weights to lower precision allows the model to fit on a smaller GPU for inference.

note

Chariot does not yet support serving pre-quantized models, which are models whose weights have been quantized before upload. Examples of pre-quantization strategies include GGUF, AWQ, and GPTQ. Pre-quantized models typically have the quantization strategy in their name (e.g., Llama-3.2-3B-Instruct-GGUF), so you'll know if your model is pre-quantized. Therefore, when you upload an LLM to Chariot, make sure it is an unquantized model. You'll have the option to quantize it in flight when you start the Inference Server, as shown below.

The following is an example of how to start an Inference Server with in-flight quantization in the SDK:

from chariot.client import connect
from chariot.models import Model
connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)
model.start_inference_server(
    cpu="7",
    memory="57Gi",
    min_replicas=1,
    gpu_count=1,
    gpu_type="Tesla-V100-SXM2-16GB",
    quantization_bits=4,
    inference_engine="Huggingface"  # use inference_engine="vLLM" to use vLLM instead
)

Note that you could equivalently pass huggingface_model_kwargs = {"load_in_4bit": True} or vllm_config = {"bitsandbytes_4bit": True}, depending on which serving runtime (inference engine) you specify.
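
For example, a minimal sketch of the vLLM variant, reusing the resource settings from the example above:

from chariot.client import connect
from chariot.models import Model
connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)
model.start_inference_server(
    cpu="7",
    memory="57Gi",
    min_replicas=1,
    gpu_count=1,
    gpu_type="Tesla-V100-SXM2-16GB",
    inference_engine="vLLM",
    vllm_config={"bitsandbytes_4bit": True}  # in-flight 4-bit quantization via bitsandbytes
)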

Our Hugging Face Pipeline runtime supports 8-bit and 4-bit quantization, and vLLM supports 4-bit quantization. As a rough benchmark, an 8-bit quantized 7B model will need about 8-10GB of VRAM, and a 4-bit quantized 7B model will need about 5-6GB of VRAM. In addition, about 25Gi of CPU memory is needed (sometimes more, e.g., for 13B models).

note

Some LLMs are known to have quirks when you quantize them, which can result in errors. In our experience, 8-bit quantization can be more problematic than 4-bit quantization.

Performing Inferences

There are two task types for LLMs, so the way you should perform inference depends on the task type.

Text Generation Models

For Text Generation models, use the complete inference method, and simply pass the string you wish to complete:

from chariot.client import connect
from chariot.models import Model
connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)
# Assume inference server is already started.

response = model.complete("The cat in the")

print(response.choices[0].text) # "Hat"

Conversational Models

For Conversational models, use the chat inference method, and pass a sequence of user/assistant messages:

from chariot.client import connect
from chariot.models import Model
connect()

model = Model(name="MyLLM", project_name="MyProject", start_server=False)
# Assume inference server is already started.
messages = [
    {"role": "user", "content": "Can you help me with my geography homework?"},
    {"role": "assistant", "content": "Sure! What do you need help with?"},
    {"role": "user", "content": "I need to identify the similarities and differences between tributaries and estuaries"}
]

response = model.chat(messages)

print(response.choices[-1].message.content) # "Tributaries and estuaries are both important features ..."

For chat, the messages must alternate between user and assistant roles; otherwise, the call will raise an error.

In both cases, the response objects returned by complete and chat are compatible with the OpenAI API standard, meaning the response format is identical to what the OpenAI API returns.
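
For instance, continuing the Conversational example above and assuming the standard OpenAI-style completion fields (id, model, choices, usage), the response can be inspected like this:

response = model.chat(messages)

# Standard OpenAI-style fields (assumed from the API compatibility noted above)
print(response.id)                           # unique ID for this completion
print(response.model)                        # name of the model that served the request
print(response.choices[-1].message.content)  # the generated assistant message
print(response.usage.total_tokens)           # prompt tokens + completion tokens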

Inference Parameters

Inference parameters are also supported. For example, to use temperature = 0.2 for your inference call, pass it as a keyword argument:

model.chat(messages, temperature=0.2)

The available inference parameters for both Text Generation and Conversational task types are described below. Note that some options are only supported in vLLM runtimes.

| Parameter Name | Description |
| --- | --- |
| max_completion_tokens | Maximum number of new tokens for the model to generate. If using the Hugging Face engine, the default value is 50. |
| min_tokens | Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated. |
| presence_penalty | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens while values < 0 encourage the model to repeat tokens. |
| frequency_penalty | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens while values < 0 encourage the model to repeat tokens. |
| repetition_penalty | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens while values < 1 encourage the model to repeat tokens. |
| temperature | Float that controls the randomness of the sampling. Lower values make the model more deterministic while higher values make the model more random. Zero means greedy sampling. |
| top_p | Float that controls the cumulative probability of the top tokens to consider. Must be in [0, 1]. Set to 1 to consider all tokens. |
| top_k | Integer that controls the number of top tokens to consider. |
| min_p | Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this. |
| seed | Random seed to use for the generation. |
| use_beam_search | Whether to use beam search. |
| best_of | Number of output sequences that are generated from the prompt. From these best_of sequences, the best sequence is returned. best_of must be greater than or equal to the number of returned sequences. This is treated as the beam width when use_beam_search is True. By default, best_of is set to 1. |
| length_penalty | Float that penalizes sequences based on their length. Used in beam search. length_penalty > 0.0 promotes longer sequences while length_penalty < 0.0 encourages shorter sequences. |
| stop | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
| stop_token_ids | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. |
| include_stop_str_in_output | Whether to include the stop strings in output text. Defaults to False. |
| ignore_eos | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
| logprobs | Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. |
| top_logprobs | Number of log probabilities to return per prompt token. |
| do_sample | Whether or not to use sampling. Only a valid option for the Hugging Face engine. |
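
As an illustrative sketch, several of these parameters can be combined in a single chat call (the specific values below are arbitrary):

response = model.chat(
    messages,
    max_completion_tokens=200,  # cap the number of newly generated tokens
    temperature=0.7,            # moderate sampling randomness
    top_p=0.9,                  # nucleus sampling over the top 90% of probability mass
    seed=42,                    # make generation reproducible
    stop=["\n\n"]               # stop at the first blank line; stop strings are not returned
)

print(response.choices[-1].message.content)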