Uploading LLMs
Most LLMs that you can find on the Hugging Face Hub under the "Text Generation" task can be uploaded to Chariot. Currently, only models up to roughly 13B parameters will fit on a single consumer-grade GPU, and even then quantization is required (see Inference Servers).
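As a rough sanity check on that limit, the memory needed just to hold the weights can be estimated from the parameter count and the bytes stored per parameter. This is a back-of-the-envelope sketch only; it ignores the KV cache and activation memory, which add further overhead at inference time:

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Approximate GPU memory (GB) needed to hold model weights alone."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# 13B parameters at common precisions:
for label, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"13B @ {label}: ~{weight_memory_gb(13, nbytes):.1f} GB")
```

At fp16, 13B parameters alone need about 26 GB, which already exceeds a consumer GPU's memory; 4-bit quantization brings the weights down to roughly 6.5 GB.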
LLM Task Types
Chariot has two task types associated with LLMs:
- Text Generation: Tasks consist of completing sentences or prompts provided by a user. For example, given a prompt of "The Cat in the", the model may predict something like "Hat."
- Conversational: These tasks are chat models where the interface is a conversation between a user and an AI assistant.
Every Text Generation model on the Hugging Face Hub can be uploaded as a Text Generation model in Chariot. If your model happens to be an "instruct model"—also known as a "chat model"—it can also be uploaded as a Conversational model in Chariot. Verify that your model is an instruct model before uploading it as a Conversational model; otherwise, you will get an error at inference time.
For example, Llama 3 8B is a Text Generation model, and Llama 3 8B Instruct is the instruction-tuned variant. The former should be uploaded as a Text Generation task and the latter uploaded as a Conversational task.
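The difference between the two task types shows up in the inputs each expects: a Text Generation model consumes a raw string prompt, while a Conversational model consumes role-tagged messages that are rendered through the model's chat template. The sketch below is illustrative only; real chat templates use model-specific special tokens (Llama 3, for instance, uses header markers like `<|start_header_id|>`), and the `render` helper here is hypothetical:

```python
# Text Generation input: a raw prompt string to complete.
prompt = "The Cat in the"

# Conversational input: a list of role-tagged messages.
messages = [
    {"role": "user", "content": "Who wrote The Cat in the Hat?"},
]

def render(messages: list[dict]) -> str:
    """Hypothetical, simplified chat template: wrap each turn in role
    tags, then open an assistant turn for the model to complete."""
    turns = "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages)
    return turns + "<assistant>"

print(render(messages))
```

A base model like Llama 3 8B has never been trained on such role-structured input, which is why serving it as Conversational fails at inference time.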
How to Upload LLMs
LLMs can be uploaded to Chariot using the same methods as any other Hugging Face model:
- Direct import from Hugging Face Hub: Chariot downloads the model for you. This is especially beneficial for LLMs, as it avoids pulling gigabytes of data down to your local machine and pushing them back up.
- Download to local machine and upload: For cases where you need to modify files or work offline.
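For the local-download route, one way to fetch a full model repository is `huggingface_hub.snapshot_download`. The `download_model` wrapper below is just an illustrative convenience, not part of any SDK:

```python
def download_model(repo_id: str, local_dir: str) -> str:
    """Download every file in a Hub model repo to local_dir.

    For an LLM this can be tens of gigabytes, so ensure you have
    disk space and a fast connection.
    """
    # Deferred import so the helper is importable without the package.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    # Gated repos such as Llama 3 also require a Hugging Face access token.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Example (uncomment to run; downloads ~16 GB):
# download_model("meta-llama/Meta-Llama-3-8B-Instruct", "./llama-3-8b-instruct")
```

Once downloaded, the files can be modified as needed and then uploaded to Chariot.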
Uploading via the SDK can take a long time for large models like LLMs. To monitor upload progress, enable debug logging at the start of your session:
import logging

# Upload progress messages appear in the debug-level logs.
logging.basicConfig(level=logging.DEBUG)