llama-cli is a straightforward golang CLI interface for [llama.cpp](https://github.com/ggerganov/llama.cpp), providing an OpenAI-compatible API with support for multiple models, and a command line interface that allows text generation using a GPT-based model like llama directly from the terminal. It is also compatible with the models supported by `llama.cpp`. You might need to convert older models to the new format; see [here](https://github.com/ggerganov/llama.cpp#using-gpt4all), for instance, to run `gpt4all`.
`llama-cli` doesn't shell-out: it uses https://github.com/go-skynet/go-llama.cpp, a golang binding of [llama.cpp](https://github.com/ggerganov/llama.cpp).
## Container images
`llama-cli` comes by default as a container image.
To begin, run:
```
docker run -ti --rm quay.io/go-skynet/llama-cli:v0.6 --instruction "What's an alpaca?" --topk 10000 --model ...
```
Where `--model` is the path of the model you want to use.
Note: you need to mount a volume to the docker container in order to load a model, for instance:
```
# assuming your model is in /path/to/your/models/foo.bin
docker run -v /path/to/your/models:/models -ti --rm quay.io/go-skynet/llama-cli:v0.6 --instruction "What's an alpaca?" --topk 10000 --model /models/foo.bin
```
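The same image also exposes an OpenAI-compatible HTTP API via the `api` subcommand (used again in the examples further down). A minimal sketch, reusing the volume mount from above and assuming the server listens on the default port 8080:

```
# start the OpenAI-compatible API server and expose it on port 8080
docker run -v /path/to/your/models:/models -p 8080:8080 -ti --rm quay.io/go-skynet/llama-cli:v0.6 api --model /models/foo.bin
```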
Once the server is running, you can start making requests to it over HTTP, using the OpenAI API.
### Supported OpenAI API endpoints
You can check out the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create).
The supported endpoints and parameters are listed below.
#### Chat completions
For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the messages in the request body:
```
curl --location --request POST 'http://localhost:8080/v1/chat/completions' --header 'Content-Type: application/json' --data-raw '{
    "model": "foo.bin",
    "messages": [{"role": "user", "content": "A long time ago in a galaxy far, far away"}],
    "temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`
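As a sketch of how these map onto the same request body (the values here are only illustrative, and `foo.bin` is the model mounted in the earlier examples):

```
curl --location --request POST 'http://localhost:8080/v1/chat/completions' --header 'Content-Type: application/json' --data-raw '{
    "model": "foo.bin",
    "messages": [{"role": "user", "content": "A long time ago in a galaxy far, far away"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 128
}'
```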
#### List models
You can list all the models available with:
```
curl http://localhost:8080/v1/models
```
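The response follows the OpenAI models-list shape; with a single `foo.bin` in your models path it would look roughly like this (exact fields may vary):

```
{
  "object": "list",
  "data": [
    {
      "id": "foo.bin",
      "object": "model"
    }
  ]
}
```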
## Web interface
A simple web interface is also available (for instance at http://localhost:8080/) and can be used as a playground.
Note: The API doesn't inject a template for talking to the instance, while the CLI does. You have to use a prompt similar to what's described in the stanford-alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release, for instance:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What's an alpaca?
### Response:
```
Note: You can use a default template for every model in your model path by creating a corresponding file with the `.tmpl` suffix. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl`, which will be used as a default prompt, for instance:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
```
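For example, assuming the model path used in the earlier examples, the template simply sits next to the model it applies to:

```
/path/to/your/models/foo.bin
/path/to/your/models/foo.bin.tmpl
```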
## Using other models

You can specify a model binary to be used for inference with `--model`.

13B and 30B alpaca models are known to work:

```
# Download the model image, extract the model
# Use the model with llama-cli
docker run -v $PWD:/models -p 8080:8080 -ti --rm quay.io/go-skynet/llama-cli:v0.6 api --model /models/model.bin
```
gpt4all (https://github.com/nomic-ai/gpt4all) works as well; however, the original model needs to be converted (the same applies to old alpaca models, too):