Update README

add/first-example
mudler 2 years ago
parent 5556aa46dd
commit 7858a97254
      README.md

## :camel: llama-cli
llama-cli is a straightforward golang CLI interface for [llama.cpp](https://github.com/ggerganov/llama.cpp), providing an OpenAI-compatible API with support for multiple models, and a command line interface that lets you generate text with a GPT-based model like llama directly from the terminal. It is also compatible with the models supported by `llama.cpp`. You might need to convert older models to the new format; see [here](https://github.com/ggerganov/llama.cpp#using-gpt4all) for instance to run `gpt4all`.
`llama-cli` doesn't shell out; it uses https://github.com/go-skynet/go-llama.cpp, a golang binding of [llama.cpp](https://github.com/ggerganov/llama.cpp).
## Container images
`llama-cli` comes by default as a container image.
To begin, run:
```
docker run -ti --rm quay.io/go-skynet/llama-cli:v0.6 --instruction "What's an alpaca?" --topk 10000 --model ...
```
Where `--model` is the path of the model you want to use.
Note: you need to mount a volume into the docker container in order to load a model, for instance:
```
# assuming your model is in /path/to/your/models/foo.bin
docker run -v /path/to/your/models:/models -ti --rm quay.io/go-skynet/llama-cli:v0.6 --instruction "What's an alpaca?" --topk 10000 --model /models/foo.bin
```
You will receive a response like the following:
| Parameter | Environment Variable | Default Value | Description |
| ------------ | -------------------- | ------------- | -------------------------------------- |
| top_p | TOP_P | 0.85 | The cumulative probability for top-p sampling. |
| top_k | TOP_K | 20 | The number of top-k tokens to consider for text generation. |
| context-size | CONTEXT_SIZE | 512 | Default token context size. |
Here's an example of using `llama-cli`:
```
llama-cli --model ~/ggml-alpaca-7b-q4.bin --instruction "What's an alpaca?"
```
This will generate text based on the given model and instruction.
## API
`llama-cli` also provides an API for running text generation as a service. Models are loaded the first time they are used and then kept in memory.
Example of starting the API with `docker`:
```bash
docker run -p 8080:8080 -ti --rm quay.io/go-skynet/llama-cli:v0.6 api --models-path /path/to/models --context-size 700 --threads 4
```
And you'll see:
└───────────────────────────────────────────────────┘
```
Note: models must end with the `.bin` extension.
You can control the API server options with command line arguments:
```
llama-cli api --models-path <model_path> [--address <address>] [--threads <num_threads>]
```
The API takes the following parameters:
| Parameter | Environment Variable | Default Value | Description |
| ------------ | -------------------- | ------------- | -------------------------------------- |
| models-path | MODELS_PATH | | The path to the directory containing your models (files ending with `.bin`). |
| threads | THREADS | CPU cores | The number of threads to use for text generation. |
| address | ADDRESS | :8080 | The address and port to listen on. |
| context-size | CONTEXT_SIZE | 512 | Default token context size. |
Once the server is running, you can start making requests to it over HTTP using the OpenAI API.
### Supported OpenAI API endpoints
You can check out the [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create).
Below is the list of supported endpoints and parameters.
#### Chat completions
For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the chat messages in the request body:
```
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-koala-7b-model-q4_0-r2.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.7
   }'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`
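If you prefer calling the endpoint from Go rather than from the shell, here is a minimal sketch using only the standard library; it simply mirrors the curl example above (the model name and port are taken from that example, so substitute whatever `.bin` file you actually have in your models path):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Same request as the curl example above; the model name is only an
	// illustration, use any .bin file available under --models-path.
	payload := map[string]interface{}{
		"model": "ggml-koala-7b-model-q4_0-r2.bin",
		"messages": []map[string]string{
			{"role": "user", "content": "Say this is a test!"},
		},
		"temperature": 0.7,
	}

	body, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}

	resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The server answers with an OpenAI-style JSON object; print it as-is.
	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```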
#### Completions
For example, to generate a completion, you can send a POST request to the `/v1/completions` endpoint with the prompt in the request body:
```
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`
#### List models
You can list all the models available with:
```
curl http://localhost:8080/v1/models
```
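Because the API mimics OpenAI's, existing OpenAI client libraries can usually be pointed at the local server by overriding their base URL. As a rough sketch (assuming the community `github.com/sashabaranov/go-openai` client, which is not part of this project), that might look like:

```go
package main

import (
	"context"
	"fmt"

	openai "github.com/sashabaranov/go-openai"
)

func main() {
	// The local server does not check the API key, but the client wants one.
	cfg := openai.DefaultConfig("not-needed")
	// Point the client at llama-cli instead of api.openai.com.
	cfg.BaseURL = "http://localhost:8080/v1"
	client := openai.NewClientWithConfig(cfg)

	resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
		// Use any .bin model available under --models-path.
		Model: "ggml-koala-7b-model-q4_0-r2.bin",
		Messages: []openai.ChatCompletionMessage{
			{Role: openai.ChatMessageRoleUser, Content: "Say this is a test!"},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```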
## Web interface
There is also a simple web interface available (for instance at http://localhost:8080/) which can be used as a playground.
Note: The API doesn't inject a template for talking to the instance, while the CLI does. You have to use a prompt similar to what's described in the stanford_alpaca docs: https://github.com/tatsu-lab/stanford_alpaca#data-release, for instance:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
```
Note: you can set a default template for every model in your models path by creating a corresponding file with the `.tmpl` suffix. For instance, if the model is called `foo.bin`, you can create a sibling file `foo.bin.tmpl` which will be used as the default prompt, for example:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
```
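The `{{.Input}}` placeholder above is Go `text/template` syntax. As a rough illustration only (a sketch of the idea, not the project's actual implementation; the `Input` field name is taken from the template above), the substitution works roughly like this:

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	// The same default prompt shown above, with {{.Input}} as the slot for
	// the user's text.
	const prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
`
	tmpl := template.Must(template.New("prompt").Parse(prompt))

	// Rendering the template produces the final prompt sent to the model.
	if err := tmpl.Execute(os.Stdout, struct{ Input string }{Input: "What's an alpaca?"}); err != nil {
		panic(err)
	}
}
```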
## Using other models
gpt4all (https://github.com/nomic-ai/gpt4all) works as well; however, the original model needs to be converted first (the same applies to old alpaca models, too):
func main() {
	cli := client.NewClient("http://ip:port")
	out, err := cli.Predict("What's an alpaca?")
	if err != nil {
## Short-term roadmap
- [x] Mimic OpenAI API (https://github.com/go-skynet/llama-cli/issues/10)
- Binary releases (https://github.com/go-skynet/llama-cli/issues/6)
- Upstream our golang bindings to llama.cpp (https://github.com/ggerganov/llama.cpp/issues/351)
- [x] Multi-model support
- Have a webUI!
## License
