examples: add langchain-chroma example (#248)

2 years ago · 557ccc5ad8
parent 2488c445b6
commit 557ccc5ad8
9 changed files with 152 additions and 1 deletion
--- a/examples/README.md
+++ b/examples/README.md
@@ -65,7 +65,7 @@ Run a slack bot which lets you talk directly with a model
 [Check it out here](https://github.com/go-skynet/LocalAI/tree/master/examples/slack-bot/)
-### Question answering on documents
+### Question answering on documents with llama-index
 _by [@mudler](https://github.com/mudler)_
@@ -73,6 +73,14 @@ Shows how to integrate with [Llama-Index](https://gpt-index.readthedocs.io/en/st
 [Check it out here](https://github.com/go-skynet/LocalAI/tree/master/examples/query_data/)
 ### Question answering on documents with langchain and chroma
 _by [@mudler](https://github.com/mudler)_
 Shows how to integrate with `Langchain` and `Chroma` to enable question answering on a set of documents.
 [Check it out here](https://github.com/go-skynet/LocalAI/tree/master/examples/langchain-chroma/)
 ### Template for Runpod.io
 _by [@fHachenberg](https://github.com/fHachenberg)_
--- a/examples/langchain-chroma/README.md
+++ b/examples/langchain-chroma/README.md
@@ -0,0 +1,54 @@
 # Data query example
 This example makes use of [langchain and chroma](https://blog.langchain.dev/langchain-chroma/) to enable question answering on a set of documents.
 ## Setup
 Download the models and start the API:
 ```bash
 # Clone LocalAI
 git clone https://github.com/go-skynet/LocalAI
 cd LocalAI/examples/langchain-chroma
 wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert
 wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
 # start with docker-compose
 docker-compose up -d --build
 ```
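Once the containers are up, it can help to confirm what the scripts will be talking to. The following is a minimal sketch, assuming the API exposes an OpenAI-compatible `/embeddings` route under the base URL and that the model name `text-embedding-ada-002` resolves through `models/embeddings.yaml`. The helper `build_embeddings_request` is a hypothetical name introduced for illustration; it only builds the request and does not send it.

```python
import json
import os
import urllib.request


def build_embeddings_request(text, base_url=None, model="text-embedding-ada-002"):
    """Build (but do not send) a POST request for the embeddings route.

    Hypothetical helper: the route and payload shape assume an
    OpenAI-compatible API behind OPENAI_API_BASE.
    """
    base_url = base_url or os.environ.get("OPENAI_API_BASE", "http://localhost:8080/v1")
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embeddings_request("hello world", base_url="http://localhost:8080/v1")
print(req.get_method(), req.full_url)

# To actually send it (requires the API to be running), something like:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
```

If the request succeeds, the embeddings model is loaded and `store.py` should be able to reach it through the same base URL.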
 ### Python requirements
 ```
 pip install -r requirements.txt
 ```
 ### Create a storage
 In this step we create a local vector database from the document set, so that we can later ask the LLM questions about it.
 ```bash
 export OPENAI_API_BASE=http://localhost:8080/v1
 export OPENAI_API_KEY=sk-
 wget https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt
 python store.py
 ```
 After it finishes, a directory named `db` will be created containing the vector index database.
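`store.py` splits the document with `chunk_size=300` and `chunk_overlap=70` before embedding it. To make those two parameters concrete, here is a simplified sketch of fixed-window chunking with overlap. It is an illustration only: `sliding_chunks` is a hypothetical helper, and langchain's `CharacterTextSplitter` splits on a separator and merges pieces rather than slicing at fixed offsets.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=70):
    """Slice text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 700-character input so the window boundaries are easy to check.
sample = "".join(str(i % 10) for i in range(700))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.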
 ## Query
 We can now query the dataset. 
 ```bash
 export OPENAI_API_BASE=http://localhost:8080/v1
 export OPENAI_API_KEY=sk-
 python query.py
 # President Trump recently stated during a press conference regarding tax reform legislation that "we're getting rid of all these loopholes." He also mentioned that he wants to simplify the system further through changes such as increasing the standard deduction amount and making other adjustments aimed at reducing taxpayers' overall burden.    
 ```
 Keep in mind that results are currently hit or miss!
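Answer quality depends heavily on which chunks are retrieved and placed in the prompt. The sketch below illustrates the general idea behind `chain_type="stuff"` in `query.py`: rank stored chunks by similarity to the question, then concatenate the best ones into a single prompt. Everything here is an assumption made for illustration: the toy vectors, the cosine metric, the prompt wording, and the helper names `cosine`, `top_k`, and `stuff_prompt`. It is not langchain's or Chroma's implementation, and the store's actual distance metric may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def stuff_prompt(question, contexts):
    """Concatenate all retrieved chunks into one prompt (the 'stuff' idea)."""
    joined = "\n\n".join(contexts)
    return f"Use the context to answer.\n\n{joined}\n\nQuestion: {question}\nAnswer:"


# Toy (text, embedding) pairs; real embeddings come from the embeddings model.
store = [
    ("Taxes were discussed at length.", [1.0, 0.0, 0.1]),
    ("The weather was mild.", [0.0, 1.0, 0.0]),
    ("A new tax credit was proposed.", [0.9, 0.1, 0.0]),
]
hits = top_k([1.0, 0.0, 0.0], store, k=2)
prompt = stuff_prompt("What was said about taxes?", hits)
print(prompt)
```

If the retrieved chunks are off topic, the model has nothing relevant to draw on, which is one plausible reason the answers vary so much.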
--- a/examples/langchain-chroma/models/completion.tmpl
+++ b/examples/langchain-chroma/models/completion.tmpl
@@ -0,0 +1 @@
 {{.Input}}
--- a/examples/langchain-chroma/models/embeddings.yaml
+++ b/examples/langchain-chroma/models/embeddings.yaml
@@ -0,0 +1,5 @@
 name: text-embedding-ada-002
 parameters:
  model: bert
 backend: bert-embeddings
 embeddings: true
--- a/examples/langchain-chroma/models/gpt-3.5-turbo.yaml
+++ b/examples/langchain-chroma/models/gpt-3.5-turbo.yaml
@@ -0,0 +1,16 @@
 name: gpt-3.5-turbo
 parameters:
  model: ggml-gpt4all-j
  top_k: 80
  temperature: 0.2
  top_p: 0.7
 context_size: 1024
 stopwords:
 - "HUMAN:"
 - "GPT:"
 roles:
  user: " "
  system: " "
 template:
  completion: completion
  chat: gpt4all
--- a/examples/langchain-chroma/models/gpt4all.tmpl
+++ b/examples/langchain-chroma/models/gpt4all.tmpl
@@ -0,0 +1,4 @@
 The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
 ### Prompt:
 {{.Input}}
 ### Response:
--- a/examples/langchain-chroma/query.py
+++ b/examples/langchain-chroma/query.py
@@ -0,0 +1,31 @@
 import os
 from langchain.vectorstores import Chroma
 from langchain.embeddings import OpenAIEmbeddings
 from langchain.text_splitter import RecursiveCharacterTextSplitter,CharacterTextSplitter
 from langchain.llms import OpenAI
 from langchain.chains import VectorDBQA
 from langchain.document_loaders import TextLoader
 base_path = os.environ.get('OPENAI_API_BASE', 'http://localhost:8080/v1')
 # Load and process the text
 loader = TextLoader('state_of_the_union.txt')
 documents = loader.load()
 text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=70)
 texts = text_splitter.split_documents(documents)
 # Point at the embeddings that store.py persisted on disk
 # persist_directory must match the directory used in store.py
 persist_directory = 'db'
 embedding = OpenAIEmbeddings()
 # Load the persisted database from disk and use it as normal.
 vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
 qa = VectorDBQA.from_chain_type(llm=OpenAI(temperature=0, model_name="gpt-3.5-turbo", openai_api_base=base_path), chain_type="stuff", vectorstore=vectordb)
 query = "What did the president say about taxes?"
 print(qa.run(query))
--- a/examples/langchain-chroma/requirements.txt
+++ b/examples/langchain-chroma/requirements.txt
@@ -0,0 +1,4 @@
 langchain==0.0.160
 openai==0.27.6
 chromadb==0.3.21
 llama-index==0.6.2
--- a/examples/langchain-chroma/store.py
+++ b/examples/langchain-chroma/store.py
@@ -0,0 +1,28 @@
 import os
 from langchain.vectorstores import Chroma
 from langchain.embeddings import OpenAIEmbeddings
 from langchain.text_splitter import RecursiveCharacterTextSplitter,TokenTextSplitter,CharacterTextSplitter
 from langchain.llms import OpenAI
 from langchain.chains import VectorDBQA
 from langchain.document_loaders import TextLoader
 base_path = os.environ.get('OPENAI_API_BASE', 'http://localhost:8080/v1')
 # Load and process the text
 loader = TextLoader('state_of_the_union.txt')
 documents = loader.load()
 text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=70)
 #text_splitter = TokenTextSplitter()
 texts = text_splitter.split_documents(documents)
 # Embed and store the texts
 # Supplying a persist_directory will store the embeddings on disk
 persist_directory = 'db'
 embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
 vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)
 vectordb.persist()
 vectordb = None