Build Your Own Private Chatbot: A Guide to Local Language Models
In today’s digital landscape, data privacy has become a critical concern for both businesses and individuals. One powerful way to address it is to deploy your own local language model (LLM). This guide walks you through creating a custom, locally hosted chatbot using Python 3, Ollama, and ChromaDB, an approach that maximizes customization, privacy, and data security.
Why Deploy a Local Language Model?
- Complete Customization: Gain full control over your model’s configuration, tailoring it precisely to your needs without dependence on third-party services.
- Enhanced Privacy: Protect sensitive data by keeping it within your local environment, eliminating the risks associated with transmitting information online. This is especially vital for organizations that handle confidential data.
- Data Security Assurance: Minimize security threats by storing training materials, like PDF documents, locally. This reduces exposure to external vulnerabilities.
- Full Control Over Data Management: Enjoy the freedom to manage and process data according to your preferences. You can embed proprietary information into a ChromaDB vector store, aligning perfectly with your operational standards.
- Offline Access: Ensure your chatbot is consistently available, even without an internet connection. This guarantees uninterrupted service, regardless of network availability.
This guide focuses on building a robust and secure local chatbot, prioritizing your privacy and control.
Understanding Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a cutting-edge technique that merges the power of information retrieval with text generation. The result is more accurate and contextually relevant responses from your language model.
What is RAG?
RAG is a hybrid model that boosts the capabilities of language models by incorporating an external knowledge base. The process involves two primary components:
- Retrieval: The model retrieves relevant documents or information snippets from an external source (like a database or vector store) based on the user’s query.
- Generation: The retrieved information is then fed into a generative language model, which produces a coherent and contextually appropriate response.
How Does RAG Work?
- Query Input: A user enters a question or query.
- Document Retrieval: The system searches an external knowledge base using the query, fetching the most relevant documents.
- Response Generation: The generative model processes the retrieved information, integrating it with its existing knowledge to create a detailed and accurate answer.
- Output: The final response, enriched with details from the knowledge base, is presented to the user.
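Before wiring up real components, it helps to see this flow as code. The following is a toy, self-contained sketch of the retrieve-then-generate loop: the keyword-overlap "retrieval" and string-building "generation" below are stand-ins for the vector search and LLM calls built later in this guide.
# A toy sketch of the RAG flow. The in-memory list, keyword-overlap retrieval,
# and string-based "generation" are placeholders for ChromaDB and a real LLM.
KNOWLEDGE_BASE = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    # Rank snippets by how many words they share with the question.
    words = set(question.lower().split())
    ranked = sorted(KNOWLEDGE_BASE, key=lambda s: -len(words & set(s.lower().split())))
    return ranked[:top_k]

def generate(question: str, context: list[str]) -> str:
    # A real system would send this prompt to a language model.
    prompt = f"Context: {' '.join(context)}\nQuestion: {question}"
    return f"[answer generated from prompt] {prompt}"

print(generate("What is the refund policy?", retrieve("What is the refund policy?")))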
Benefits of RAG
- Improved Accuracy: RAG models leverage external data to provide more precise and detailed answers, particularly beneficial for domain-specific questions.
- Contextual Relevance: The retrieval component ensures the generated response is grounded in relevant and current information, enhancing overall quality.
- Scalability: RAG systems can easily scale to include vast amounts of data, allowing them to handle a wide variety of queries.
- Flexibility: These models are adaptable to different domains by simply updating the external knowledge base.
Why Use RAG Locally?
- Privacy and Security: Running a RAG model locally ensures sensitive data stays secure and private, as it doesn’t need to be transmitted to external servers.
- Customization: You can tailor both the retrieval and generation processes to meet your specific requirements, including integrating proprietary data.
- Operational Independence: A local setup guarantees your system remains operational even without internet access, providing consistent service.
By deploying a local RAG application using tools like Python and ChromaDB, you gain the advantages of advanced language models while maintaining complete control over your data and customization options.
Setting Up Your Local Development Environment
To set up your local development environment, you can use a streamlined solution called ServBay. ServBay offers a comprehensive, one-click installation package designed for web, Python, AI, and PHP developers, optimized for macOS. It integrates essential development services and tools, such as web servers, databases, programming languages, mail servers, and queue services, into a unified and efficient development environment.
Key Features of ServBay
- Multiple Python Version Support: Run different Python versions concurrently to suit the needs of various projects.
- Custom Domain and SSL Configuration: Easily set up local domain names and SSL certificates to mimic real production environments.
- Quick Access and Management: Features startup on boot, menu bar access, and command-line management for improved efficiency.
- Unified Service Management: Integrates Python, PHP, Node.js, and Ollama for seamless management of multiple development services.
- Isolated Environments: Prevents system pollution by running services in isolated environments.
- Local Site Tunneling: Makes local websites reachable from outside your machine, so you can easily share development results with team members.
ServBay Installation Guide
- Requirements: macOS 12.0 Monterey or later.
Download the latest version of ServBay.
Installation Steps:
- Open the downloaded .dmg file by double-clicking it.
- Drag the ServBay.app icon into the Applications folder.
- On first launch, initialize ServBay. You can choose the default installation or optionally include Ollama for AI programming support.
- After installation, open ServBay.
- Enter your password to complete the installation.
- Access the main interface to manage your development environment.
ServBay not only supports Python but also provides robust support for PHP and Node.js across a wide range of versions, and lets you switch between them quickly. This is crucial for developers who need to test applications in different environments.
One-Click Installation Features
- All Python Versions: Easily install any Python version with a single click.
- All Ollama Models: Install various Ollama models effortlessly.
Prerequisites
Before you start, make sure you have the following:
- Python 3: A versatile programming language you’ll use to write the code for your RAG application.
- ChromaDB: A vector database to store and manage data embeddings.
- Ollama: A tool to download and serve custom LLMs on your local machine.
Step 1: Install Python 3 and Set Up Your Environment
- Install Python:
In ServBay, click the Python button and select a Python version. Then verify your Python 3 installation:
python3 --version # Expected output: Python 3.12.9 (or similar)
- Create Project Folder:
mkdir local-rag
cd local-rag
- Create a Virtual Environment:
python3 -m venv venv
- Activate the Virtual Environment:
source venv/bin/activate # On Windows: venv\Scripts\activate
Step 2: Install ChromaDB and Other Dependencies
- Install ChromaDB:
pip install --quiet chromadb
- Install LangChain Tools (the code in this guide imports from langchain_community, so that package is installed as well):
pip install --quiet unstructured langchain langchain-community langchain-text-splitters
pip install --quiet "unstructured[all-docs]"
- Install Flask and python-dotenv (used by app.py to load the .env file):
pip install --quiet flask python-dotenv
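As an optional sanity check, the short script below (standard library only) confirms that the packages installed above are importable and prints their versions; the package list simply mirrors the pip commands in this step.
# check_deps.py - optional sanity check that the core dependencies are installed.
from importlib.metadata import version, PackageNotFoundError

PACKAGES = [
    "chromadb", "langchain", "langchain-community",
    "langchain-text-splitters", "unstructured", "flask", "python-dotenv",
]

for package in PACKAGES:
    try:
        print(f"{package}: {version(package)}")
    except PackageNotFoundError:
        print(f"{package}: NOT INSTALLED")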
Step 3: Install Ollama
In ServBay, click the AI button and select a model to install. This guide later uses the mistral chat model and the nomic-embed-text embedding model.
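If you prefer to manage models outside the ServBay interface, they can also be pulled with Ollama's command-line tool. The optional helper below assumes the ollama CLI is installed and on your PATH; it simply shells out to ollama pull for the two models used later in this guide.
# pull_models.py - optional helper, assuming the `ollama` CLI is on your PATH.
import subprocess

MODELS = ["mistral", "nomic-embed-text"]  # chat model and text-embedding model

for model in MODELS:
    print(f"Pulling {model} ...")
    # Equivalent to running `ollama pull <model>` in a terminal.
    subprocess.run(["ollama", "pull", model], check=True)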
Building the RAG Application
With your environment set up, you can now build your custom local RAG application. This section provides the Python code and an overview of the application structure.
app.py
This is the main Flask application file. It defines routes for embedding files into the vector database and retrieving responses from the model.
import os
from dotenv import load_dotenv
from flask import Flask, request, jsonify
from embed import embed
from query import query
from get_vector_db import get_vector_db
# Load environment variables from the .env file
load_dotenv()

# Set up a temporary folder for uploaded files
TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')
os.makedirs(TEMP_FOLDER, exist_ok=True)

app = Flask(__name__)
@app.route('/embed', methods=['POST'])
def route_embed():
    if 'file' not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({"error": "No selected file"}), 400

    embedded = embed(file)
    if embedded:
        return jsonify({"message": "File embedded successfully"}), 200

    return jsonify({"error": "File embedded unsuccessfully"}), 400

@app.route('/query', methods=['POST'])
def route_query():
    data = request.get_json()
    response = query(data.get('query'))
    if response:
        return jsonify({"message": response}), 200

    return jsonify({"error": "Something went wrong"}), 400

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080, debug=True)
embed.py
This module handles the embedding process, including saving uploaded files, loading and splitting data, and adding documents to the vector database.
import os
from datetime import datetime
from werkzeug.utils import secure_filename
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from get_vector_db import get_vector_db
TEMP_FOLDER = os.getenv('TEMP_FOLDER', './_temp')
# Function to check if the uploaded file is allowed (only PDF files)
def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in {'pdf'}

# Function to save the uploaded file to the temporary folder
def save_file(file):
    # Save the uploaded file with a secure filename and return the file path
    timestamp = datetime.now().timestamp()
    filename = f"{timestamp}_{secure_filename(file.filename)}"
    file_path = os.path.join(TEMP_FOLDER, filename)
    file.save(file_path)
    return file_path

# Function to load and split the data from the PDF file
def load_and_split_data(file_path):
    # Load the PDF file and split the data into chunks
    loader = UnstructuredPDFLoader(file_path=file_path)
    data = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
    chunks = text_splitter.split_documents(data)
    return chunks

# Main function to handle the embedding process
def embed(file):
    # Check if the file is valid, save it, load and split the data, add to the database, and remove the temporary file
    if file.filename != '' and file and allowed_file(file.filename):
        file_path = save_file(file)
        chunks = load_and_split_data(file_path)
        db = get_vector_db()
        db.add_documents(chunks)
        db.persist()
        os.remove(file_path)
        return True

    return False
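To try the embedding pipeline outside of Flask, you can call the functions in this module directly on a PDF that is already on disk. The sketch below assumes Ollama is running with the embedding model pulled; the file path is a placeholder.
# test_embed.py - embeds a local PDF without going through the Flask endpoint.
from embed import load_and_split_data
from get_vector_db import get_vector_db

chunks = load_and_split_data("./docs/example.pdf")  # placeholder path
db = get_vector_db()
db.add_documents(chunks)
db.persist()
print(f"Embedded {len(chunks)} chunks into the vector store.")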
query.py
This module processes user queries by generating multiple versions of the query, retrieving relevant documents, and providing answers based on the context.
import os
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever
from get_vector_db import get_vector_db
LLM_MODEL = os.getenv('LLM_MODEL', 'deepseek-coder:1.3b')
# Function to get the prompt templates for generating alternative questions and answering based on context
def get_prompt():
    QUERY_PROMPT = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from
a vector database. By generating multiple perspectives on the user question, your
goal is to help the user overcome some of the limitations of the distance-based
similarity search. Provide these alternative questions separated by newlines.
Original question: {question}""",
    )

    template = """Answer the question based ONLY on the following context:
{context}
Question: {question}"""

    prompt = ChatPromptTemplate.from_template(template)

    return QUERY_PROMPT, prompt

# Main function to handle the query process
def query(input):
    if input:
        # Initialize the language model with the specified model name
        llm = ChatOllama(model=LLM_MODEL)
        # Get the vector database instance
        db = get_vector_db()
        # Get the prompt templates
        QUERY_PROMPT, prompt = get_prompt()

        # Set up the retriever to generate multiple queries using the language model and the query prompt
        retriever = MultiQueryRetriever.from_llm(
            db.as_retriever(),
            llm,
            prompt=QUERY_PROMPT
        )

        # Define the processing chain to retrieve context, generate the answer, and parse the output
        chain = (
            {"context": retriever, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        )

        response = chain.invoke(input)
        return response

    return None
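As with embed.py, this module can be exercised directly from a Python shell once Ollama is running and at least one document has been embedded. The question below is just an example.
# test_query.py - calls the query pipeline directly, without the Flask server.
from query import query

answer = query("What are the key points of the uploaded document?")  # example question
print(answer)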
get_vector_db.py
This module initializes and returns the vector database instance used for storing and retrieving document embeddings.
import os
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores.chroma import Chroma
CHROMA_PATH = os.getenv('CHROMA_PATH', 'chroma')
COLLECTION_NAME = os.getenv('COLLECTION_NAME', 'local-rag')
TEXT_EMBEDDING_MODEL = os.getenv('TEXT_EMBEDDING_MODEL', 'nomic-embed-text')
def get_vector_db():
    # Create an instance of the embedding model
    embedding = OllamaEmbeddings(model=TEXT_EMBEDDING_MODEL, show_progress=True)

    # Initialize the Chroma vector store with specified parameters
    db = Chroma(
        collection_name=COLLECTION_NAME,
        persist_directory=CHROMA_PATH,
        embedding_function=embedding
    )

    return db
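To verify what the retriever will actually see, you can also run a plain similarity search against the store using the similarity_search method of LangChain's Chroma wrapper, bypassing the LLM entirely. The example query below is a placeholder, and Ollama must be running so the query can be embedded.
# inspect_db.py - runs a raw similarity search against the Chroma store.
from get_vector_db import get_vector_db

db = get_vector_db()
results = db.similarity_search("refund policy", k=3)  # placeholder query
for doc in results:
    print(doc.page_content[:200], "...")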
Running Your Application
- Create a .env file to store your environment variables:
TEMP_FOLDER='./_temp'
CHROMA_PATH='chroma'
COLLECTION_NAME='local-rag'
LLM_MODEL='mistral'
TEXT_EMBEDDING_MODEL='nomic-embed-text'
- Run app.py to start your application server:
python3 app.py
- Interact with the Endpoints:
Once the server is running, you can make requests to the following endpoints:
- Embed a PDF file:
#!/bin/bash
curl --request POST \
  --url http://localhost:8080/embed \
  --header 'Content-Type: multipart/form-data' \
  --form file=@/path/to/your/file.pdf

# Expected response
# {
#   "message": "File embedded successfully"
# }
- Ask a question:
curl --request POST \
  --url http://localhost:8080/query \
  --header 'Content-Type: application/json' \
  --data '{ "query": "Your question here" }'

# Expected response
# {
#   "message": "Your answer"
# }
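The same endpoints can also be called from Python. The short client below uses the requests package (pip install requests if it is not already available); the file path and question are placeholders.
# client.py - minimal sketch of calling both endpoints from Python.
import requests

BASE_URL = "http://localhost:8080"

# Embed a PDF file
with open("/path/to/your/file.pdf", "rb") as f:
    print(requests.post(f"{BASE_URL}/embed", files={"file": f}).json())

# Ask a question
print(requests.post(f"{BASE_URL}/query", json={"query": "Your question here"}).json())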
Conclusion
By following this guide, you can run and interact with a custom local RAG application using Python, Ollama, and ChromaDB, tailored precisely to your requirements. You can adjust and expand the functionality as needed to further enhance your application’s capabilities. Local deployment not only safeguards sensitive information but also boosts performance and responsiveness. Whether you aim to improve customer interactions or streamline internal processes, a locally deployed RAG application offers the flexibility and robustness to adapt and grow with your needs.
How Innovative Software Technology Can Help
At Innovative Software Technology, we specialize in developing cutting-edge, secure, and customized software solutions tailored to meet your specific business needs. Leveraging advanced technologies like local language models (LLMs) and Retrieval-Augmented Generation (RAG), we empower your business with enhanced data privacy, SEO optimization, and operational efficiency. Our expertise in deploying and managing local RAG applications ensures that your sensitive data remains secure, while also providing top-tier search engine optimization to improve your online visibility. Partner with us to transform your data management and customer interaction strategies, ensuring compliance, security, and a competitive edge in the digital marketplace.