Run Llama 2 with Retrieval Augmented Generation in Google Colab with GPUs


Running GenAI models on Colab with its free GPUs is a real advantage for GenAI developers: models execute much faster than on a personal computer without a powerful GPU, so you can test more ideas in the same amount of time.

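Before running anything heavy, it's worth confirming that your Colab runtime actually has a GPU attached (Runtime > Change runtime type > GPU). A quick check, assuming a standard Colab environment:

# Check which GPU is attached to this Colab runtime (e.g. T4, V100, A100)
!nvidia-smi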

This post will show you how you can:

  • Load the Llama 2 GGUF model from Hugging Face
  • Run Llama 2 with GPUs
  • Create a vector store using Pinecone
  • Perform question answering with Retrieval Augmented Generation (RAG)

1 Dependencies

Firstly, install Python dependencies as below:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

!pip install huggingface_hub chromadb langchain sentence-transformers pinecone-client

Then import dependencies as below:

import numpy as np
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

from langchain.llms import LlamaCpp
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate

Then mount Google Drive to load the NBA player sample data shared by Meta in the llama-recipes repo. This dataset will be used to create the vector store:

from google.colab import drive
drive.mount('/content/drive')

source_text_file = '/content/drive/MyDrive/Research/Data/GenAI/nba.txt'
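The path above is specific to my own Drive; adjust it to wherever you saved nba.txt. To sanity-check the mount, you can preview the first few lines of the file:

# Preview the first few lines of the NBA player sample data
with open(source_text_file) as f:
    for _ in range(5):
        print(f.readline().rstrip())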

2 Load Llama 2 from Hugging Face

Firstly, create a callback manager for streaming the generated text, then download the GGUF model file from Hugging Face:

# for token-wise streaming, so you'll see the answer generated token by token while Llama answers your question
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Download the model
!wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf

Then specify the model path to be loaded into LlamaCpp:

model_path = 'llama-2-7b-chat.Q5_0.gguf'

Specify the GPU settings:

n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

Next, let's load the model using LangChain as below:

from langchain.llms import LlamaCpp
llm = LlamaCpp(
    model_path=model_path,
    temperature=0.0,
    top_p=1,
    n_ctx=16000,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
💡
Be sure to set n_gpu_layers and n_batch; if GPU offloading is configured correctly, the model load output will show BLAS = 1.
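Before adding retrieval, you can sanity-check that the model loads and generates text with a direct call (the prompt below is just an illustrative example):

# Quick smoke test: the streaming callback prints tokens as they are generated
print(llm("Q: Name one planet in the solar system. A:"))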

3 RAG

Retrieval Augmented Generation (RAG) is important because it addresses key limitations of large language models (LLMs). Here's why:

  • Factual Accuracy: LLMs can be creative and articulate, but they aren't always truthful. RAG integrates external knowledge sources, ensuring generated responses are grounded in real facts.
  • Reduced Hallucinations: LLMs can sometimes invent information or make false claims. RAG combats hallucinations by providing LLMs with reliable context from external sources.
  • Domain Expertise: LLMs struggle with specialized topics. RAG allows them access to specific knowledge bases, like medical journals or legal documents, enhancing their responses in niche areas.
  • Transparency and Trust: RAG systems can show their work! Users can see the sources used to generate responses, building trust and enabling fact-checking.

In short, RAG makes LLMs more reliable, accurate, and versatile, opening doors for their use in areas like education, legal advice, and scientific research. It's a crucial step towards trustworthy and grounded AI.
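Conceptually, the retrieval step boils down to embedding the question, finding the most similar chunks in a vector store, and putting them into the prompt. Here is a minimal, framework-free sketch, where embed, search and generate are hypothetical placeholders for whatever embedding model, vector store and LLM you use:

def answer_with_rag(question, embed, search, generate, k=3):
    # 1. Embed the question into the same vector space as the document chunks
    query_vector = embed(question)
    # 2. Retrieve the k most similar chunks from the vector store
    context = "\n".join(search(query_vector, k=k))
    # 3. Ground the LLM by putting the retrieved context into the prompt
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + context + "\n\n"
        "Question: " + question + "\nAnswer:"
    )
    return generate(prompt)

The LangChain code in the rest of this section implements exactly this pattern, with Pinecone as the vector store and Llama 2 as the generator.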

3.1 Initialize Pinecone

Let's import a few related packages and initialize Pinecone, a vector store provider.

💡
Quick start for Pinecone setup: https://docs.pinecone.io/docs/quickstart

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

from langchain.embeddings import HuggingFaceEmbeddings

Set your Pinecone API key and environment (fill in your own values):

PINECONE_API_KEY = ''
PINECONE_ENV = ''

And initialize it:

import pinecone
from langchain.vectorstores import Pinecone

# Initialize Pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  
    environment=PINECONE_ENV  
)

pinecone_index_nm = 'qabot'
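The index has to exist in your Pinecone project before you can write to it. If you haven't created it in the console yet, you can create it from code; note that the dimension must match your embedding model (HuggingFaceEmbeddings defaults to sentence-transformers/all-mpnet-base-v2, which produces 768-dimensional vectors). A sketch using the same pinecone-client API as above:

# Create the index once if it doesn't already exist
if pinecone_index_nm not in pinecone.list_indexes():
    pinecone.create_index(
        name=pinecone_index_nm,
        dimension=768,   # must match the embedding model's output size
        metric='cosine'
    )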

3.2 Create a Vector Store

Firstly, let's load the embedding model, then load the data and split it into chunks:

embeddings = HuggingFaceEmbeddings()

# Load the document and split it into chunks; embedding and upload to the vector store happen in the next step
raw_documents = TextLoader(source_text_file).load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
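It can be helpful to check how many chunks the splitter produced before pushing them to Pinecone:

# Inspect the chunking result
print(f"{len(documents)} chunks created")
print(documents[0].page_content[:200])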

Then create the vector store:

# Send embedding vectors to Pinecone with Langchain

vstore = Pinecone.from_documents(documents, embeddings, index_name=pinecone_index_nm)
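Before wiring the store into a chain, you can query it directly to confirm the embeddings landed in the index (the query string below is just an example):

# Retrieve the single most similar chunk for a test query
matches = vstore.similarity_search("Atlanta Hawks roster", k=1)
print(matches[0].page_content[:300])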

3.3 RAG

We then use LangChain's RetrievalQA chain to retrieve relevant documents from the vector store and pass them to Llama 2 as additional context, effectively extending the model's knowledge.

# use another LangChain's chain, RetrievalQA, to associate Llama with the loaded documents stored in the vector db
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vstore.as_retriever(search_kwargs={"k": 1})
)

Then the model is ready for your questions:

question = "Who is the tallest in Atlanta Hawks"
result = qa_chain({"query": question})

The response looks like this:

{'query': 'Who is the tallest in Atlanta Hawks',
 'result': ' The tallest player on the Atlanta Hawks roster is Saddiq Bey at 6\'7".'}
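If you also want to see which chunk the answer was grounded in (the transparency point from earlier), RetrievalQA can return its source documents:

# Rebuild the chain so it also returns the retrieved chunks
qa_chain_src = RetrievalQA.from_chain_type(
    llm,
    retriever=vstore.as_retriever(search_kwargs={"k": 1}),
    return_source_documents=True,
)

result = qa_chain_src({"query": question})
print(result["source_documents"][0].page_content[:300])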

4 Conclusion

Using the open-source Llama 2 model with RAG, you can create a robust chatbot tailored to your own domain knowledge. This is highly beneficial for enterprise users, since keeping everything in-house, at least in theory, mitigates privacy concerns and data leaks.

However, there's still more to uncover in our quest to construct a secure and responsible GenAI app at the enterprise level. Stay tuned for further updates.

Reference

LangChain - Llama.cpp: the llama-cpp-python integration page in the LangChain documentation.