Run Llama 2 with Retrieval Augmented Generation in Google Colab with GPUs
Running GenAI models on Colab with its free GPUs is advantageous for GenAI developers: models execute much faster than on a personal computer without a powerful GPU, so you can test more ideas in the same amount of time.
This post will show you how you can:
- Load a Llama 2 GGUF model from HuggingFace
- Run Llama 2 with GPUs
- Create a vector store using Pinecone
- Perform question answering using Retrieval Augmented Generation (RAG)
1 Dependencies
Firstly, install Python dependencies as below:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
!pip install huggingface_hub chromadb langchain sentence-transformers pinecone_client
Then import dependencies as below:
import numpy as np
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
from langchain.llms import LlamaCpp
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
Then mount Google Drive to load the NBA player sample data shared by Meta in the llama-recipes repo. This dataset will be used to create the vector store:
from google.colab import drive
drive.mount('/content/drive')
source_text_file = '/content/drive/MyDrive/Research/Data/GenAI/nba.txt'
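Optionally, you can preview the file to confirm the Drive mount and the path are correct; the path above is specific to my Drive, so adjust it to wherever you saved nba.txt:
# Optional sanity check: print the first few hundred characters of the source file
with open(source_text_file) as f:
    print(f.read()[:500])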
2 Load Llama 2 from HuggingFace
Firstly, create a callback manager for streaming text output, and download the quantized GGUF model file from HuggingFace:
# For token-wise streaming: you'll see the answer generated token by token as Llama answers your question
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Download the model
!wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf
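Alternatively, since hf_hub_download is already imported, the same GGUF file can be fetched through the Hugging Face Hub client. A sketch that mirrors the repo and filename from the wget URL above; it returns the cached file path, which you could pass to LlamaCpp instead of the local filename below:
# Alternative to wget: download via the Hub client and keep the returned cache path
gguf_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GGUF",
    filename="llama-2-7b-chat.Q5_0.gguf",
)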
Then specify the model path to be loaded into LlamaCpp:
model_path = 'llama-2-7b-chat.Q5_0.gguf'
Specify the GPU settings:
n_gpu_layers = 40 # Change this value based on your model and your GPU VRAM pool.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
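Before settling on these values, it can help to check which GPU Colab assigned to your runtime and how much VRAM it has:
!nvidia-smi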
Next, let's load the model using LangChain as below:
from langchain.llms import LlamaCpp
llm = LlamaCpp(
    model_path=model_path,
    temperature=0.0,
    top_p=1,
    n_ctx=16000,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
With n_gpu_layers and n_batch set as above, the model-loading output shows BLAS = 1 if GPU offloading is configured correctly.
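Before adding retrieval, you can sanity-check the loaded model with a plain prompt. A minimal sketch using the PromptTemplate and LLMChain imports from earlier; the question is just an example:
# Quick test of the raw model (no retrieval yet); the streamed answer prints token by token
prompt = PromptTemplate(
    input_variables=["question"],
    template="Question: {question}\n\nAnswer:",
)
test_chain = LLMChain(llm=llm, prompt=prompt)
test_chain.run(question="Who won the NBA championship in 2020?")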
3 RAG
Retrieval Augmented Generation (RAG) is important because it addresses key limitations of large language models (LLMs). Here's why:
- Factual Accuracy: LLMs can be creative and articulate, but they aren't always truthful. RAG integrates external knowledge sources, ensuring generated responses are grounded in real facts.
- Reduced Hallucinations: LLMs can sometimes invent information or make false claims. RAG combats hallucinations by providing LLMs with reliable context from external sources.
- Domain Expertise: LLMs struggle with specialized topics. RAG allows them access to specific knowledge bases, like medical journals or legal documents, enhancing their responses in niche areas.
- Transparency and Trust: RAG systems can show their work! Users can see the sources used to generate responses, building trust and enabling fact-checking.
In short, RAG makes LLMs more reliable, accurate, and versatile, opening doors for their use in areas like education, legal advice, and scientific research. It's a crucial step towards trustworthy and grounded AI.
3.1 Initialize Pinecone
Let's import a few related packages and initialize Pinecone, a vector store provider.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
Get your Pinecone API key and environment from the Pinecone console, and fill them in below:
PINECONE_API_KEY = ''
PINECONE_ENV = ''
And initialize it:
import pinecone
from langchain.vectorstores import Pinecone
# Initialize Pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)
pinecone_index_nm = 'qabot'
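If the qabot index doesn't exist in your Pinecone project yet, create it before loading documents. A sketch assuming the default HuggingFaceEmbeddings model (sentence-transformers/all-mpnet-base-v2), which produces 768-dimensional vectors:
# Create the index once if it is not already there; dimension must match the embedding size
if pinecone_index_nm not in pinecone.list_indexes():
    pinecone.create_index(name=pinecone_index_nm, dimension=768, metric='cosine')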
3.2 Create a Vector Store
Firstly, let's set up the embeddings, then load the source document from the mounted Drive and split it into chunks:
embeddings = HuggingFaceEmbeddings()
# Load the document and split it into chunks; each chunk is embedded and sent to Pinecone in the next step
raw_documents = TextLoader(source_text_file).load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
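It's worth checking how many chunks the splitter produced before sending them to Pinecone:
# Inspect the chunking result
print(f"{len(documents)} chunks")
print(documents[0].page_content[:200])  # peek at the first chunk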
Then create the vector store:
# Send embedding vectors to Pinecone with Langchain
vstore = Pinecone.from_documents(documents, embeddings, index_name=pinecone_index_nm)
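You can verify the index is populated with a quick similarity search before wiring it to the LLM; the query string is just an example:
# Retrieve the single closest chunk for a test query
docs = vstore.similarity_search("Atlanta Hawks roster", k=1)
print(docs[0].page_content[:200])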
3.3 RAG
We then use RetrievalQA to fetch relevant documents from the vector database and pass them to Llama 2 as additional context, effectively extending its knowledge.
# use another LangChain's chain, RetrievalQA, to associate Llama with the loaded documents stored in the vector db
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vstore.as_retriever(search_kwargs={"k": 1})
)
Then the model is ready for your questions:
question = "Who is the tallest in Atlanta Hawks"
result = qa_chain({"query": question})
The response looks like:
{'query': 'Who is the tallest in Atlanta Hawks',
'result': ' The tallest player on the Atlanta Hawks roster is Saddiq Bey at 6\'7".'}
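To see which chunk grounded the answer (the transparency benefit mentioned earlier), RetrievalQA can also return its source documents via the return_source_documents flag; a sketch:
# Same chain, but also returning the retrieved chunk(s) used to produce the answer
qa_chain_sources = RetrievalQA.from_chain_type(
    llm,
    retriever=vstore.as_retriever(search_kwargs={"k": 1}),
    return_source_documents=True,
)
result = qa_chain_sources({"query": question})
print(result["source_documents"][0].page_content[:200])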
4 Conclusion
Utilizing the open-source Llama 2 model with RAG, you can create a robust chatbot tailored to your domain knowledge. This is particularly beneficial for enterprise users, since, in principle, everything can run in-house, avoiding privacy concerns and data leaks.
However, there's still more to uncover in our quest to construct a secure and responsible GenAI app at the enterprise level. Stay tuned for further updates.
Reference
LangChain - llama.cpp