How to Prompt Correctly with Llama 2?

Photo by Vincent Yuan @USA / Unsplash

You may have encountered instances where Llama 2 returns irrelevant, redundant, or even potentially harmful responses. Such outcomes can be perplexing and may lead users to disengage. A common contributing factor is incorrect use of prompts. This post therefore introduces best practices for prompting when developing GenAI apps with Llama 2.

The sample code runs on Google Colab with GPUs; please see the post below for the GPU configuration of Llama 2.

Run Llama 2 with Retrieval Augmented Generation in Google Colab with GPUs
Run Llama2 with RAG in Google Colab.

This post will show:

  • Run Llama 2 with GPUs
  • Comparison of different prompts and their impact on Llama 2's responses
  • Prompt design for chat, with awareness of historical messages

1 Get Llama 2 Ready

First, install the Python dependencies, then download and load the Llama 2 model. This part is identical to the post referenced above, so the details are not repeated here.

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

!pip install huggingface_hub chromadb langchain sentence-transformers pinecone_client

import numpy as np
import pandas as pd

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

from langchain.llms import LlamaCpp
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate

# Vector store
from langchain.document_loaders import CSVLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

# Show result
import markdown

!wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf

# for token-wise streaming so you'll see the answer gets generated token by token when Llama is answering your question
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llama_model_path = 'llama-2-7b-chat.Q5_0.gguf'

n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

from langchain.llms import LlamaCpp
llm = LlamaCpp(
    model_path=llama_model_path,
    temperature=0.1,
    top_p=1,
    n_ctx=16000,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
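
Before moving on to prompt design, a quick smoke test helps confirm that the model loaded correctly and that tokens stream back from the GPU. This is just a minimal sanity check; the question itself is arbitrary.

# Quick sanity check: any short question will do.
# The answer streams to stdout thanks to the callback manager configured above.
print(llm.invoke("Name three colors commonly seen in autumn leaves."))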

2 Impact of Different Prompts

It is quite remarkable how slightly different prompts can lead to very different responses, as the simple tests below show.

2.1 Just Ask Questions

For instance, the most straightforward way is simply to ask for what you want:

Testing_message = "The Stoxx Europe 600 index slipped 0.5% at the close, extending a lackluster start to the year."

# Use LangChain's PromptTemplate and LLMChain
prompt = PromptTemplate.from_template(
    "Extract the named entity information from below text: {text}"
)

chain = LLMChain(llm=llm, prompt=prompt)
answer = chain.invoke(Testing_message)

The answer looks like this:

 The index has fallen 3.7% since the beginning of January and is down 12.9% from its peak in August last year.
Please provide the named entities as follows:
1. Stoxx Europe 600
2. index
3. Europe
4. January
5. August

As you can see, Llama 2 first continues the sentence with additional made-up details and only then answers the question. This is not what users expect, and the output feels somewhat out of control.

2.2 Prompt with System Message

By slightly adjusting the prompt, the response becomes much more reasonable.

prompt = PromptTemplate.from_template(
    "[INST]Extract the important Named Entity Recoginiton information from this text: {text}, do not add unrelated content in the reply.[/INST]"
)
chain = LLMChain(llm=llm, prompt=prompt)
answer = chain.invoke(Testing_message)

The response becomes:

  Sure! Here are the important named entities recognized in the given text:

1. Stoxx Europe 600 - Index
2. Europe - Continent

Now the model no longer alters the sentence and only answers the question the user asks. This version works better simply because of the addition of [INST] and [/INST] to the prompt. These are special tokens used during the model's training process, described in the Llama 2 paper, which help the model understand the structure of the conversation.
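
To see the effect of these tags in isolation, you can also call the model directly and compare a bare prompt with the same prompt wrapped in [INST] and [/INST]. This is a minimal sketch reusing the llm object and Testing_message defined above; the exact wording is arbitrary.

question = "Extract the named entity information from below text: " + Testing_message

# Bare prompt: the model tends to continue the sentence before answering.
print(llm.invoke(question))

# Wrapped in [INST] ... [/INST]: the text is treated as a user turn, so the model just answers.
print(llm.invoke("[INST] " + question + " [/INST]"))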

There is also a more flexible way to do this, which adds a customizable system message:

# creating prompt for large language model
pre_prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.

If you cannot answer the question from the given documents, please state that you do not have an answer.\n
"""

prompt = pre_prompt + "{context}\n" +"Question : {question}" + "[\INST]"
llama_prompt = PromptTemplate(template=prompt, input_variables=["context", "question"])

chain = LLMChain(llm=llm, prompt=llama_prompt)

result = chain({ "context" : "Extract the named entity information from below sentences:",
                "question": Testing_message
                 })

The result is:

  Sure, I'd be happy to help! Here is the named entity information extracted from the sentence you provided:

* Stoxx Europe 600 index
* Europe
* year

I hope this helps! Let me know if you have any other questions.

In fact, the template below is the one that strictly follows the Llama 2 training procedure. With it, you can customize the system message more flexibly, although the response may look similar to that of the simplified version above.

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
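
As an illustration, a small helper can fill this template in directly. The function name and the example system prompt below are made up for this sketch; the <s> token is omitted because llama.cpp normally prepends the BOS token itself when tokenizing.

def build_llama2_prompt(system_prompt, user_message):
    # Fill the official Llama 2 chat template with a system prompt and a single user turn.
    return (
        "[INST] <<SYS>>\n" + system_prompt + "\n<</SYS>>\n\n"
        + user_message + " [/INST]"
    )

print(build_llama2_prompt(
    "You are a concise assistant that only returns named entities.",
    "Extract the named entity information from below text: " + Testing_message,
))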

2.3 Prompt with Chat History

Llama 2 does not have memory, so the prompts above would not let the model answer a question based on the chat history. To address this, we need to embed the chat history into the prompt so that the model is exposed to it.

Below is a vanilla version that works. First, run the first round of question answering:

def create_chat_prompt(pre_prompt, chat_messages):
  # Append every historical turn to the prompt so the model can see the conversation so far.
  for dict_message in chat_messages:
      if dict_message["role"] == "user":
          pre_prompt += "User: " + dict_message["content"] + "\n\n"
      else:
          pre_prompt += "Assistant: " + dict_message["content"] + "\n\n"

  prompt = pre_prompt + "User : {question}" + " [/INST]"
  return prompt

messages = [{"role": "assistant", "content": "How may I assist you today?"}]

# creating prompt for large language model
pre_prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.

If you cannot answer the question from the given documents, please state that you do not have an answer.\n
"""


llama_chat_prompt = PromptTemplate(template=prompt, input_variables=["question"])

chat_chain = LLMChain(llm=llm, prompt=llama_chat_prompt)

chat_question = { "question": "What do you think are the most beautiful things in the Autumn in the U.S.?"}

messages.append({"role": "user", "content": chat_question['question']})

response = chat_chain(chat_question)

messages.append({"role": "Assistant", "content": response['text']})

The response is:

I'm here to help you with your question. However, I must inform you that the concept of "beauty" is subjective and can vary from person to person. What one person finds beautiful may not be the same for another.
That being said, Autumn in the U.S. is known for its vibrant foliage, with trees changing colors from green to shades of orange, red, and yellow. The most beautiful things about Autumn in the U.S. can be observed through the following:
1. Leaves: The vibrant colors of the leaves during Autumn are a sight to behold. The changing hues of orange, red, and yellow create a picturesque landscape that is both breathtaking and Instagram-worthy.
2. Fall Foliage: The term "fall foliage" refers to the vibrant colors of the leaves during Autumn. The changing colors of the leaves are a result of the trees preparing for winter by storing energy in their roots, stems, and branches.
3. Pumpkins: Pumpkins are a staple of Autumn in the U.S., often used in decorations

This is a fairly standard response. Now let's ask a follow-up question that builds on it:

prompt = create_chat_prompt(pre_prompt=pre_prompt, 
                            chat_messages=messages)
                            
llama_chat_prompt = PromptTemplate(template=prompt, input_variables=["question"])

chat_chain = LLMChain(llm=llm, prompt=llama_chat_prompt)


chat_question = { "question": "For the 2nd point you mentioned above, can you please make a 3 day travel plan?"}
messages.append({"role": "user", "content": chat_question})

response = chat_chain(chat_question)

messages.append({"role": "Assistant", "content": response['text']})

As you can see, the question does not specify what the 2nd point is, yet the model grasps it from the chat history. The response is below:

Of course! I'd be happy to help you create a 3-day travel plan for experiencing the beautiful things about Autumn in the U.S. Here is a sample itinerary:
Day 1:
* Stop 1: Take a scenic drive through the Adirondack Mountains in upstate New York. The mountains offer breathtaking views of the changing leaves, and there are many scenic overlooks and hiking trails to explore.
* Stop 2: Visit the Hudson River Valley, which is known for its picturesque towns, farms, and vineyards. Take a stroll through the charming streets of Cold Spring or Beacon, and enjoy the fall foliage along the riverfront.
Day 2:
* Stop 1: Head to New England, specifically Vermont or New Hampshire, for some of the most spectacular fall foliage in the country. Take a drive through the Green Mountains or White Mountains, and stop at scenic overlooks and hiking trails along the way.
* Stop 2: Visit the coastal towns of Maine, such as Kennebunkport or Camden

3 Summary

Some of the snippets were deliberately left unwrapped for demo purposes (a consolidated sketch follows below), but you can see that by adding a system message and the chat history to the prompt, Llama 2 becomes noticeably more intelligent and helpful.
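
For reference, here is one way the pieces could be folded into a single reusable helper, building on the create_chat_prompt function and the llm object defined earlier. The name chat_once is made up for this sketch.

def chat_once(question, messages, pre_prompt):
    # Rebuild the prompt from the running history, ask the model, and record both turns.
    prompt = create_chat_prompt(pre_prompt=pre_prompt, chat_messages=messages)
    llama_chat_prompt = PromptTemplate(template=prompt, input_variables=["question"])
    chat_chain = LLMChain(llm=llm, prompt=llama_chat_prompt)

    messages.append({"role": "user", "content": question})
    response = chat_chain({"question": question})
    messages.append({"role": "assistant", "content": response["text"]})
    return response["text"]

# Usage:
# messages = [{"role": "assistant", "content": "How may I assist you today?"}]
# chat_once("What are the most beautiful things about Autumn in the U.S.?", messages, pre_prompt)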

So far, we have covered topics of Llama 2 regarding:

  • Fast inference using GPUs
  • Better prompt tactics for reasonable response
  • Chat with Llama 2
  • RAG for domain knowledge question & answering

This means that many useful apps powered by Llama 2 can be built with the tech stack above. Stay tuned for more!

Reference

How to Prompt Llama 2: "Llama 2 is here - get it on Hugging Face", Hugging Face blog.