# RAG application running locally on an Intel Xeon CPU using LangChain and open-source models

Author - Pratool Bharti (pratool.bharti@intel.com)

In this cookbook, we use LangChain tools and open-source models to run a RAG application locally on a CPU. The notebook has been validated on an Intel Xeon 8480+ CPU. We implement a RAG pipeline around the Llama 2 model to answer questions about Intel's Q1 2024 earnings release.

**Create a conda or virtualenv environment with Python >= 3.10 and install the following libraries**
<br>

`pip install --upgrade langchain langchain-community langchainhub langchain-chroma bs4 gpt4all pypdf pysqlite3-binary` <br>
`pip install llama-cpp-python   --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu`

**Register pysqlite3 as `sqlite3` in `sys.modules`, since ChromaDB requires a recent sqlite3.**

In [1]:
__import__("pysqlite3")
import sys

sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
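
Whether this swap is needed depends on the SQLite version bundled with your Python. The sketch below compares a version tuple against an assumed minimum of 3.35.0 (the figure Chroma's documentation has cited; verify it for your release). `needs_sqlite_shim` and `MIN_SQLITE` are hypothetical names, not part of any library.

```python
import sqlite3

# Assumed minimum SQLite version for Chroma; check the Chroma docs for your release.
MIN_SQLITE = (3, 35, 0)


def needs_sqlite_shim(version_info, minimum=MIN_SQLITE):
    """Return True when the given SQLite version tuple is older than `minimum`."""
    return tuple(version_info) < tuple(minimum)


if __name__ == "__main__":
    # Reports the SQLite library that the standard `sqlite3` module is linked against.
    print("SQLite version in use:", sqlite3.sqlite_version)
    print("pysqlite3 shim needed:", needs_sqlite_shim(sqlite3.sqlite_version_info))
```

Run it in a fresh interpreter, before the swap cell, to inspect the system library rather than pysqlite3.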

**Import essential components from langchain to load and split data**

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

**Download the Intel Q1 2024 earnings release**

In [4]:
!wget  'https://d1io3yog0oux5.cloudfront.net/_11d435a500963f99155ee058df09f574/intel/db/887/9014/earnings_release/Q1+24_EarningsRelease_FINAL.pdf' -O intel_q1_2024_earnings.pdf

--2024-07-15 15:04:43--  https://d1io3yog0oux5.cloudfront.net/_11d435a500963f99155ee058df09f574/intel/db/887/9014/earnings_release/Q1+24_EarningsRelease_FINAL.pdf
Resolving proxy-dmz.intel.com (proxy-dmz.intel.com)... 10.7.211.16
Connecting to proxy-dmz.intel.com (proxy-dmz.intel.com)|10.7.211.16|:912... connected.
Proxy request sent, awaiting response... 200 OK
Length: 133510 (130K) [application/pdf]
Saving to: ‘intel_q1_2024_earnings.pdf’


2024-07-15 15:04:44 (24.6 MB/s) - ‘intel_q1_2024_earnings.pdf’ saved [133510/133510]



**Load the earnings release PDF with PyPDFLoader**

In [5]:
loader = PyPDFLoader("intel_q1_2024_earnings.pdf")
data = loader.load()

**Split the entire document into chunks of at most 500 characters each (`chunk_size` is measured in characters by default, not tokens)**

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
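
By default, `RecursiveCharacterTextSplitter` measures `chunk_size` with `len`, so the limit is in characters. The following is a simplified, standard-library sketch of the underlying idea (try coarse separators first, then fall back to finer ones). It is not LangChain's actual implementation, and `split_text` with its separator list is an illustrative assumption.

```python
def split_text(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Use the coarsest separator that actually occurs in the text.
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # A single piece is still too long: recurse with finer separators.
            chunks.extend(split_text(part, chunk_size, separators))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample_chunks = split_text("word " * 300, chunk_size=100)
print(len(sample_chunks), max(len(chunk) for chunk in sample_chunks))
```

To check the real splitter's output in this notebook, `max(len(s.page_content) for s in all_splits)` should not exceed 500.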

**Looking at the first split of the document**

In [7]:
all_splits[0]

Document(metadata={'source': 'intel_q1_2024_earnings.pdf', 'page': 0}, page_content='Intel Corporation\n2200 Mission College Blvd.\nSanta Clara, CA 95054-1549\n                                                         \nNews Release\n Intel Reports First -Quarter 2024  Financial Results\nNEWS SUMMARY\n▪First-quarter revenue of $12.7 billion , up 9%  year over year (YoY).\n▪First-quarter GAAP earnings (loss) per share (EPS) attributable to Intel was $(0.09) ; non-GAAP EPS \nattributable to Intel was $0.18 .')

**A major step in RAG is converting each document split into an embedding and storing it in a vector database so that searching for relevant documents is efficient.** <br>
**For that, we import the Chroma vector database from LangChain, along with the open-source GPT4All embeddings**

In [8]:
from langchain_chroma import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings

**Next, we download one of the most popular embedding models, "all-MiniLM-L6-v2". More details on the model are available at https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2**

In [10]:
model_name = "all-MiniLM-L6-v2.gguf2.f16.gguf"
gpt4all_kwargs = {"allow_download": True}  # boolean flag, not the string "True"
embeddings = GPT4AllEmbeddings(model_name=model_name, gpt4all_kwargs=gpt4all_kwargs)

**Store all the embeddings in the Chroma database**

In [11]:
vectorstore = Chroma.from_documents(documents=all_splits, embedding=embeddings)

**Now, let's find relevant splits from the documents related to the question**

In [12]:
question = "What is Intel CCG revenue in Q1 2024"
docs = vectorstore.similarity_search(question)
print(len(docs))

4
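
`similarity_search` returns four documents here because its default is `k=4`. Conceptually, the store embeds the question and ranks the stored chunk vectors by closeness to it. Below is a minimal sketch with toy two-dimensional vectors. Cosine similarity is an assumption made for illustration (the distance metric configured in Chroma may differ), and `cosine_similarity` and `top_k` are illustrative helpers rather than library functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: document 1 points the same way as the query, document 0 is orthogonal.
toy_docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_docs, k=2))
```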


**Look at the first retrieved document from the vector database**

In [13]:
docs[0]

Document(metadata={'page': 1, 'source': 'intel_q1_2024_earnings.pdf'}, page_content='Client Computing Group (CCG) $7.5 billion up31%\nData Center and AI (DCAI) $3.0 billion up5%\nNetwork and Edge (NEX) $1.4 billion down 8%\nTotal Intel Products revenue $11.9 billion up17%\nIntel Foundry $4.4 billion down 10%\nAll other:\nAltera $342 million down 58%\nMobileye $239 million down 48%\nOther $194 million up17%\nTotal all other revenue $775 million down 46%\nIntersegment eliminations $(4.4) billion\nTotal net revenue $12.7 billion up9%\nIntel Products Highlights')

**Download the Llama-2 model from Hugging Face and store it locally** <br>
**You can download different quantization variants of the Llama-2 model from the link below. We are using the Q8 version here (7.16GB).** <br>
https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

In [None]:
!huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q8_0.gguf --local-dir . --local-dir-use-symlinks False

**Import the LangChain components required to load the downloaded LLM**

In [14]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp

**Load the local Llama-2 model using the llama-cpp library**

In [16]:
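llm = LlamaCpp(
    model_path="llama-2-7b-chat.Q8_0.gguf",
    n_gpu_layers=-1,  # offloads all layers when a GPU build is used; the CPU-only wheel installed above should ignore it
    n_batch=512,  # number of prompt tokens processed per batch
    n_ctx=2048,  # context window size in tokens
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),  # stream tokens to stdout
    verbose=True,
)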

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32

**Now let's ask the Llama model the same question without showing it the earnings release.**

In [17]:
llm.invoke(question)

?
(NASDAQ:INTC)
Intel's CCG (Client Computing Group) revenue for Q1 2024 was $9.6 billion, a decrease of 35% from the previous quarter and a decrease of 42% from the same period last year.


llama_print_timings:        load time =     131.20 ms
llama_print_timings:      sample time =      16.05 ms /    68 runs   (    0.24 ms per token,  4236.76 tokens per second)
llama_print_timings: prompt eval time =     131.14 ms /    16 tokens (    8.20 ms per token,   122.01 tokens per second)
llama_print_timings:        eval time =    3225.00 ms /    67 runs   (   48.13 ms per token,    20.78 tokens per second)
llama_print_timings:       total time =    3466.40 ms /    83 tokens


"?\n(NASDAQ:INTC)\nIntel's CCG (Client Computing Group) revenue for Q1 2024 was $9.6 billion, a decrease of 35% from the previous quarter and a decrease of 42% from the same period last year."

**As you can see, the model gives wrong information. The correct answer is that CCG revenue in Q1 2024 was $7.5B. Now let's apply RAG using the earnings release document**

**In RAG, we modify the input prompt by adding relevant documents to the question. Here, we use one of the popular RAG prompts**

In [18]:
from langchain import hub

rag_prompt = hub.pull("rlm/rag-prompt")
rag_prompt.messages

[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]
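
The template above has two placeholders, `{question}` and `{context}`. As a rough illustration of what filling it in produces, the sketch below applies plain `str.format` to the same template text. `RAG_TEMPLATE` and `render_prompt` are hypothetical names, and LangChain's prompt classes do more than this (for example, wrapping the text in a message object).

```python
# Template text copied from the `rlm/rag-prompt` output shown above.
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)


def render_prompt(question, context):
    """Substitute the question and the retrieved context into the template."""
    return RAG_TEMPLATE.format(question=question, context=context)


example_prompt = render_prompt(
    question="What is Intel CCG revenue in Q1 2024",
    context="Client Computing Group (CCG) $7.5 billion up31%",
)
print(example_prompt)
```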

**Combine all retrieved documents into a single string**

In [19]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

**The last step is to create a chain with LangChain that forms an end-to-end pipeline. It takes the question and context as input.**

In [20]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnablePick

# Chain
chain = (
    RunnablePassthrough.assign(context=RunnablePick("context") | format_docs)
    | rag_prompt
    | llm
    | StrOutputParser()
)
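
To make the data flow through this chain easier to follow, here is a plain-Python sketch of the same sequence of steps: replace `context` with formatted text, build the prompt, call the model, and return a string. The prompt and model below (`toy_prompt`, `echo_llm`) are hypothetical stand-ins, and `assign_context` and `run_chain` only approximate what the LangChain runnables do.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the text of all documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def assign_context(inputs):
    """Keep every input key, replacing `context` with the formatted text."""
    return {**inputs, "context": format_docs(inputs["context"])}


def run_chain(inputs, render_prompt, llm):
    """Apply the steps in order: assign context, build prompt, call model, stringify."""
    prompt = render_prompt(assign_context(inputs))
    return str(llm(prompt))


# Hypothetical stand-ins for the prompt template and the model.
def toy_prompt(values):
    return f"Question: {values['question']}\nContext: {values['context']}\nAnswer:"


def echo_llm(prompt):
    return f"[model saw {len(prompt)} characters]"


toy_documents = [
    SimpleNamespace(page_content="CCG $7.5 billion"),
    SimpleNamespace(page_content="DCAI $3.0 billion"),
]
toy_result = run_chain(
    {"context": toy_documents, "question": "What is CCG revenue?"},
    toy_prompt,
    echo_llm,
)
print(toy_result)
```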

In [21]:
chain.invoke({"context": docs, "question": question})

Llama.generate: prefix-match hit


 Based on the provided context, Intel CCG revenue in Q1 2024 was $7.5 billion up 31%.


llama_print_timings:        load time =     131.20 ms
llama_print_timings:      sample time =       7.74 ms /    31 runs   (    0.25 ms per token,  4004.13 tokens per second)
llama_print_timings: prompt eval time =    2529.41 ms /   674 tokens (    3.75 ms per token,   266.46 tokens per second)
llama_print_timings:        eval time =    1542.94 ms /    30 runs   (   51.43 ms per token,    19.44 tokens per second)
llama_print_timings:       total time =    4123.68 ms /   704 tokens


' Based on the provided context, Intel CCG revenue in Q1 2024 was $7.5 billion up 31%.'

**Now the result is correct and matches the earnings release.** <br>
**To automate further, we create a chain that takes the question as input and uses a retriever, so that we don't need to retrieve documents separately**

In [22]:
retriever = vectorstore.as_retriever()
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
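
The dictionary at the start of `qa_chain` turns a single question string into the two values the prompt needs. A plain-Python view of that step is sketched below. The keyword-overlap retriever is a deliberately crude, hypothetical stand-in for the vector store retriever, and all helper names are illustrative.

```python
import re

# Toy corpus standing in for the stored chunks.
CHUNKS = [
    "Client Computing Group (CCG) $7.5 billion up 31%",
    "Data Center and AI (DCAI) $3.0 billion up 5%",
    "Network and Edge (NEX) $1.4 billion down 8%",
]


def tokenize(text):
    """Lowercase the text and extract alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9.]+", text.lower()))


def keyword_retriever(question, k=2):
    """Rank chunks by how many tokens they share with the question."""
    words = tokenize(question)
    ranked = sorted(CHUNKS, key=lambda c: len(words & tokenize(c)), reverse=True)
    return ranked[:k]


def join_chunks(chunks):
    return "\n\n".join(chunks)


def build_inputs(question, retriever, format_docs):
    """Mirror {"context": retriever | format_docs, "question": RunnablePassthrough()}."""
    return {"context": format_docs(retriever(question)), "question": question}


toy_inputs = build_inputs(
    "what is Intel DCAI revenue in Q1 2024?",
    lambda q: keyword_retriever(q, k=1),
    join_chunks,
)
print(toy_inputs)
```

The question is passed through unchanged, while the context is computed from it.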

**Now we only need to pass the question to the chain, and it fetches the context directly from the vector database to generate the answer**
<br>
**Let's try with another question**

In [26]:
qa_chain.invoke("what is Intel DCAI revenue in Q1 2024?")

Llama.generate: prefix-match hit


 According to the provided context, Intel DCAI revenue in Q1 2024 was $3.0 billion up 5%.


llama_print_timings:        load time =     131.20 ms
llama_print_timings:      sample time =       6.28 ms /    31 runs   (    0.20 ms per token,  4937.88 tokens per second)
llama_print_timings: prompt eval time =    2681.93 ms /   730 tokens (    3.67 ms per token,   272.19 tokens per second)
llama_print_timings:        eval time =    1471.07 ms /    30 runs   (   49.04 ms per token,    20.39 tokens per second)
llama_print_timings:       total time =    4206.77 ms /   760 tokens


' According to the provided context, Intel DCAI revenue in Q1 2024 was $3.0 billion up 5%.'
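
The `llama_print_timings` lines in the outputs above report elapsed time together with a token count, from which the per-token figures follow by simple division. The sketch below reproduces that arithmetic for the first `llm.invoke` call (eval time of 3225.00 ms over 67 runs); the helper names are illustrative.

```python
def tokens_per_second(elapsed_ms, tokens):
    """Throughput: tokens divided by elapsed time in seconds."""
    return tokens / (elapsed_ms / 1000.0)


def ms_per_token(elapsed_ms, tokens):
    """Latency: elapsed milliseconds divided by the number of tokens."""
    return elapsed_ms / tokens


# Figures taken from the "eval time" line of the first call in this notebook.
eval_ms, eval_runs = 3225.00, 67
print(f"{ms_per_token(eval_ms, eval_runs):.2f} ms per token")
print(f"{tokens_per_second(eval_ms, eval_runs):.2f} tokens per second")
```

In these logs, generation stays at roughly 20 tokens per second across all three calls, while the RAG calls spend most of their extra time evaluating the much longer prompt (674 and 730 tokens versus 16).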