This commit is contained in:
Mason Daugherty
2025-09-11 15:12:38 -04:00
parent 38afeddcb6
commit 83d938593b

View File

@@ -1,439 +1,455 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3716230e",
"metadata": {},
"source": [
"# RAG Pipeline with MLflow Tracking, Tracing & Evaluation\n",
"\n",
"This notebook demonstrates how to build a complete Retrieval-Augmented Generation (RAG) pipeline using LangChain and integrate it with MLflow for experiment tracking, tracing, and evaluation.\n",
"\n",
"\n",
"- **RAG Pipeline Construction**: Build a complete RAG system using LangChain components\n",
"- **MLflow Integration**: Track experiments, parameters, and artifacts\n",
"- **Tracing**: Monitor inputs, outputs, retrieved documents, scores, prompts, and timings\n",
"- **Evaluation**: Use MLflow's built-in scorers to assess RAG performance\n",
"- **Best Practices**: Implement proper configuration management and reproducible experiments\n",
"\n",
"We'll build a RAG system that can answer questions about academic papers by:\n",
"1. Loading and chunking documents from ArXiv\n",
"2. Creating embeddings and a vector store\n",
"3. Setting up a retrieval-augmented generation chain\n",
"4. Tracking all experiments with MLflow\n",
"5. Evaluating the system's performance\n",
"\n",
"![System Diagram](https://miro.medium.com/v2/resize:fit:720/format:webp/1*eiw86PP4hrBBxhjTjP0JUQ.png)"
]
},
{
"cell_type": "markdown",
"id": "2f7561c4",
"metadata": {},
"source": [
"#### Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0814ebe9",
"metadata": {},
"outputs": [],
"source": [
"%pip install -U langchain mlflow langchain-community arxiv pymupdf langchain-text-splitters langchain-openai"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "747399b6",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import mlflow\n",
"from mlflow.genai.scorers import RelevanceToQuery, Correctness, ExpectationsGuidelines\n",
"from langchain_community.document_loaders import ArxivLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain_core.vectorstores import InMemoryVectorStore\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import RunnableLambda, RunnablePassthrough\n",
"from langchain_core.output_parsers import StrOutputParser"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4141ee05",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = \"<YOUR OPENAI API KEY>\"\n",
"\n",
"mlflow.set_experiment(\"LangChain-RAG-MLflow\")\n",
"mlflow.langchain.autolog()"
]
},
{
"cell_type": "markdown",
"id": "dd5eb41b",
"metadata": {},
"source": [
"Define all hyperparameters and configuration in a centralized dictionary. This makes it easy to:\n",
"- Track different experiment configurations\n",
"- Reproduce results\n",
"- Perform hyperparameter tuning\n",
"\n",
"**Key Parameters**:\n",
"- `chunk_size`: Size of text chunks for document splitting\n",
"- `chunk_overlap`: Overlap between consecutive chunks\n",
"- `retriever_k`: Number of documents to retrieve\n",
"- `embeddings_model`: OpenAI embedding model\n",
"- `llm`: Language model for generation\n",
"- `temperature`: Sampling temperature for the LLM"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6dcdc5d8",
"metadata": {},
"outputs": [],
"source": [
"CONFIG = {\n",
" \"chunk_size\": 400,\n",
" \"chunk_overlap\": 80,\n",
" \"retriever_k\": 3,\n",
" \"embeddings_model\": \"text-embedding-3-small\",\n",
" \"system_prompt\": \"You are a helpful assistant. Use the following context to answer the question. Use three sentences maximum and keep the answer concise.\",\n",
" \"llm\": \"gpt-5-nano\",\n",
" \"temperature\": 0,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "8a2985f1",
"metadata": {},
"source": [
"#### ArXiv Dcoument Loading and Processing"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1f32aa36",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\\nconvolutional neural networks in an encoder-decoder configuration. The best\\nperforming models also connect the encoder and decoder through an attention\\nmechanism. We propose a new simple network architecture, the Transformer, based\\nsolely on attention mechanisms, dispensing with recurrence and convolutions\\nentirely. Experiments on two machine translation tasks show these models to be\\nsuperior in quality while being more parallelizable and requiring significantly\\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\\nEnglish-to-German translation task, improving over the existing best results,\\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\\ntranslation task, our model establishes a new single-model state-of-the-art\\nBLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction\\nof the training costs of the best models from the literature. We show that the\\nTransformer generalizes well to other tasks by applying it successfully to\\nEnglish constituency parsing both with large and limited training data.'}\n"
]
}
],
"source": [
"# Load documents from ArXiv\n",
"loader = ArxivLoader(\n",
" query=\"1706.03762\",\n",
" load_max_docs=1,\n",
")\n",
"docs = loader.load()\n",
"print(docs[0].metadata)\n",
"\n",
"# Split documents into chunks\n",
"splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=CONFIG[\"chunk_size\"],\n",
" chunk_overlap=CONFIG[\"chunk_overlap\"],\n",
")\n",
"chunks = splitter.split_documents(docs)\n",
"\n",
"# Join chunks into a single string\n",
"def join_chunks(chunks):\n",
" return \"\\n\\n\".join([chunk.page_content for chunk in chunks])\n"
]
},
{
"cell_type": "markdown",
"id": "6e194ab4",
"metadata": {},
"source": [
"#### Vector Store and Retriever Setup"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "26dfbeaa",
"metadata": {},
"outputs": [],
"source": [
"# Create embeddings\n",
"embeddings = OpenAIEmbeddings(model=CONFIG[\"embeddings_model\"])\n",
"\n",
"# Create vector store from documents\n",
"vectorstore = InMemoryVectorStore.from_documents(\n",
" chunks,\n",
" embedding=embeddings,\n",
")\n",
"\n",
"# Create retriever\n",
"retriever = vectorstore.as_retriever(search_kwargs={\"k\": CONFIG[\"retriever_k\"]})"
]
},
{
"cell_type": "markdown",
"id": "bc1f181b",
"metadata": {},
"source": [
"#### RAG Chain Construction using [LCEL](https://python.langchain.com/docs/concepts/lcel/)\n",
"\n",
"Flow:\n",
"1. Query → Retriever (finds relevant chunks)\n",
"2. Chunks → join_chunks (creates context)\n",
"3. Context + Query → Prompt Template\n",
"4. Prompt → Language Model → Response\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6a810dc3",
"metadata": {},
"outputs": [],
"source": [
"# Initialize the language model\n",
"llm = ChatOpenAI(model=CONFIG[\"llm\"], temperature=CONFIG[\"temperature\"])\n",
"\n",
"# Create the prompt template\n",
"prompt = ChatPromptTemplate.from_messages([\n",
" (\"system\",CONFIG[\"system_prompt\"] + \"\\n\\nContext:\\n{context}\\n\\n\"),\n",
" (\"human\", \"\\n{question}\\n\"),\n",
"])\n",
"\n",
"# Construct the RAG chain\n",
"rag_chain = (\n",
" {\n",
" \"context\": retriever | RunnableLambda(join_chunks),\n",
" \"question\": RunnablePassthrough(),\n",
" }\n",
" |prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c04bd019",
"metadata": {},
"source": [
"#### Prediction Function with MLflow Tracing\n",
"\n",
"Create a prediction function decorated with `@mlflow.trace` to automatically log:\n",
"- Input queries\n",
"- Retrieved documents\n",
"- Generated responses\n",
"- Execution time\n",
"- Chain intermediate steps"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7b45fc04",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: What is the main idea of the paper?\n",
"Response: The main idea is to replace recurrent/convolutional sequence models with a pure attention-based architecture called the Transformer. It uses self-attention to model dependencies between all positions in the input and output, enabling full parallelization and better handling of long-range relations. This approach achieves strong results on translation and can extend to other modalities.\n"
]
}
],
"source": [
"@mlflow.trace\n",
"def predict_fn(question: str) -> str:\n",
" return rag_chain.invoke(question)\n",
"\n",
"# Test the prediction function\n",
"sample_question = \"What is the main idea of the paper?\"\n",
"response = predict_fn(sample_question)\n",
"print(f\"Question: {sample_question}\")\n",
"print(f\"Response: {response}\")"
]
},
{
"cell_type": "markdown",
"id": "421469de",
"metadata": {},
"source": [
"#### Evaluation Dataset and Scoring\n",
"\n",
"Define an evaluation dataset and run systematic evaluation using [MLflow's built-in scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/#available-scorers):\n",
"\n",
"<u>Evaluation Components:</u>\n",
"- **Dataset**: Questions with expected concepts and facts\n",
"- **Scorers**: \n",
" - `RelevanceToQuery`: Measures how relevant the response is to the question\n",
" - `Correctness`: Evaluates factual accuracy of the response\n",
" - `ExpectationsGuidelines`: Checks that output matches expectation guidelines\n",
"\n",
"<u>Best Practices:</u>\n",
"- Create diverse test cases covering different query types\n",
"- Include expected concepts to guide evaluation\n",
"- Use multiple scoring metrics for comprehensive assessment"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5c1dc4f2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025/08/23 20:14:39 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.\n",
"2025/08/23 20:14:39 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2b6c6687efa24796b39c7951d589d481",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Evaluating: 0%| | 0/3 [Elapsed: 00:00, Remaining: ?] "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"✨ Evaluation completed.\n",
"\n",
"Metrics and evaluation results are logged to the MLflow run:\n",
" Run name: \u001b[94mbaseline_eval\u001b[0m\n",
" Run ID: \u001b[94ma2218d9f24c9415f8040d3b77af103a9\u001b[0m\n",
"\n",
"To view the detailed evaluation results with sample-wise scores,\n",
"open the \u001b[93m\u001b[1mTraces\u001b[0m tab in the Run page in the MLflow UI.\n",
"\n"
]
}
],
"source": [
"# Define evaluation dataset\n",
"eval_dataset = [\n",
" {\n",
" \"inputs\": {\"question\": \"What is the main idea of the paper?\"},\n",
" \"expectations\": {\n",
" \"key_concepts\": [\"attention mechanism\", \"transformer\", \"neural network\"],\n",
" \"expected_facts\": [\"attention mechanism is a key component of the transformer model\"],\n",
" \"guidelines\": [\"The response must be factual and concise\"],\n",
" }\n",
" },\n",
" {\n",
" \"inputs\": {\"question\": \"What's the difference between a transformer and a recurrent neural network?\"},\n",
" \"expectations\": {\n",
" \"key_concepts\": [\"sequential\", \"attention mechanism\", \"hidden state\"],\n",
" \"expected_facts\": [\"transformer processes data in parallel while RNN processes data sequentially\"],\n",
" \"guidelines\": [\"The response must be factual and focus on the difference between the two models\"],\n",
" }\n",
" },\n",
" {\n",
" \"inputs\": {\"question\": \"What does the attention mechanism do?\"},\n",
" \"expectations\": {\n",
" \"key_concepts\": [\"query\", \"key\", \"value\", \"relationship\", \"similarity\"],\n",
" \"expected_facts\": [\"attention allows the model to weigh the importance of different parts of the input sequence when processing it\"],\n",
" \"guidelines\": [\"The response must be factual and explain the concept of attention\"],\n",
" }\n",
" }\n",
"]\n",
"\n",
"# Run evaluation with MLflow\n",
"with mlflow.start_run(run_name=\"baseline_eval\") as run:\n",
" # Log configuration parameters\n",
" mlflow.log_params(CONFIG)\n",
"\n",
" # Run evaluation\n",
" results = mlflow.genai.evaluate(\n",
" data=eval_dataset,\n",
" predict_fn=predict_fn,\n",
" scorers=[RelevanceToQuery(), Correctness(), ExpectationsGuidelines()],\n",
" )\n"
]
},
{
"cell_type": "markdown",
"id": "52b137c7",
"metadata": {},
"source": [
"#### Launch MLflow UI to check out the results\n",
"\n",
"<u>What you'll see in the UI:</u>\n",
"- **Experiments**: Compare different RAG configurations\n",
"- **Runs**: Individual experiment runs with metrics and parameters\n",
"- **Traces**: Detailed execution traces showing retrieval and generation steps\n",
"- **Evaluation Results**: Scoring metrics and detailed comparisons\n",
"- **Artifacts**: Saved models, datasets, and other files\n",
"\n",
"Navigate to `http://localhost:5000` after running the command below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "817c3799",
"metadata": {},
"outputs": [],
"source": [
"!mlflow ui"
]
},
{
"cell_type": "markdown",
"id": "c75861e3",
"metadata": {},
"source": [
"You should see something like this\n",
"\n",
"![MLflow UI image](https://miro.medium.com/v2/resize:fit:720/format:webp/1*Cx7MMy53pAP7150x_hvztA.png)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
"cells": [
{
"cell_type": "markdown",
"id": "3716230e",
"metadata": {},
"source": [
"# RAG Pipeline with MLflow Tracking, Tracing & Evaluation\n",
"\n",
"This notebook demonstrates how to build a complete Retrieval-Augmented Generation (RAG) pipeline using LangChain and integrate it with MLflow for experiment tracking, tracing, and evaluation.\n",
"\n",
"\n",
"- **RAG Pipeline Construction**: Build a complete RAG system using LangChain components\n",
"- **MLflow Integration**: Track experiments, parameters, and artifacts\n",
"- **Tracing**: Monitor inputs, outputs, retrieved documents, scores, prompts, and timings\n",
"- **Evaluation**: Use MLflow's built-in scorers to assess RAG performance\n",
"- **Best Practices**: Implement proper configuration management and reproducible experiments\n",
"\n",
"We'll build a RAG system that can answer questions about academic papers by:\n",
"1. Loading and chunking documents from ArXiv\n",
"2. Creating embeddings and a vector store\n",
"3. Setting up a retrieval-augmented generation chain\n",
"4. Tracking all experiments with MLflow\n",
"5. Evaluating the system's performance\n",
"\n",
"![System Diagram](https://miro.medium.com/v2/resize:fit:720/format:webp/1*eiw86PP4hrBBxhjTjP0JUQ.png)"
]
},
"nbformat": 4,
"nbformat_minor": 5
{
"cell_type": "markdown",
"id": "2f7561c4",
"metadata": {},
"source": [
"#### Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0814ebe9",
"metadata": {},
"outputs": [],
"source": [
"%pip install -U langchain mlflow langchain-community arxiv pymupdf langchain-text-splitters langchain-openai"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "747399b6",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import mlflow\n",
"from mlflow.genai.scorers import RelevanceToQuery, Correctness, ExpectationsGuidelines\n",
"from langchain_community.document_loaders import ArxivLoader\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain_core.vectorstores import InMemoryVectorStore\n",
"from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import RunnableLambda, RunnablePassthrough\n",
"from langchain_core.output_parsers import StrOutputParser"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4141ee05",
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"OPENAI_API_KEY\"] = \"<YOUR OPENAI API KEY>\"\n",
"\n",
"mlflow.set_experiment(\"LangChain-RAG-MLflow\")\n",
"mlflow.langchain.autolog()"
]
},
{
"cell_type": "markdown",
"id": "dd5eb41b",
"metadata": {},
"source": [
"Define all hyperparameters and configuration in a centralized dictionary. This makes it easy to:\n",
"- Track different experiment configurations\n",
"- Reproduce results\n",
"- Perform hyperparameter tuning\n",
"\n",
"**Key Parameters**:\n",
"- `chunk_size`: Size of text chunks for document splitting\n",
"- `chunk_overlap`: Overlap between consecutive chunks\n",
"- `retriever_k`: Number of documents to retrieve\n",
"- `embeddings_model`: OpenAI embedding model\n",
"- `llm`: Language model for generation\n",
"- `temperature`: Sampling temperature for the LLM"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6dcdc5d8",
"metadata": {},
"outputs": [],
"source": [
"CONFIG = {\n",
" \"chunk_size\": 400,\n",
" \"chunk_overlap\": 80,\n",
" \"retriever_k\": 3,\n",
" \"embeddings_model\": \"text-embedding-3-small\",\n",
" \"system_prompt\": \"You are a helpful assistant. Use the following context to answer the question. Use three sentences maximum and keep the answer concise.\",\n",
" \"llm\": \"gpt-5-nano\",\n",
" \"temperature\": 0,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "8a2985f1",
"metadata": {},
"source": [
"#### ArXiv Dcoument Loading and Processing"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1f32aa36",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\\nconvolutional neural networks in an encoder-decoder configuration. The best\\nperforming models also connect the encoder and decoder through an attention\\nmechanism. We propose a new simple network architecture, the Transformer, based\\nsolely on attention mechanisms, dispensing with recurrence and convolutions\\nentirely. Experiments on two machine translation tasks show these models to be\\nsuperior in quality while being more parallelizable and requiring significantly\\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\\nEnglish-to-German translation task, improving over the existing best results,\\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\\ntranslation task, our model establishes a new single-model state-of-the-art\\nBLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction\\nof the training costs of the best models from the literature. We show that the\\nTransformer generalizes well to other tasks by applying it successfully to\\nEnglish constituency parsing both with large and limited training data.'}\n"
]
}
],
"source": [
"# Load documents from ArXiv\n",
"loader = ArxivLoader(\n",
" query=\"1706.03762\",\n",
" load_max_docs=1,\n",
")\n",
"docs = loader.load()\n",
"print(docs[0].metadata)\n",
"\n",
"# Split documents into chunks\n",
"splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=CONFIG[\"chunk_size\"],\n",
" chunk_overlap=CONFIG[\"chunk_overlap\"],\n",
")\n",
"chunks = splitter.split_documents(docs)\n",
"\n",
"\n",
"# Join chunks into a single string\n",
"def join_chunks(chunks):\n",
" return \"\\n\\n\".join([chunk.page_content for chunk in chunks])"
]
},
{
"cell_type": "markdown",
"id": "6e194ab4",
"metadata": {},
"source": [
"#### Vector Store and Retriever Setup"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "26dfbeaa",
"metadata": {},
"outputs": [],
"source": [
"# Create embeddings\n",
"embeddings = OpenAIEmbeddings(model=CONFIG[\"embeddings_model\"])\n",
"\n",
"# Create vector store from documents\n",
"vectorstore = InMemoryVectorStore.from_documents(\n",
" chunks,\n",
" embedding=embeddings,\n",
")\n",
"\n",
"# Create retriever\n",
"retriever = vectorstore.as_retriever(search_kwargs={\"k\": CONFIG[\"retriever_k\"]})"
]
},
{
"cell_type": "markdown",
"id": "bc1f181b",
"metadata": {},
"source": [
"#### RAG Chain Construction using [LCEL](https://python.langchain.com/docs/concepts/lcel/)\n",
"\n",
"Flow:\n",
"1. Query → Retriever (finds relevant chunks)\n",
"2. Chunks → join_chunks (creates context)\n",
"3. Context + Query → Prompt Template\n",
"4. Prompt → Language Model → Response\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6a810dc3",
"metadata": {},
"outputs": [],
"source": [
"# Initialize the language model\n",
"llm = ChatOpenAI(model=CONFIG[\"llm\"], temperature=CONFIG[\"temperature\"])\n",
"\n",
"# Create the prompt template\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", CONFIG[\"system_prompt\"] + \"\\n\\nContext:\\n{context}\\n\\n\"),\n",
" (\"human\", \"\\n{question}\\n\"),\n",
" ]\n",
")\n",
"\n",
"# Construct the RAG chain\n",
"rag_chain = (\n",
" {\n",
" \"context\": retriever | RunnableLambda(join_chunks),\n",
" \"question\": RunnablePassthrough(),\n",
" }\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c04bd019",
"metadata": {},
"source": [
"#### Prediction Function with MLflow Tracing\n",
"\n",
"Create a prediction function decorated with `@mlflow.trace` to automatically log:\n",
"- Input queries\n",
"- Retrieved documents\n",
"- Generated responses\n",
"- Execution time\n",
"- Chain intermediate steps"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7b45fc04",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: What is the main idea of the paper?\n",
"Response: The main idea is to replace recurrent/convolutional sequence models with a pure attention-based architecture called the Transformer. It uses self-attention to model dependencies between all positions in the input and output, enabling full parallelization and better handling of long-range relations. This approach achieves strong results on translation and can extend to other modalities.\n"
]
}
],
"source": [
"@mlflow.trace\n",
"def predict_fn(question: str) -> str:\n",
" return rag_chain.invoke(question)\n",
"\n",
"\n",
"# Test the prediction function\n",
"sample_question = \"What is the main idea of the paper?\"\n",
"response = predict_fn(sample_question)\n",
"print(f\"Question: {sample_question}\")\n",
"print(f\"Response: {response}\")"
]
},
{
"cell_type": "markdown",
"id": "421469de",
"metadata": {},
"source": [
"#### Evaluation Dataset and Scoring\n",
"\n",
"Define an evaluation dataset and run systematic evaluation using [MLflow's built-in scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/#available-scorers):\n",
"\n",
"<u>Evaluation Components:</u>\n",
"- **Dataset**: Questions with expected concepts and facts\n",
"- **Scorers**: \n",
" - `RelevanceToQuery`: Measures how relevant the response is to the question\n",
" - `Correctness`: Evaluates factual accuracy of the response\n",
" - `ExpectationsGuidelines`: Checks that output matches expectation guidelines\n",
"\n",
"<u>Best Practices:</u>\n",
"- Create diverse test cases covering different query types\n",
"- Include expected concepts to guide evaluation\n",
"- Use multiple scoring metrics for comprehensive assessment"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "5c1dc4f2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025/08/23 20:14:39 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.\n",
"2025/08/23 20:14:39 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2b6c6687efa24796b39c7951d589d481",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Evaluating: 0%| | 0/3 [Elapsed: 00:00, Remaining: ?] "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"✨ Evaluation completed.\n",
"\n",
"Metrics and evaluation results are logged to the MLflow run:\n",
" Run name: \u001b[94mbaseline_eval\u001b[0m\n",
" Run ID: \u001b[94ma2218d9f24c9415f8040d3b77af103a9\u001b[0m\n",
"\n",
"To view the detailed evaluation results with sample-wise scores,\n",
"open the \u001b[93m\u001b[1mTraces\u001b[0m tab in the Run page in the MLflow UI.\n",
"\n"
]
}
],
"source": [
"# Define evaluation dataset\n",
"eval_dataset = [\n",
" {\n",
" \"inputs\": {\"question\": \"What is the main idea of the paper?\"},\n",
" \"expectations\": {\n",
" \"key_concepts\": [\"attention mechanism\", \"transformer\", \"neural network\"],\n",
" \"expected_facts\": [\n",
" \"attention mechanism is a key component of the transformer model\"\n",
" ],\n",
" \"guidelines\": [\"The response must be factual and concise\"],\n",
" },\n",
" },\n",
" {\n",
" \"inputs\": {\n",
" \"question\": \"What's the difference between a transformer and a recurrent neural network?\"\n",
" },\n",
" \"expectations\": {\n",
" \"key_concepts\": [\"sequential\", \"attention mechanism\", \"hidden state\"],\n",
" \"expected_facts\": [\n",
" \"transformer processes data in parallel while RNN processes data sequentially\"\n",
" ],\n",
" \"guidelines\": [\n",
" \"The response must be factual and focus on the difference between the two models\"\n",
" ],\n",
" },\n",
" },\n",
" {\n",
" \"inputs\": {\"question\": \"What does the attention mechanism do?\"},\n",
" \"expectations\": {\n",
" \"key_concepts\": [\"query\", \"key\", \"value\", \"relationship\", \"similarity\"],\n",
" \"expected_facts\": [\n",
" \"attention allows the model to weigh the importance of different parts of the input sequence when processing it\"\n",
" ],\n",
" \"guidelines\": [\n",
" \"The response must be factual and explain the concept of attention\"\n",
" ],\n",
" },\n",
" },\n",
"]\n",
"\n",
"# Run evaluation with MLflow\n",
"with mlflow.start_run(run_name=\"baseline_eval\") as run:\n",
" # Log configuration parameters\n",
" mlflow.log_params(CONFIG)\n",
"\n",
" # Run evaluation\n",
" results = mlflow.genai.evaluate(\n",
" data=eval_dataset,\n",
" predict_fn=predict_fn,\n",
" scorers=[RelevanceToQuery(), Correctness(), ExpectationsGuidelines()],\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "52b137c7",
"metadata": {},
"source": [
"#### Launch MLflow UI to check out the results\n",
"\n",
"<u>What you'll see in the UI:</u>\n",
"- **Experiments**: Compare different RAG configurations\n",
"- **Runs**: Individual experiment runs with metrics and parameters\n",
"- **Traces**: Detailed execution traces showing retrieval and generation steps\n",
"- **Evaluation Results**: Scoring metrics and detailed comparisons\n",
"- **Artifacts**: Saved models, datasets, and other files\n",
"\n",
"Navigate to `http://localhost:5000` after running the command below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "817c3799",
"metadata": {},
"outputs": [],
"source": [
"!mlflow ui"
]
},
{
"cell_type": "markdown",
"id": "c75861e3",
"metadata": {},
"source": [
"You should see something like this\n",
"\n",
"![MLflow UI image](https://miro.medium.com/v2/resize:fit:720/format:webp/1*Cx7MMy53pAP7150x_hvztA.png)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}