community[minor]: New documents loader for visio files (with extension .vsdx) (#16171)

**Description**: New document loader for Visio files (with extension .vsdx).

A [Visio file](https://fr.wikipedia.org/wiki/Microsoft_Visio) (with extension .vsdx) is associated with Microsoft Visio, a diagramming application. It stores the structure, layout, and graphical elements of a diagram. The format is used to create and share visualizations in areas such as business, engineering, and computer science.

A Visio file can contain multiple pages. Some pages may serve as the background for others, and backgrounds can be nested across multiple layers. This loader extracts the text of each page together with the text of the pages it depends on, so each document contains all the text visible on that page, similar to what an OCR algorithm would produce.

**Dependencies**: xmltodict package
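As a rough illustration of the idea described above, the sketch below treats a .vsdx file as a zip archive of XML parts and collects the text of one page. It is not the code added by this commit: it uses the standard library's `xml.etree.ElementTree` instead of xmltodict, and the archive, the helper names (`build_fake_vsdx`, `extract_page_text`), and the page XML are synthetic assumptions made up for the example.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def build_fake_vsdx() -> bytes:
    """Build a tiny, synthetic zip archive laid out like a .vsdx package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zfile:
        zfile.writestr(
            "visio/pages/page1.xml",
            "<PageContents><Shapes>"
            "<Shape><Text>Hello</Text></Shape>"
            "<Shape><Text>World</Text></Shape>"
            "</Shapes></PageContents>",
        )
    return buffer.getvalue()


def extract_page_text(archive: bytes, page_path: str) -> str:
    """Return the text of every <Text> element in one page, one per line."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zfile:
        root = ET.fromstring(zfile.read(page_path))
    texts = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop any XML namespace
        if local_name == "Text":
            texts.append("".join(element.itertext()).strip())
    return "\n".join(texts)


page_text = extract_page_text(build_fake_vsdx(), "visio/pages/page1.xml")
print(page_text)
```

The real loader additionally reads `pages.xml`, `pages.xml.rels`, and `app.xml` to order the pages and to merge in the text of background pages.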
2025-08-29 14:37:21 +00:00 · 2024-01-23 07:07:03 +01:00 · 2024-01-23 07:07:03 +01:00 · 4b7969efc5
commit 4b7969efc5
parent fb41b68ea1
10 changed files with 801 additions and 0 deletions
--- a/docs/docs/integrations/document_loaders/example_data/fake.vsdx
+++ b/docs/docs/integrations/document_loaders/example_data/fake.vsdx
--- a/docs/docs/integrations/document_loaders/vsdx.ipynb
+++ b/docs/docs/integrations/document_loaders/vsdx.ipynb
@@ -0,0 +1,486 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Vsdx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> A [Visio file](https://fr.wikipedia.org/wiki/Microsoft_Visio) (with extension .vsdx) is associated with Microsoft Visio, a diagramming application. It stores the structure, layout, and graphical elements of a diagram. The format is used to create and share visualizations in areas such as business, engineering, and computer science."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A Visio file can contain multiple pages. Some pages may serve as the background for others, and backgrounds can be nested across multiple layers. This **loader** extracts the text of each page together with the text of the pages it depends on, so each document contains all the text visible on that page, similar to what an OCR algorithm would produce."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**WARNING**: Only Visio files with the **.vsdx** extension are compatible with this loader. Files with other extensions, such as .vsd, are not supported because they are not zip-compressed XML packages, which is what this loader reads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import VsdxLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = VsdxLoader(file_path=\"./example_data/fake.vsdx\")\n",
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Display loaded documents**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "------ Page 0 ------\n",
      "Title page : Summary\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Best Caption of the worl\n",
      "This is an arrow\n",
      "This is Earth\n",
      "This is a bounded arrow\n",
      "\n",
      "------ Page 1 ------\n",
      "Title page : Glossary\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "\n",
      "------ Page 2 ------\n",
      "Title page : blanket page\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "This file is a vsdx file\n",
      "First text\n",
      "Second text\n",
      "Third text\n",
      "\n",
      "------ Page 3 ------\n",
      "Title page : BLABLABLA\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Another RED arrow wow\n",
      "Arrow with point but red\n",
      "Green line\n",
      "User\n",
      "Captions\n",
      "Red arrow magic !\n",
      "Something white\n",
      "Something Red\n",
      "This a a completly useless diagramm, cool !!\n",
      "\n",
      "But this is for example !\n",
      "This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
      "This is a page with something...\n",
      "\n",
      "WAW I have learned something !\n",
      "This is a page with something...\n",
      "\n",
      "WAW I have learned something !\n",
      "\n",
      "X2\n",
      "\n",
      "------ Page 4 ------\n",
      "Title page : What a page !!\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Something white\n",
      "Something Red\n",
      "This a a completly useless diagramm, cool !!\n",
      "\n",
      "But this is for example !\n",
      "This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
      "Another RED arrow wow\n",
      "Arrow with point but red\n",
      "Green line\n",
      "User\n",
      "Captions\n",
      "Red arrow magic !\n",
      "\n",
      "------ Page 5 ------\n",
      "Title page : next page after previous one\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Another RED arrow wow\n",
      "Arrow with point but red\n",
      "Green line\n",
      "User\n",
      "Captions\n",
      "Red arrow magic !\n",
      "Something white\n",
      "Something Red\n",
      "This a a completly useless diagramm, cool !!\n",
      "\n",
      "But this is for example !\n",
      "This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\n",
      "\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0-\\u00a0incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in\n",
      "\n",
      "\n",
      "voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa\n",
      "*\n",
      "\n",
      "\n",
      "qui officia deserunt mollit anim id est laborum.\n",
      "\n",
      "------ Page 6 ------\n",
      "Title page : Connector Page\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Something white\n",
      "Something Red\n",
      "This a a completly useless diagramm, cool !!\n",
      "\n",
      "But this is for example !\n",
      "This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
      "\n",
      "------ Page 7 ------\n",
      "Title page : Useful ↔ Useless page\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Something white\n",
      "Something Red\n",
      "This a a completly useless diagramm, cool !!\n",
      "\n",
      "But this is for example !\n",
      "This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
      "Title of this document : BLABLABLA\n",
      "\n",
      "------ Page 8 ------\n",
      "Title page : Alone page\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Black cloud\n",
      "Unidirectional traffic primary path\n",
      "Unidirectional traffic backup path\n",
      "Encapsulation\n",
      "User\n",
      "Captions\n",
      "Bidirectional traffic\n",
      "Alone, sad\n",
      "Test of another page\n",
      "This is a \\\"bannier\\\"\n",
      "Tests of some exotics characters :\\u00a0\\u00e3\\u00e4\\u00e5\\u0101\\u0103 \\u00fc\\u2554\\u00a0 \\u00a0\\u00bc \\u00c7 \\u25d8\\u25cb\\u2642\\u266b\\u2640\\u00ee\\u2665\n",
      "This is ethernet\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "This is an empty case\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\n",
      "\\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0-\\u00a0 incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in\n",
      "\n",
      "\n",
      " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa \n",
      "*\n",
      "\n",
      "\n",
      "qui officia deserunt mollit anim id est laborum.\n",
      "\n",
      "------ Page 9 ------\n",
      "Title page : BG\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Best Caption of the worl\n",
      "This is an arrow\n",
      "This is Earth\n",
      "This is a bounded arrow\n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "\n",
      "------ Page 10 ------\n",
      "Title page : BG  + caption1\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Another RED arrow wow\n",
      "Arrow with point but red\n",
      "Green line\n",
      "User\n",
      "Captions\n",
      "Red arrow magic !\n",
      "Something white\n",
      "Something Red\n",
      "This a a completly useless diagramm, cool !!\n",
      "\n",
      "But this is for example !\n",
      "This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
      "Useful\\u2194 Useless page\\u00a0\n",
      "\n",
      "Tests of some exotics characters :\\u00a0\\u00e3\\u00e4\\u00e5\\u0101\\u0103 \\u00fc\\u2554\\u00a0\\u00a0\\u00bc \\u00c7 \\u25d8\\u25cb\\u2642\\u266b\\u2640\\u00ee\\u2665\n",
      "\n",
      "------ Page 11 ------\n",
      "Title page : BG+\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "\n",
      "------ Page 12 ------\n",
      "Title page : BG WITH CONTENT\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "\n",
      "\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. - Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "\n",
      "\n",
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "This is a page with a lot of text\n",
      "\n",
      "------ Page 13 ------\n",
      "Title page : 2nd caption with ____________________________________________________________________ content\n",
      "Source : ./example_data/fake.vsdx\n",
      "\n",
      "==> CONTENT <== \n",
      "Created by\n",
      "Created the\n",
      "Modified by\n",
      "Modified the\n",
      "Version\n",
      "Title\n",
      "Florian MOREL\n",
      "2024-01-14\n",
      "FLORIAN Morel\n",
      "Today\n",
      "0.0.0.0.0.1\n",
      "This is a title\n",
      "Another RED arrow wow\n",
      "Arrow with point but red\n",
      "Green line\n",
      "User\n",
      "Captions\n",
      "Red arrow magic !\n",
      "Something white\n",
      "Something Red\n",
      "This a a completly useless diagramm, cool !!\n",
      "\n",
      "But this is for example !\n",
      "This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
      "Only connectors on this page. This is the CoNNeCtor page\n"
     ]
    }
   ],
   "source": [
    "for doc in documents:\n",
    "    print(f\"\\n------ Page {doc.metadata['page']} ------\")\n",
    "    print(f\"Title page : {doc.metadata['page_name']}\")\n",
    "    print(f\"Source : {doc.metadata['source']}\")\n",
    "    print(\"\\n==> CONTENT <== \")\n",
    "    print(doc.page_content)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/libs/community/langchain_community/document_loaders/__init__.py
+++ b/libs/community/langchain_community/document_loaders/__init__.py
@@ -207,6 +207,7 @@ from langchain_community.document_loaders.unstructured import (
 from langchain_community.document_loaders.url import UnstructuredURLLoader
 from langchain_community.document_loaders.url_playwright import PlaywrightURLLoader
 from langchain_community.document_loaders.url_selenium import SeleniumURLLoader
 from langchain_community.document_loaders.vsdx import VsdxLoader
 from langchain_community.document_loaders.weather import WeatherDataLoader
 from langchain_community.document_loaders.web_base import WebBaseLoader
 from langchain_community.document_loaders.whatsapp_chat import WhatsAppChatLoader
@@ -394,6 +395,7 @@ __all__ = [
    "UnstructuredURLLoader",
    "UnstructuredWordDocumentLoader",
    "UnstructuredXMLLoader",
    "VsdxLoader",
    "WeatherDataLoader",
    "WebBaseLoader",
    "WhatsAppChatLoader",
--- a/libs/community/langchain_community/document_loaders/parsers/__init__.py
+++ b/libs/community/langchain_community/document_loaders/parsers/__init__.py
@@ -13,6 +13,7 @@ from langchain_community.document_loaders.parsers.pdf import (
    PyPDFium2Parser,
    PyPDFParser,
 )
 from langchain_community.document_loaders.parsers.vsdx import VsdxParser
 __all__ = [
    "AzureAIDocumentIntelligenceParser",
@@ -26,4 +27,5 @@ __all__ = [
    "PyMuPDFParser",
    "PyPDFium2Parser",
    "PyPDFParser",
    "VsdxParser",
 ]
--- a/libs/community/langchain_community/document_loaders/parsers/vsdx.py
+++ b/libs/community/langchain_community/document_loaders/parsers/vsdx.py
@@ -0,0 +1,205 @@
 import json
 import re
 import zipfile
 from abc import ABC
 from pathlib import Path
 from typing import Iterator, List, Set, Tuple
 from langchain_community.docstore.document import Document
 from langchain_community.document_loaders.base import BaseBlobParser
 from langchain_community.document_loaders.blob_loaders import Blob
 class VsdxParser(BaseBlobParser, ABC):
    def parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a vsdx file."""
        return self.lazy_parse(blob)
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Retrieve the contents of pages from a .vsdx file
        and insert them into documents, one document per page."""
        with blob.as_bytes_io() as vsdx_file_obj:
            with zipfile.ZipFile(vsdx_file_obj, "r") as zfile:
                pages = self.get_pages_content(zfile, blob.source)
        yield from [
            Document(
                page_content=page_content,
                metadata={
                    "source": blob.source,
                    "page": page_number,
                    "page_name": page_name,
                },
            )
            for page_number, page_name, page_content in pages
        ]
    def get_pages_content(
        self, zfile: zipfile.ZipFile, source: str
    ) -> List[Tuple[int, str, str]]:
        """Get the content of the pages of a vsdx file.
        Args:
            zfile (zipfile.ZipFile): The vsdx file, opened as a zip archive.
            source (str): The path of the vsdx file.
        Returns:
            list[tuple[int, str, str]]: A list of tuples containing the page number,
            the name of the page and the content of the page
            for each page of the vsdx file. The list is empty if a required
            part is missing from the archive.
        """
        try:
            import xmltodict
        except ImportError:
            raise ImportError(
                "The xmltodict library is required to parse vsdx files. "
                "Please install it with `pip install xmltodict`."
            )
        # Return an empty list (not None) so that callers can always iterate.
        if "visio/pages/pages.xml" not in zfile.namelist():
            print("WARNING - No pages.xml file found in {}".format(source))
            return []
        if "visio/pages/_rels/pages.xml.rels" not in zfile.namelist():
            print("WARNING - No pages.xml.rels file found in {}".format(source))
            return []
        if "docProps/app.xml" not in zfile.namelist():
            print("WARNING - No app.xml file found in {}".format(source))
            return []
        pagesxml_content: dict = xmltodict.parse(zfile.read("visio/pages/pages.xml"))
        appxml_content: dict = xmltodict.parse(zfile.read("docProps/app.xml"))
        pagesxmlrels_content: dict = xmltodict.parse(
            zfile.read("visio/pages/_rels/pages.xml.rels")
        )
        if isinstance(pagesxml_content["Pages"]["Page"], list):
            disordered_names: List[str] = [
                rel["@Name"].strip() for rel in pagesxml_content["Pages"]["Page"]
            ]
        else:
            disordered_names: List[str] = [
                pagesxml_content["Pages"]["Page"]["@Name"].strip()
            ]
        if isinstance(pagesxmlrels_content["Relationships"]["Relationship"], list):
            disordered_paths: List[str] = [
                "visio/pages/" + rel["@Target"]
                for rel in pagesxmlrels_content["Relationships"]["Relationship"]
            ]
        else:
            disordered_paths: List[str] = [
                "visio/pages/"
                + pagesxmlrels_content["Relationships"]["Relationship"]["@Target"]
            ]
        ordered_names: List[str] = appxml_content["Properties"]["TitlesOfParts"][
            "vt:vector"
        ]["vt:lpstr"][: len(disordered_names)]
        ordered_names = [name.strip() for name in ordered_names]
        ordered_paths = [
            disordered_paths[disordered_names.index(name.strip())]
            for name in ordered_names
        ]
        # Pages out of order and without content of their relationships
        disordered_pages = []
        for path in ordered_paths:
            content = zfile.read(path)
            string_content = json.dumps(xmltodict.parse(content))
            samples = re.findall(
                r'"#text"\s*:\s*"([^\\"]*(?:\\.[^\\"]*)*)"', string_content
            )
            if len(samples) > 0:
                page_content = "\n".join(samples)
                # Replace the JSON escape sequences left in the extracted text.
                map_symboles = {
                    "\\n": "\n",
                    "\\t": "\t",
                    "\\u2013": "-",
                    "\\u2019": "'",
                    "\\u00e9": "é",
                    "\\u00f4": "ô",
                }
                for key, value in map_symboles.items():
                    page_content = page_content.replace(key, value)
                disordered_pages.append({"page": path, "page_content": page_content})
        # Direct relationships of each page in a dict format
        pagexml_rels = [
            {
                "path": page_path,
                "content": xmltodict.parse(
                    zfile.read(f"visio/pages/_rels/{Path(page_path).stem}.xml.rels")
                ),
            }
            for page_path in ordered_paths
            if f"visio/pages/_rels/{Path(page_path).stem}.xml.rels" in zfile.namelist()
        ]
        # Pages in order and with content of their relationships (direct and indirect)
        ordered_pages: List[Tuple[int, str, str]] = []
        for page_number, (path, page_name) in enumerate(
            zip(ordered_paths, ordered_names)
        ):
            relationships = self.get_relationships(
                path, zfile, ordered_paths, pagexml_rels
            )
            page_content = "\n".join(
                [
                    page_["page_content"]
                    for page_ in disordered_pages
                    if page_["page"] in relationships
                ]
                + [
                    page_["page_content"]
                    for page_ in disordered_pages
                    if page_["page"] == path
                ]
            )
            ordered_pages.append((page_number, page_name, page_content))
        return ordered_pages
    def get_relationships(
        self,
        page: str,
        zfile: zipfile.ZipFile,
        filelist: List[str],
        pagexml_rels: List[dict],
    ) -> Set[str]:
        """Get the relationships of a page and the relationships of its relationships,
        etc... recursively.
        Pages are based on other pages (ex: background page),
        so we need to get all the relationships to get all the content of a single page.
        """
        name_path = Path(page).name
        parent_path = Path(page).parent
        rels_path = parent_path / f"_rels/{name_path}.rels"
        # Zip member names always use "/", so compare with the POSIX form.
        if rels_path.as_posix() not in zfile.namelist():
            return set()
        pagexml_rels_content = next(
            page_["content"] for page_ in pagexml_rels if page_["path"] == page
        )
        if isinstance(pagexml_rels_content["Relationships"]["Relationship"], list):
            targets = [
                rel["@Target"]
                for rel in pagexml_rels_content["Relationships"]["Relationship"]
            ]
        else:
            targets = [pagexml_rels_content["Relationships"]["Relationship"]["@Target"]]
        # Keep only the targets that are pages; zip member names always use "/".
        relationships = {
            (parent_path / target).as_posix() for target in targets
        }.intersection(filelist)
        for rel in relationships:
            relationships = relationships | self.get_relationships(
                rel, zfile, filelist, pagexml_rels
            )
        return relationships
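
Note on `get_relationships` above: it resolves indirect relationships recursively and keeps no record of the pages it has already visited, so two pages that referenced each other would presumably recurse until Python's recursion limit is hit. The sketch below shows the same transitive lookup done iteratively with a visited set. It is an illustration only: `collect_related_pages` and the `rels` mapping are hypothetical names, and a plain dict stands in for the parsed `.rels` files.

```python
from typing import Dict, List, Set


def collect_related_pages(page: str, direct_rels: Dict[str, List[str]]) -> Set[str]:
    """Return every page reachable from `page` through its relationships."""
    seen: Set[str] = set()
    stack = list(direct_rels.get(page, []))
    while stack:
        current = stack.pop()
        if current in seen or current == page:
            continue  # already visited; this also guards against cycles
        seen.add(current)
        stack.extend(direct_rels.get(current, []))
    return seen


# Hypothetical layout: page1 uses background page2, which itself uses page3.
rels = {
    "visio/pages/page1.xml": ["visio/pages/page2.xml"],
    "visio/pages/page2.xml": ["visio/pages/page3.xml"],
    "visio/pages/page3.xml": ["visio/pages/page2.xml"],  # deliberate cycle
}
related = collect_related_pages("visio/pages/page1.xml", rels)
print(sorted(related))
```
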
--- a/libs/community/langchain_community/document_loaders/vsdx.py
+++ b/libs/community/langchain_community/document_loaders/vsdx.py
@@ -0,0 +1,53 @@
 import os
 import tempfile
 from abc import ABC
 from typing import List
 from urllib.parse import urlparse
 import requests
 from langchain_community.docstore.document import Document
 from langchain_community.document_loaders.base import BaseLoader
 from langchain_community.document_loaders.blob_loaders import Blob
 from langchain_community.document_loaders.parsers import VsdxParser
 class VsdxLoader(BaseLoader, ABC):
    def __init__(self, file_path: str):
        """Initialize with file path."""
        self.file_path = file_path
        if "~" in self.file_path:
            self.file_path = os.path.expanduser(self.file_path)
        # If the file is a web path, download it to a temporary file, and use that
        if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
            r = requests.get(self.file_path)
            if r.status_code != 200:
                raise ValueError(
                    "Check the url of your file; returned status code %s"
                    % r.status_code
                )
            self.web_path = self.file_path
            self.temp_file = tempfile.NamedTemporaryFile()
            self.temp_file.write(r.content)
            # Flush so the full download is on disk before it is read by path.
            self.temp_file.flush()
            self.file_path = self.temp_file.name
        elif not os.path.isfile(self.file_path):
            raise ValueError("File path %s is not a valid file or url" % self.file_path)
        self.parser = VsdxParser()
    def __del__(self) -> None:
        if hasattr(self, "temp_file"):
            self.temp_file.close()
    @staticmethod
    def _is_valid_url(url: str) -> bool:
        """Check if the url is valid."""
        parsed = urlparse(url)
        return bool(parsed.netloc) and bool(parsed.scheme)
    def load(self) -> List[Document]:
        blob = Blob.from_path(self.file_path)
        return list(self.parser.parse(blob))
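
Note on `_is_valid_url` above: the check requires both a scheme and a network location, so only inputs that look like full web URLs take the download branch. The standalone sketch below repeats the same two-condition check outside the class to show which inputs pass; the example URLs and the `is_valid_url` name are made up for illustration.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Same check as VsdxLoader._is_valid_url: require a scheme and a host."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


candidates = [
    "https://example.com/diagram.vsdx",  # scheme and host: accepted
    "example.com/diagram.vsdx",  # no scheme: rejected
    "./example_data/fake.vsdx",  # relative path: rejected
    "file:///tmp/fake.vsdx",  # scheme but empty host: rejected
]
results = {candidate: is_valid_url(candidate) for candidate in candidates}
for candidate, accepted in results.items():
    print(f"{candidate!r}: {accepted}")
```
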
--- a/libs/community/tests/examples/fake.vsdx
+++ b/libs/community/tests/examples/fake.vsdx
--- a/libs/community/tests/unit_tests/document_loaders/parsers/test_public_api.py
+++ b/libs/community/tests/unit_tests/document_loaders/parsers/test_public_api.py
@@ -15,4 +15,5 @@ def test_parsers_public_api_correct() -> None:
        "PyMuPDFParser",
        "PyPDFium2Parser",
        "PDFPlumberParser",
        "VsdxParser",
    }
--- a/libs/community/tests/unit_tests/document_loaders/parsers/test_vsdx_parser.py
+++ b/libs/community/tests/unit_tests/document_loaders/parsers/test_vsdx_parser.py
@@ -0,0 +1,51 @@
 """Tests for the VSDX parsers."""
 from pathlib import Path
 from typing import Iterator
 import pytest
 from langchain_community.document_loaders.base import BaseBlobParser
 from langchain_community.document_loaders.blob_loaders import Blob
 from langchain_community.document_loaders.parsers import VsdxParser
 _THIS_DIR = Path(__file__).parents[3]
 _EXAMPLES_DIR = _THIS_DIR / "examples"
 # Paths to test VSDX file
 FAKE_FILE = _EXAMPLES_DIR / "fake.vsdx"
 def _assert_with_parser(parser: BaseBlobParser, splits_by_page: bool = True) -> None:
    """Standard tests to verify that the given parser works.
    Args:
        parser (BaseBlobParser): The parser to test.
        splits_by_page (bool): Whether the parser splits by page or not by default.
    """
    blob = Blob.from_path(FAKE_FILE)
    doc_generator = parser.lazy_parse(blob)
    assert isinstance(doc_generator, Iterator)
    docs = list(doc_generator)
    if splits_by_page:
        assert len(docs) == 14
    else:
        assert len(docs) == 1
    # Test is imprecise since the parsers yield different parse information depending
    # on configuration. Each parser seems to yield a slightly different result
    # for this page!
    assert "This is a title" in docs[0].page_content
    metadata = docs[0].metadata
    assert metadata["source"] == str(FAKE_FILE)
    if splits_by_page:
        assert int(metadata["page"]) == 0
@pytest.mark.requires("xmltodict")
 def test_vsdx_parser() -> None:
    """Test the VSDX parser."""
    _assert_with_parser(VsdxParser())
--- a/libs/community/tests/unit_tests/document_loaders/test_imports.py
+++ b/libs/community/tests/unit_tests/document_loaders/test_imports.py
@@ -165,6 +165,7 @@ EXPECTED_ALL = [
    "UnstructuredURLLoader",
    "UnstructuredWordDocumentLoader",
    "UnstructuredXMLLoader",
    "VsdxLoader",
    "WeatherDataLoader",
    "WebBaseLoader",
    "WhatsAppChatLoader",