mirror of
https://github.com/hwchase17/langchain.git
synced 2025-05-15 20:12:30 +00:00
**Description**: New document loader for Visio files (extension .vsdx). A [Visio file](https://fr.wikipedia.org/wiki/Microsoft_Visio) (extension .vsdx) is the file format of Microsoft Visio, a diagramming application. It stores the structure, layout, and graphical elements of a diagram, and is used to create and share visualizations in areas such as business, engineering, and computer science. A Visio file can contain multiple pages. Some pages may serve as the background for others, and backgrounds can be nested across multiple layers. This loader extracts the text of each page together with the text of its associated background pages, so the result contains all text visible on that page, similar to what an OCR algorithm would produce.
**Dependencies**: xmltodict package
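To make the description above more concrete, here is a minimal sketch of the general idea of reading text out of a .vsdx file. It is not the `VsdxLoader` implementation (which, per the dependency note, relies on xmltodict); it uses only the Python standard library and ignores background pages and layers. The helper names (`extract_page_texts`, `build_demo_vsdx`), the `visio/pages/page<N>.xml` part layout, and the namespace URI in the demo data are assumptions made for illustration.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Assumption: each page's shapes are stored in visio/pages/page<N>.xml
PAGE_PART = re.compile(r"visio/pages/page\d+\.xml")


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def extract_page_texts(vsdx) -> dict:
    """Map each page part name to the list of shape texts found in it.

    `vsdx` may be a file path or a binary file-like object.
    """
    texts = {}
    with zipfile.ZipFile(vsdx) as archive:
        for name in sorted(archive.namelist()):
            if not PAGE_PART.fullmatch(name):
                continue
            root = ET.fromstring(archive.read(name))
            # Collect the text of every <Text> element, whatever its namespace
            texts[name] = [
                "".join(elem.itertext()).strip()
                for elem in root.iter()
                if local_name(elem.tag) == "Text"
            ]
    return texts


def build_demo_vsdx() -> io.BytesIO:
    """Build a tiny in-memory archive that mimics the assumed layout."""
    page = (
        '<PageContents xmlns="http://schemas.microsoft.com/office/visio/2012/main">'
        "<Shapes><Shape ID='1'><Text>This is an arrow</Text></Shape>"
        "<Shape ID='2'><Text>This is Earth</Text></Shape></Shapes>"
        "</PageContents>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("visio/pages/page1.xml", page)
    buffer.seek(0)
    return buffer


demo_texts = extract_page_texts(build_demo_vsdx())
print(demo_texts)
```

A real loader additionally has to resolve which pages act as backgrounds for others and merge their text, which this sketch deliberately leaves out.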
487 lines
18 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vsdx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> A [Visio file](https://fr.wikipedia.org/wiki/Microsoft_Visio) (with extension .vsdx) is associated with Microsoft Visio, a diagramming application. It stores information about the structure, layout, and graphical elements of a diagram. This format facilitates the creation and sharing of visualizations in areas such as business, engineering, and computer science."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Visio file can contain multiple pages. Some pages may serve as the background for others, and backgrounds can be nested across multiple layers. This **loader** extracts the text of each page together with the text of its associated background pages, so that all text visible on a page is extracted, similar to what an OCR algorithm would do."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**WARNING** : Only Visio files with the **.vsdx** extension are compatible with this loader. Files in other formats, such as the legacy binary .vsd format, are not supported because they are not ZIP-compressed XML packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import VsdxLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"loader = VsdxLoader(file_path=\"./example_data/fake.vsdx\")\n",
"documents = loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Display loaded documents**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
|
|
"------ Page 0 ------\n",
|
|
"Title page : Summary\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Best Caption of the worl\n",
|
|
"This is an arrow\n",
|
|
"This is Earth\n",
|
|
"This is a bounded arrow\n",
|
|
"\n",
|
|
"------ Page 1 ------\n",
|
|
"Title page : Glossary\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"\n",
|
|
"------ Page 2 ------\n",
|
|
"Title page : blanket page\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"This file is a vsdx file\n",
|
|
"First text\n",
|
|
"Second text\n",
|
|
"Third text\n",
|
|
"\n",
|
|
"------ Page 3 ------\n",
|
|
"Title page : BLABLABLA\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Another RED arrow wow\n",
|
|
"Arrow with point but red\n",
|
|
"Green line\n",
|
|
"User\n",
|
|
"Captions\n",
|
|
"Red arrow magic !\n",
|
|
"Something white\n",
|
|
"Something Red\n",
|
|
"This a a completly useless diagramm, cool !!\n",
|
|
"\n",
|
|
"But this is for example !\n",
|
|
"This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
|
|
"This is a page with something...\n",
|
|
"\n",
|
|
"WAW I have learned something !\n",
|
|
"This is a page with something...\n",
|
|
"\n",
|
|
"WAW I have learned something !\n",
|
|
"\n",
|
|
"X2\n",
|
|
"\n",
|
|
"------ Page 4 ------\n",
|
|
"Title page : What a page !!\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Something white\n",
|
|
"Something Red\n",
|
|
"This a a completly useless diagramm, cool !!\n",
|
|
"\n",
|
|
"But this is for example !\n",
|
|
"This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
|
|
"Another RED arrow wow\n",
|
|
"Arrow with point but red\n",
|
|
"Green line\n",
|
|
"User\n",
|
|
"Captions\n",
|
|
"Red arrow magic !\n",
|
|
"\n",
|
|
"------ Page 5 ------\n",
|
|
"Title page : next page after previous one\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Another RED arrow wow\n",
|
|
"Arrow with point but red\n",
|
|
"Green line\n",
|
|
"User\n",
|
|
"Captions\n",
|
|
"Red arrow magic !\n",
|
|
"Something white\n",
|
|
"Something Red\n",
|
|
"This a a completly useless diagramm, cool !!\n",
|
|
"\n",
|
|
"But this is for example !\n",
|
|
"This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\n",
|
|
"\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0\\u00a0-\\u00a0incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in\n",
|
|
"\n",
|
|
"\n",
|
|
"voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa\n",
|
|
"*\n",
|
|
"\n",
|
|
"\n",
|
|
"qui officia deserunt mollit anim id est laborum.\n",
|
|
"\n",
|
|
"------ Page 6 ------\n",
|
|
"Title page : Connector Page\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Something white\n",
|
|
"Something Red\n",
|
|
"This a a completly useless diagramm, cool !!\n",
|
|
"\n",
|
|
"But this is for example !\n",
|
|
"This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
|
|
"\n",
|
|
"------ Page 7 ------\n",
|
|
"Title page : Useful ↔ Useless page\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Something white\n",
|
|
"Something Red\n",
|
|
"This a a completly useless diagramm, cool !!\n",
|
|
"\n",
|
|
"But this is for example !\n",
|
|
"This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
|
|
"Title of this document : BLABLABLA\n",
|
|
"\n",
|
|
"------ Page 8 ------\n",
|
|
"Title page : Alone page\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Black cloud\n",
|
|
"Unidirectional traffic primary path\n",
|
|
"Unidirectional traffic backup path\n",
|
|
"Encapsulation\n",
|
|
"User\n",
|
|
"Captions\n",
|
|
"Bidirectional traffic\n",
|
|
"Alone, sad\n",
|
|
"Test of another page\n",
|
|
"This is a \\\"bannier\\\"\n",
|
|
"Tests of some exotics characters :\\u00a0\\u00e3\\u00e4\\u00e5\\u0101\\u0103 \\u00fc\\u2554\\u00a0 \\u00a0\\u00bc \\u00c7 \\u25d8\\u25cb\\u2642\\u266b\\u2640\\u00ee\\u2665\n",
|
|
"This is ethernet\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
|
|
"This is an empty case\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\n",
|
|
"\\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0 \\u00a0-\\u00a0 incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in\n",
|
|
"\n",
|
|
"\n",
|
|
" voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa \n",
|
|
"*\n",
|
|
"\n",
|
|
"\n",
|
|
"qui officia deserunt mollit anim id est laborum.\n",
|
|
"\n",
|
|
"------ Page 9 ------\n",
|
|
"Title page : BG\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Best Caption of the worl\n",
|
|
"This is an arrow\n",
|
|
"This is Earth\n",
|
|
"This is a bounded arrow\n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"\n",
|
|
"------ Page 10 ------\n",
|
|
"Title page : BG + caption1\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Another RED arrow wow\n",
|
|
"Arrow with point but red\n",
|
|
"Green line\n",
|
|
"User\n",
|
|
"Captions\n",
|
|
"Red arrow magic !\n",
|
|
"Something white\n",
|
|
"Something Red\n",
|
|
"This a a completly useless diagramm, cool !!\n",
|
|
"\n",
|
|
"But this is for example !\n",
|
|
"This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
|
|
"Useful\\u2194 Useless page\\u00a0\n",
|
|
"\n",
|
|
"Tests of some exotics characters :\\u00a0\\u00e3\\u00e4\\u00e5\\u0101\\u0103 \\u00fc\\u2554\\u00a0\\u00a0\\u00bc \\u00c7 \\u25d8\\u25cb\\u2642\\u266b\\u2640\\u00ee\\u2665\n",
|
|
"\n",
|
|
"------ Page 11 ------\n",
|
|
"Title page : BG+\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"\n",
|
|
"------ Page 12 ------\n",
|
|
"Title page : BG WITH CONTENT\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
|
|
"\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
|
|
"\n",
|
|
"\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. - Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
|
|
"\n",
|
|
"\n",
|
|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
|
|
"This is a page with a lot of text\n",
|
|
"\n",
|
|
"------ Page 13 ------\n",
|
|
"Title page : 2nd caption with ____________________________________________________________________ content\n",
|
|
"Source : ./example_data/fake.vsdx\n",
|
|
"\n",
|
|
"==> CONTENT <== \n",
|
|
"Created by\n",
|
|
"Created the\n",
|
|
"Modified by\n",
|
|
"Modified the\n",
|
|
"Version\n",
|
|
"Title\n",
|
|
"Florian MOREL\n",
|
|
"2024-01-14\n",
|
|
"FLORIAN Morel\n",
|
|
"Today\n",
|
|
"0.0.0.0.0.1\n",
|
|
"This is a title\n",
|
|
"Another RED arrow wow\n",
|
|
"Arrow with point but red\n",
|
|
"Green line\n",
|
|
"User\n",
|
|
"Captions\n",
|
|
"Red arrow magic !\n",
|
|
"Something white\n",
|
|
"Something Red\n",
|
|
"This a a completly useless diagramm, cool !!\n",
|
|
"\n",
|
|
"But this is for example !\n",
|
|
"This diagramm is a base of many pages in this file. But it is editable in file \\\"BG WITH CONTENT\\\"\n",
|
|
"Only connectors on this page. This is the CoNNeCtor page\n"
]
}
],
"source": [
"for doc in documents:\n",
"    print(f\"\\n------ Page {doc.metadata['page']} ------\")\n",
"    print(f\"Title page : {doc.metadata['page_name']}\")\n",
"    print(f\"Source : {doc.metadata['source']}\")\n",
"    print(\"\\n==> CONTENT <== \")\n",
"    print(doc.page_content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}