mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-03 19:57:51 +00:00
langchain[minor]: Updated DocugamiLoader, includes breaking changes (#13265)
This PR makes the following main changes:

1. Rewrite of the DocugamiLoader to not do any XML parsing of the DGML format internally, and instead use the `dgml-utils` library we are separately working on. This is a very lightweight dependency.
2. Added MMR search type as an option to multi-vector retriever, similar to other retrievers. MMR is especially useful when using Docugami for RAG since we deal with large sets of documents within which a few might be duplicates, and straight similarity-based search doesn't give great results in many cases.

We are @docugami on Twitter, and I am @tjaffri

---------

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
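
To illustrate why the MMR (maximal marginal relevance) search type described above helps when a document set contains near-duplicates, here is a minimal, self-contained sketch of greedy MMR selection. It is not the LangChain or Docugami implementation: the function names `cosine` and `mmr_select`, the `lambda_mult` parameter name, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Hypothetical helper: greedily pick k candidate indices by MMR.

    Each step scores a candidate as
        lambda_mult * sim(query, candidate)
        - (1 - lambda_mult) * max sim(candidate, already selected)
    so lambda_mult=1.0 reduces to plain similarity ranking.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (assumed values): index 1 is an exact duplicate of index 0.
query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],  # most relevant chunk
    [1.0, 0.0],  # duplicate of the chunk above
    [0.6, 0.8],  # less similar, but adds new information
]

diverse = mmr_select(query, candidates, k=2, lambda_mult=0.3)
similarity_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print("MMR picks:", diverse)
print("Similarity-only picks:", similarity_only)
```

With a diversity-leaning `lambda_mult`, the second pick skips the duplicate in favor of the distinct chunk, whereas pure similarity ranking returns both copies of the same text.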
This commit is contained in:
parent
a20e8f8bb0
commit
144710ad9a
@ -21,8 +21,8 @@
|
|||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# You need the lxml package to use the DocugamiLoader (run pip install directly without \"poetry run\" if you are not using poetry)\n",
|
"# You need the dgml-utils package to use the DocugamiLoader (run pip install directly without \"poetry run\" if you are not using poetry)\n",
|
||||||
"!poetry run pip install lxml --quiet"
|
"!poetry run pip install dgml-utils==0.3.0 --upgrade --quiet"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -43,8 +43,8 @@
|
|||||||
"Appropriate chunking of your documents is critical for retrieval from documents. Many chunking techniques exist, including simple ones that rely on whitespace and recursive chunk splitting based on character length. Docugami offers a different approach:\n",
|
"Appropriate chunking of your documents is critical for retrieval from documents. Many chunking techniques exist, including simple ones that rely on whitespace and recursive chunk splitting based on character length. Docugami offers a different approach:\n",
|
||||||
"\n",
|
"\n",
|
||||||
"1. **Intelligent Chunking:** Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary length or simple whitespace-based chunking.\n",
|
"1. **Intelligent Chunking:** Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary length or simple whitespace-based chunking.\n",
|
||||||
"2. **Structured Representation:** In addition, the XML tree indicates the structural contours of every document, using attributes denoting headings, paragraphs, lists, tables, and other common elements, and does that consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction.\n",
|
"2. **Semantic Annotations:** Chunks are annotated with semantic tags that are coherent across the document set, facilitating consistent hierarchical queries across multiple documents, even if they are written and formatted differently. For example, in a set of lease agreements, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause.\n",
|
||||||
"3. **Semantic Annotations:** Chunks are annotated with semantic tags that are coherent across the document set, facilitating consistent hierarchical queries across multiple documents, even if they are written and formatted differently. For example, in a set of lease agreements, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause.\n",
|
"3. **Structured Representation:** In addition, the XML tree indicates the structural contours of every document, using attributes denoting headings, paragraphs, lists, tables, and other common elements, and does that consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction.\n",
|
||||||
"4. **Additional Metadata:** Chunks are also annotated with additional metadata, if a user has been using Docugami. This additional metadata can be used for high-accuracy Document QA without context window restrictions. See detailed code walk-through below.\n"
|
"4. **Additional Metadata:** Chunks are also annotated with additional metadata, if a user has been using Docugami. This additional metadata can be used for high-accuracy Document QA without context window restrictions. See detailed code walk-through below.\n"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@ -65,52 +65,42 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"## Load Documents\n",
|
"## Load Documents\n",
|
||||||
"\n",
|
"\n",
|
||||||
"If the DOCUGAMI_API_KEY environment variable is set, there is no need to pass it in to the loader explicitly; otherwise, you can pass it in as the `access_token` parameter.\n",
|
"If the DOCUGAMI_API_KEY environment variable is set, there is no need to pass it in to the loader explicitly; otherwise, you can pass it in as the `access_token` parameter."
|
||||||
"\n",
|
|
||||||
"The DocugamiLoader has a default minimum chunk size of 32. Chunks smaller than that are appended to subsequent chunks. Set min_chunk_size to 0 to get all structural chunks regardless of size."
|
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 3,
|
"execution_count": 3,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"DOCUGAMI_API_KEY = os.environ.get(\"DOCUGAMI_API_KEY\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 4,
|
||||||
|
"metadata": {},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"[Document(page_content='MUTUAL NON-DISCLOSURE AGREEMENT This Mutual Non-Disclosure Agreement (this “ Agreement ”) is entered into and made effective as of April 4 , 2018 between Docugami Inc. , a Delaware corporation , whose address is 150 Lake Street South , Suite 221 , Kirkland , Washington 98033 , and Caleb Divine , an individual, whose address is 1201 Rt 300 , Newburgh NY 12550 .', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:ThisMutualNon-disclosureAgreement', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'ThisMutualNon-disclosureAgreement'}),\n",
|
"120"
|
||||||
" Document(page_content='The above named parties desire to engage in discussions regarding a potential agreement or other transaction between the parties (the “Purpose”). In connection with such discussions, it may be necessary for the parties to disclose to each other certain confidential information or materials to enable them to evaluate whether to enter into such agreement or transaction.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Discussions', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'Discussions'}),\n",
|
|
||||||
" Document(page_content='In consideration of the foregoing, the parties agree as follows:', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Consideration', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'Consideration'}),\n",
|
|
||||||
" Document(page_content='1. Confidential Information . For purposes of this Agreement , “ Confidential Information ” means any information or materials disclosed by one party to the other party that: (i) if disclosed in writing or in the form of tangible materials, is marked “confidential” or “proprietary” at the time of such disclosure; (ii) if disclosed orally or by visual presentation, is identified as “confidential” or “proprietary” at the time of such disclosure, and is summarized in a writing sent by the disclosing party to the receiving party within thirty ( 30 ) days after any such disclosure; or (iii) due to its nature or the circumstances of its disclosure, a person exercising reasonable business judgment would understand to be confidential or proprietary.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:Purposes/docset:ConfidentialInformation-section/docset:ConfidentialInformation[2]', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'ConfidentialInformation'}),\n",
|
|
||||||
" Document(page_content=\"2. Obligations and Restrictions . Each party agrees: (i) to maintain the other party's Confidential Information in strict confidence; (ii) not to disclose such Confidential Information to any third party; and (iii) not to use such Confidential Information for any purpose except for the Purpose. Each party may disclose the other party’s Confidential Information to its employees and consultants who have a bona fide need to know such Confidential Information for the Purpose, but solely to the extent necessary to pursue the Purpose and for no other purpose; provided, that each such employee and consultant first executes a written agreement (or is otherwise already bound by a written agreement) that contains use and nondisclosure restrictions at least as protective of the other party’s Confidential Information as those set forth in this Agreement .\", metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:Obligations/docset:ObligationsAndRestrictions-section/docset:ObligationsAndRestrictions', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'ObligationsAndRestrictions'}),\n",
|
|
||||||
" Document(page_content='3. Exceptions. The obligations and restrictions in Section 2 will not apply to any information or materials that:', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:Exceptions/docset:Exceptions-section/docset:Exceptions[2]', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'Exceptions'}),\n",
|
|
||||||
" Document(page_content='(i) were, at the date of disclosure, or have subsequently become, generally known or available to the public through no act or failure to act by the receiving party;', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:TheDate/docset:TheDate/docset:TheDate', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'TheDate'}),\n",
|
|
||||||
" Document(page_content='(ii) were rightfully known by the receiving party prior to receiving such information or materials from the disclosing party;', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:TheDate/docset:SuchInformation/docset:TheReceivingParty', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'TheReceivingParty'}),\n",
|
|
||||||
" Document(page_content='(iii) are rightfully acquired by the receiving party from a third party who has the right to disclose such information or materials without breach of any confidentiality obligation to the disclosing party;', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:TheDate/docset:TheReceivingParty/docset:TheReceivingParty', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'TheReceivingParty'}),\n",
|
|
||||||
" Document(page_content='4. Compelled Disclosure . Nothing in this Agreement will be deemed to restrict a party from disclosing the other party’s Confidential Information to the extent required by any order, subpoena, law, statute or regulation; provided, that the party required to make such a disclosure uses reasonable efforts to give the other party reasonable advance notice of such required disclosure in order to enable the other party to prevent or limit such disclosure.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:Disclosure/docset:CompelledDisclosure-section/docset:CompelledDisclosure', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'CompelledDisclosure'}),\n",
|
|
||||||
" Document(page_content='5. Return of Confidential Information . Upon the completion or abandonment of the Purpose, and in any event upon the disclosing party’s request, the receiving party will promptly return to the disclosing party all tangible items and embodiments containing or consisting of the disclosing party’s Confidential Information and all copies thereof (including electronic copies), and any notes, analyses, compilations, studies, interpretations, memoranda or other documents (regardless of the form thereof) prepared by or on behalf of the receiving party that contain or are based upon the disclosing party’s Confidential Information .', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:TheCompletion/docset:ReturnofConfidentialInformation-section/docset:ReturnofConfidentialInformation', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'ReturnofConfidentialInformation'}),\n",
|
|
||||||
" Document(page_content='6. No Obligations . Each party retains the right to determine whether to disclose any Confidential Information to the other party.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:NoObligations/docset:NoObligations-section/docset:NoObligations[2]', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'NoObligations'}),\n",
|
|
||||||
" Document(page_content='7. No Warranty. ALL CONFIDENTIAL INFORMATION IS PROVIDED BY THE DISCLOSING PARTY “AS IS ”.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:NoWarranty/docset:NoWarranty-section/docset:NoWarranty[2]', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'NoWarranty'}),\n",
|
|
||||||
" Document(page_content='8. Term. This Agreement will remain in effect for a period of seven ( 7 ) years from the date of last disclosure of Confidential Information by either party, at which time it will terminate.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:ThisAgreement/docset:Term-section/docset:Term', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'Term'}),\n",
|
|
||||||
" Document(page_content='9. Equitable Relief . Each party acknowledges that the unauthorized use or disclosure of the disclosing party’s Confidential Information may cause the disclosing party to incur irreparable harm and significant damages, the degree of which may be difficult to ascertain. Accordingly, each party agrees that the disclosing party will have the right to seek immediate equitable relief to enjoin any unauthorized use or disclosure of its Confidential Information , in addition to any other rights and remedies that it may have at law or otherwise.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:EquitableRelief/docset:EquitableRelief-section/docset:EquitableRelief[2]', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'EquitableRelief'}),\n",
|
|
||||||
" Document(page_content='10. Non-compete. To the maximum extent permitted by applicable law, during the Term of this Agreement and for a period of one ( 1 ) year thereafter, Caleb Divine may not market software products or do business that directly or indirectly competes with Docugami software products .', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:TheMaximumExtent/docset:Non-compete-section/docset:Non-compete', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'Non-compete'}),\n",
|
|
||||||
" Document(page_content='11. Miscellaneous. This Agreement will be governed and construed in accordance with the laws of the State of Washington , excluding its body of law controlling conflict of laws. This Agreement is the complete and exclusive understanding and agreement between the parties regarding the subject matter of this Agreement and supersedes all prior agreements, understandings and communications, oral or written, between the parties regarding the subject matter of this Agreement . If any provision of this Agreement is held invalid or unenforceable by a court of competent jurisdiction, that provision of this Agreement will be enforced to the maximum extent permissible and the other provisions of this Agreement will remain in full force and effect. Neither party may assign this Agreement , in whole or in part, by operation of law or otherwise, without the other party’s prior written consent, and any attempted assignment without such consent will be void. This Agreement may be executed in counterparts, each of which will be deemed an original, but all of which together will constitute one and the same instrument.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:Consideration/docset:Purposes/docset:Accordance/docset:Miscellaneous-section/docset:Miscellaneous', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'div', 'tag': 'Miscellaneous'}),\n",
|
|
||||||
" Document(page_content='[SIGNATURE PAGE FOLLOWS] IN WITNESS WHEREOF, the parties hereto have executed this Mutual Non-Disclosure Agreement by their duly authorized officers or representatives as of the date first set forth above.', metadata={'xpath': '/docset:MutualNon-disclosure/docset:Witness/docset:TheParties/docset:TheParties', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'TheParties'}),\n",
|
|
||||||
" Document(page_content='DOCUGAMI INC . : \\n\\n Caleb Divine : \\n\\n Signature: Signature: Name: \\n\\n Jean Paoli Name: Title: \\n\\n CEO Title:', metadata={'xpath': '/docset:MutualNon-disclosure/docset:Witness/docset:TheParties/docset:DocugamiInc/docset:DocugamiInc/xhtml:table', 'id': '43rj0ds7s0ur', 'source': 'NDA simple layout.docx', 'structure': '', 'tag': 'table'})]"
|
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 3,
|
"execution_count": 4,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"output_type": "execute_result"
|
"output_type": "execute_result"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"source": [
|
"source": [
|
||||||
"DOCUGAMI_API_KEY = os.environ.get(\"DOCUGAMI_API_KEY\")\n",
|
"docset_id = \"26xpy3aes7xp\"\n",
|
||||||
|
"document_ids = [\"d7jqdzcj50sj\", \"cgd1eacfkchw\"]\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# To load all docs in the given docset ID, just don't provide document_ids\n",
|
"# To load all docs in the given docset ID, just don't provide document_ids\n",
|
||||||
"loader = DocugamiLoader(docset_id=\"ecxqpipcoe2p\", document_ids=[\"43rj0ds7s0ur\"])\n",
|
"loader = DocugamiLoader(docset_id=docset_id, document_ids=document_ids)\n",
|
||||||
"docs = loader.load()\n",
|
"chunks = loader.load()\n",
|
||||||
"docs"
|
"len(chunks)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -122,7 +112,39 @@
|
|||||||
"1. **id and source:** ID and Name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.\n",
|
"1. **id and source:** ID and Name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.\n",
|
||||||
"2. **xpath:** XPath inside the XML representation of the document, for the chunk. Useful for source citations directly to the actual chunk inside the document XML.\n",
|
"2. **xpath:** XPath inside the XML representation of the document, for the chunk. Useful for source citations directly to the actual chunk inside the document XML.\n",
|
||||||
"3. **structure:** Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.\n",
|
"3. **structure:** Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.\n",
|
||||||
"4. **tag:** Semantic tag for the chunk, using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks"
|
"4. **tag:** Semantic tag for the chunk, using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks\n",
|
||||||
|
"\n",
|
||||||
|
"You can control chunking behavior by setting the following properties on the `DocugamiLoader` instance:\n",
|
||||||
|
"\n",
|
||||||
|
"1. You can set min and max chunk size, which the system tries to adhere to with minimal truncation. You can set `loader.min_text_length` and `loader.max_text_length` to control these.\n",
|
||||||
|
"2. By default, only the text for chunks is returned. However, Docugami's XML knowledge graph has additional rich information including semantic tags for entities inside the chunk. Set `loader.include_xml_tags = True` if you want the additional xml metadata on the returned chunks.\n",
|
||||||
|
"3. In addition, you can set `loader.parent_hierarchy_levels` if you want Docugami to return parent chunks in the chunks it returns. The child chunks point to the parent chunks via the `loader.parent_id_key` value. This is useful e.g. with the [MultiVector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) for [small-to-big](https://www.youtube.com/watch?v=ihSiRrOUwmg) retrieval. See detailed example later in this notebook."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 5,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"page_content='MASTER SERVICES AGREEMENT\\n <ThisServicesAgreement> This Services Agreement (the “Agreement”) sets forth terms under which <Company>MagicSoft, Inc. </Company>a <Org><USState>Washington </USState>Corporation </Org>(“Company”) located at <CompanyAddress><CompanyStreetAddress><Company>600 </Company><Company>4th Ave</Company></CompanyStreetAddress>, <Company>Seattle</Company>, <Client>WA </Client><ProvideServices>98104 </ProvideServices></CompanyAddress>shall provide services to <Client>Daltech, Inc.</Client>, a <Company><USState>Washington </USState>Corporation </Company>(the “Client”) located at <ClientAddress><ClientStreetAddress><Client>701 </Client><Client>1st St</Client></ClientStreetAddress>, <Client>Kirkland</Client>, <State>WA </State><Client>98033</Client></ClientAddress>. This Agreement is effective as of <EffectiveDate>February 15, 2021 </EffectiveDate>(“Effective Date”). </ThisServicesAgreement>' metadata={'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-section/dg:chunk', 'id': 'c28554d0af5114e2b102e6fc4dcbbde5', 'name': 'Master Services Agreement - Daltech.docx', 'source': 'Master Services Agreement - Daltech.docx', 'structure': 'h1 p', 'tag': 'chunk ThisServicesAgreement', 'Liability': '', 'Workers Compensation Insurance': '$1,000,000', 'Limit': '$1,000,000', 'Commercial General Liability Insurance': '$2,000,000', 'Technology Professional Liability Errors Omissions Policy': '$5,000,000', 'Excess Liability Umbrella Coverage': '$9,000,000', 'Client': 'Daltech, Inc.', 'Services Agreement Date': 'INITIAL STATEMENT OF WORK (SOW) The purpose of this SOW is to describe the Software and Services that Company will initially provide to Daltech, Inc. 
the “Client”) under the terms and conditions of the Services Agreement entered into between the parties on June 15, 2021', 'Completion of the Services by Company Date': 'February 15, 2022', 'Charge': 'one hundred percent (100%)', 'Company': 'MagicSoft, Inc.', 'Effective Date': 'February 15, 2021', 'Start Date': '03/15/2021', 'Scheduled Onsite Visits Are Cancelled': 'ten (10) working days', 'Limit on Liability': '', 'Liability Cap': '', 'Business Automobile Liability': 'Business Automobile Liability covering all vehicles that Company owns, hires or leases with a limit of no less than $1,000,000 (combined single limit for bodily injury and property damage) for each accident.', 'Contractual Liability Coverage': 'Commercial General Liability insurance including Contractual Liability Coverage , with coverage for products liability, completed operations, property damage and bodily injury, including death , with an aggregate limit of no less than $2,000,000 . This policy shall name Client as an additional insured with respect to the provision of services provided under this Agreement. This policy shall include a waiver of subrogation against Client.', 'Technology Professional Liability Errors Omissions': 'Technology Professional Liability Errors & Omissions policy (which includes Cyber Risk coverage and Computer Security and Privacy Liability coverage) with a limit of no less than $5,000,000 per occurrence and in the aggregate.'}\n",
|
||||||
|
"page_content='A. STANDARD SOFTWARE AND SERVICES AGREEMENT\\n 1. Deliverables.\\n Company shall provide Client with software, technical support, product management, development, and <_testRef>testing </_testRef>services (“Services”) to the Client as described on one or more Statements of Work signed by Company and Client that reference this Agreement (“SOW” or “Statement of Work”). Company shall perform Services in a prompt manner and have the final product or service (“Deliverable”) ready for Client no later than the due date specified in the applicable SOW (“Completion Date”). This due date is subject to change in accordance with the Change Order process defined in the applicable SOW. Client shall assist Company by promptly providing all information requests known or available and relevant to the Services in a timely manner.' metadata={'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-section/docset:MASTERSERVICESAGREEMENT/dg:chunk[1]/docset:Standard/dg:chunk[1]/dg:chunk[1]', 'id': 'de60160d328df10fa2637637c803d2d4', 'name': 'Master Services Agreement - Daltech.docx', 'source': 'Master Services Agreement - Daltech.docx', 'structure': 'lim h1 lim h1 div', 'tag': 'chunk', 'Liability': '', 'Workers Compensation Insurance': '$1,000,000', 'Limit': '$1,000,000', 'Commercial General Liability Insurance': '$2,000,000', 'Technology Professional Liability Errors Omissions Policy': '$5,000,000', 'Excess Liability Umbrella Coverage': '$9,000,000', 'Client': 'Daltech, Inc.', 'Services Agreement Date': 'INITIAL STATEMENT OF WORK (SOW) The purpose of this SOW is to describe the Software and Services that Company will initially provide to Daltech, Inc. 
the “Client”) under the terms and conditions of the Services Agreement entered into between the parties on June 15, 2021', 'Completion of the Services by Company Date': 'February 15, 2022', 'Charge': 'one hundred percent (100%)', 'Company': 'MagicSoft, Inc.', 'Effective Date': 'February 15, 2021', 'Start Date': '03/15/2021', 'Scheduled Onsite Visits Are Cancelled': 'ten (10) working days', 'Limit on Liability': '', 'Liability Cap': '', 'Business Automobile Liability': 'Business Automobile Liability covering all vehicles that Company owns, hires or leases with a limit of no less than $1,000,000 (combined single limit for bodily injury and property damage) for each accident.', 'Contractual Liability Coverage': 'Commercial General Liability insurance including Contractual Liability Coverage , with coverage for products liability, completed operations, property damage and bodily injury, including death , with an aggregate limit of no less than $2,000,000 . This policy shall name Client as an additional insured with respect to the provision of services provided under this Agreement. This policy shall include a waiver of subrogation against Client.', 'Technology Professional Liability Errors Omissions': 'Technology Professional Liability Errors & Omissions policy (which includes Cyber Risk coverage and Computer Security and Privacy Liability coverage) with a limit of no less than $5,000,000 per occurrence and in the aggregate.'}\n",
|
||||||
|
"page_content='2. Onsite Services.\\n 2.1 Onsite visits will be charged on a <Frequency>daily </Frequency>basis (minimum <OnsiteVisits>8 hours</OnsiteVisits>).' metadata={'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-section/docset:MASTERSERVICESAGREEMENT/dg:chunk[1]/docset:Standard/dg:chunk[3]/dg:chunk[1]', 'id': 'db18315b437ac2de6b555d2d8ef8f893', 'name': 'Master Services Agreement - Daltech.docx', 'source': 'Master Services Agreement - Daltech.docx', 'structure': 'lim h1 lim p', 'tag': 'chunk', 'Liability': '', 'Workers Compensation Insurance': '$1,000,000', 'Limit': '$1,000,000', 'Commercial General Liability Insurance': '$2,000,000', 'Technology Professional Liability Errors Omissions Policy': '$5,000,000', 'Excess Liability Umbrella Coverage': '$9,000,000', 'Client': 'Daltech, Inc.', 'Services Agreement Date': 'INITIAL STATEMENT OF WORK (SOW) The purpose of this SOW is to describe the Software and Services that Company will initially provide to Daltech, Inc. the “Client”) under the terms and conditions of the Services Agreement entered into between the parties on June 15, 2021', 'Completion of the Services by Company Date': 'February 15, 2022', 'Charge': 'one hundred percent (100%)', 'Company': 'MagicSoft, Inc.', 'Effective Date': 'February 15, 2021', 'Start Date': '03/15/2021', 'Scheduled Onsite Visits Are Cancelled': 'ten (10) working days', 'Limit on Liability': '', 'Liability Cap': '', 'Business Automobile Liability': 'Business Automobile Liability covering all vehicles that Company owns, hires or leases with a limit of no less than $1,000,000 (combined single limit for bodily injury and property damage) for each accident.', 'Contractual Liability Coverage': 'Commercial General Liability insurance including Contractual Liability Coverage , with coverage for products liability, completed operations, property damage and bodily injury, including death , with an aggregate limit of no less than $2,000,000 . 
This policy shall name Client as an additional insured with respect to the provision of services provided under this Agreement. This policy shall include a waiver of subrogation against Client.', 'Technology Professional Liability Errors Omissions': 'Technology Professional Liability Errors & Omissions policy (which includes Cyber Risk coverage and Computer Security and Privacy Liability coverage) with a limit of no less than $5,000,000 per occurrence and in the aggregate.'}\n",
|
||||||
|
"page_content='2.2 <Expenses>Time and expenses will be charged based on actuals unless otherwise described in an Order Form or accompanying SOW. </Expenses>' metadata={'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-section/docset:MASTERSERVICESAGREEMENT/dg:chunk[1]/docset:Standard/dg:chunk[3]/dg:chunk[2]/docset:ADailyBasis/dg:chunk[2]/dg:chunk', 'id': '506220fa472d5c48c8ee3db78c1122c1', 'name': 'Master Services Agreement - Daltech.docx', 'source': 'Master Services Agreement - Daltech.docx', 'structure': 'lim p', 'tag': 'chunk Expenses', 'Liability': '', 'Workers Compensation Insurance': '$1,000,000', 'Limit': '$1,000,000', 'Commercial General Liability Insurance': '$2,000,000', 'Technology Professional Liability Errors Omissions Policy': '$5,000,000', 'Excess Liability Umbrella Coverage': '$9,000,000', 'Client': 'Daltech, Inc.', 'Services Agreement Date': 'INITIAL STATEMENT OF WORK (SOW) The purpose of this SOW is to describe the Software and Services that Company will initially provide to Daltech, Inc. the “Client”) under the terms and conditions of the Services Agreement entered into between the parties on June 15, 2021', 'Completion of the Services by Company Date': 'February 15, 2022', 'Charge': 'one hundred percent (100%)', 'Company': 'MagicSoft, Inc.', 'Effective Date': 'February 15, 2021', 'Start Date': '03/15/2021', 'Scheduled Onsite Visits Are Cancelled': 'ten (10) working days', 'Limit on Liability': '', 'Liability Cap': '', 'Business Automobile Liability': 'Business Automobile Liability covering all vehicles that Company owns, hires or leases with a limit of no less than $1,000,000 (combined single limit for bodily injury and property damage) for each accident.', 'Contractual Liability Coverage': 'Commercial General Liability insurance including Contractual Liability Coverage , with coverage for products liability, completed operations, property damage and bodily injury, including death , with an aggregate limit of no less than $2,000,000 . 
This policy shall name Client as an additional insured with respect to the provision of services provided under this Agreement. This policy shall include a waiver of subrogation against Client.', 'Technology Professional Liability Errors Omissions': 'Technology Professional Liability Errors & Omissions policy (which includes Cyber Risk coverage and Computer Security and Privacy Liability coverage) with a limit of no less than $5,000,000 per occurrence and in the aggregate.'}\n",
|
||||||
|
"page_content='2.3 <RegularWorkingHours>All work will be executed during regular working hours <RegularWorkingHours>Monday</RegularWorkingHours>-<Weekday>Friday </Weekday><RegularWorkingHours><RegularWorkingHours>0800</RegularWorkingHours>-<Number>1900</Number></RegularWorkingHours>. For work outside of these hours on weekdays, Company will charge <Charge>one hundred percent (100%) </Charge>of the regular hourly rate and <Charge>two hundred percent (200%) </Charge>for Saturdays, Sundays and public holidays applicable to Company. </RegularWorkingHours>' metadata={'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-section/docset:MASTERSERVICESAGREEMENT/dg:chunk[1]/docset:Standard/dg:chunk[3]/dg:chunk[2]/docset:ADailyBasis/dg:chunk[3]/dg:chunk', 'id': 'dac7a3ded61b5c4f3e59771243ea46c1', 'name': 'Master Services Agreement - Daltech.docx', 'source': 'Master Services Agreement - Daltech.docx', 'structure': 'lim p', 'tag': 'chunk RegularWorkingHours', 'Liability': '', 'Workers Compensation Insurance': '$1,000,000', 'Limit': '$1,000,000', 'Commercial General Liability Insurance': '$2,000,000', 'Technology Professional Liability Errors Omissions Policy': '$5,000,000', 'Excess Liability Umbrella Coverage': '$9,000,000', 'Client': 'Daltech, Inc.', 'Services Agreement Date': 'INITIAL STATEMENT OF WORK (SOW) The purpose of this SOW is to describe the Software and Services that Company will initially provide to Daltech, Inc. 
the “Client”) under the terms and conditions of the Services Agreement entered into between the parties on June 15, 2021', 'Completion of the Services by Company Date': 'February 15, 2022', 'Charge': 'one hundred percent (100%)', 'Company': 'MagicSoft, Inc.', 'Effective Date': 'February 15, 2021', 'Start Date': '03/15/2021', 'Scheduled Onsite Visits Are Cancelled': 'ten (10) working days', 'Limit on Liability': '', 'Liability Cap': '', 'Business Automobile Liability': 'Business Automobile Liability covering all vehicles that Company owns, hires or leases with a limit of no less than $1,000,000 (combined single limit for bodily injury and property damage) for each accident.', 'Contractual Liability Coverage': 'Commercial General Liability insurance including Contractual Liability Coverage , with coverage for products liability, completed operations, property damage and bodily injury, including death , with an aggregate limit of no less than $2,000,000 . This policy shall name Client as an additional insured with respect to the provision of services provided under this Agreement. This policy shall include a waiver of subrogation against Client.', 'Technology Professional Liability Errors Omissions': 'Technology Professional Liability Errors & Omissions policy (which includes Cyber Risk coverage and Computer Security and Privacy Liability coverage) with a limit of no less than $5,000,000 per occurrence and in the aggregate.'}\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"loader.min_text_length = 64\n",
|
||||||
|
"loader.include_xml_tags = True\n",
|
||||||
|
"chunks = loader.load()\n",
|
||||||
|
"\n",
|
||||||
|
"for chunk in chunks[:5]:\n",
|
||||||
|
" print(chunk)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
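With `include_xml_tags` enabled, the printed chunks carry inline semantic tags such as `<Frequency>` inside `page_content`. A minimal sketch of recovering plain text for display follows. Assumptions: `strip_inline_tags` is a hypothetical helper, not part of `DocugamiLoader` or `dgml-utils`, and the regex only covers simple element tags without attributes.

```python
import re

# Matches simple opening/closing element tags such as <Charge> or </Charge>.
# Assumption: tags carry no attributes, as in the sample output above.
_TAG_RE = re.compile(r"</?[A-Za-z_][\w:.-]*\s*/?>")


def strip_inline_tags(text: str) -> str:
    """Remove inline XML-style tags, keeping the text between them."""
    return _TAG_RE.sub("", text)


# Sample text copied from the chunk output shown in this notebook.
sample = (
    "2.1 Onsite visits will be charged on a <Frequency>daily </Frequency>"
    "basis (minimum <OnsiteVisits>8 hours</OnsiteVisits>)."
)
plain = strip_inline_tags(sample)
print(plain)
```

For anything beyond display, a real XML parser is likely a safer choice than a regex, since nested or attributed tags are not handled here.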
{
|
{
|
||||||
@ -136,27 +158,41 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 4,
|
"execution_count": 6,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"!poetry run pip -q install openai tiktoken chromadb"
|
"!poetry run pip install --upgrade openai tiktoken chromadb hnswlib --quiet"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 5,
|
"execution_count": 7,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"4674\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
"source": [
|
"source": [
|
||||||
"from langchain.chains import RetrievalQA\n",
|
|
||||||
"from langchain.embeddings import OpenAIEmbeddings\n",
|
|
||||||
"from langchain.llms import OpenAI\n",
|
|
||||||
"from langchain.vectorstores import Chroma\n",
|
|
||||||
"\n",
|
|
||||||
"# For this example, we already have a processed docset for a set of lease documents\n",
|
"# For this example, we already have a processed docset for a set of lease documents\n",
|
||||||
"loader = DocugamiLoader(docset_id=\"wh2kned25uqm\")\n",
|
"loader = DocugamiLoader(docset_id=\"zo954yqy53wp\")\n",
|
||||||
"documents = loader.load()"
|
"chunks = loader.load()\n",
|
||||||
|
"\n",
|
||||||
|
"# strip semantic metadata intentionally, to test how things work without semantic metadata\n",
|
||||||
|
"for chunk in chunks:\n",
|
||||||
|
" stripped_metadata = chunk.metadata.copy()\n",
|
||||||
|
" for key in chunk.metadata:\n",
|
||||||
|
" if key not in [\"name\", \"xpath\", \"id\", \"structure\"]:\n",
|
||||||
|
" # remove semantic metadata\n",
|
||||||
|
" del stripped_metadata[key]\n",
|
||||||
|
" chunk.metadata = stripped_metadata\n",
|
||||||
|
"\n",
|
||||||
|
"print(len(chunks))"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
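The metadata-stripping loop in the cell above can also be written as a single dictionary comprehension. The sketch below is only an illustration under stated assumptions: `Chunk` is a hypothetical stand-in for the loaded document objects so the example runs without LangChain, and `strip_semantic_metadata` is a name invented here.

```python
from dataclasses import dataclass, field

# Keys kept by the notebook cell; everything else is treated as semantic metadata.
STRUCTURAL_KEYS = {"name", "xpath", "id", "structure"}


@dataclass
class Chunk:
    """Hypothetical stand-in for a loaded chunk (page_content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def strip_semantic_metadata(chunk: Chunk) -> None:
    """Replace chunk.metadata with a copy holding only the structural keys."""
    chunk.metadata = {
        key: value for key, value in chunk.metadata.items() if key in STRUCTURAL_KEYS
    }


example = Chunk(
    page_content="1.6 Rentable Area of the Premises.",
    metadata={
        "id": "5b39a1ae84d51682328dca1467be211f",
        "name": "Sample Commercial Leases/Shorebucks LLC_AZ.pdf",
        "structure": "lim h1",
        "Landlord": "Menlo Group",
        "Tenant": "Shorebucks LLC",
    },
)
strip_semantic_metadata(example)
print(sorted(example.metadata))
```

Building a new dictionary avoids deleting keys while iterating, which is why the notebook cell works on a copy. Note that this whitelist also drops `source` and `tag`, matching the cell above.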
{
|
{
|
||||||
@ -170,12 +206,17 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 6,
|
"execution_count": 8,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
|
"from langchain.chains import RetrievalQA\n",
|
||||||
|
"from langchain.embeddings import OpenAIEmbeddings\n",
|
||||||
|
"from langchain.llms.openai import OpenAI\n",
|
||||||
|
"from langchain.vectorstores.chroma import Chroma\n",
|
||||||
|
"\n",
|
||||||
"embedding = OpenAIEmbeddings()\n",
|
"embedding = OpenAIEmbeddings()\n",
|
||||||
"vectordb = Chroma.from_documents(documents=documents, embedding=embedding)\n",
|
"vectordb = Chroma.from_documents(documents=chunks, embedding=embedding)\n",
|
||||||
"retriever = vectordb.as_retriever()\n",
|
"retriever = vectordb.as_retriever()\n",
|
||||||
"qa_chain = RetrievalQA.from_chain_type(\n",
|
"qa_chain = RetrievalQA.from_chain_type(\n",
|
||||||
" llm=OpenAI(), chain_type=\"stuff\", retriever=retriever, return_source_documents=True\n",
|
" llm=OpenAI(), chain_type=\"stuff\", retriever=retriever, return_source_documents=True\n",
|
||||||
@ -184,21 +225,21 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 7,
|
"execution_count": 9,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"{'query': 'What can tenants do with signage on their properties?',\n",
|
"{'query': 'What can tenants do with signage on their properties?',\n",
|
||||||
" 'result': \" Tenants can place or attach signs (digital or otherwise) to their premises with written permission from the landlord. The signs must conform to all applicable laws, ordinances, etc. governing the same. Tenants can also have their name listed in the building's directory at the landlord's cost.\",\n",
|
" 'result': ' Tenants can place or attach signage (digital or otherwise) to their property after receiving written permission from the landlord, which permission shall not be unreasonably withheld. The signage must conform to all applicable laws, ordinances, etc. governing the same, and tenants must remove all such signs by the termination of the lease.',\n",
|
||||||
" 'source_documents': [Document(page_content='ARTICLE VI SIGNAGE 6.01 Signage . Tenant may place or attach to the Premises signs (digital or otherwise) or other such identification as needed after receiving written permission from the Landlord , which permission shall not be unreasonably withheld. Any damage caused to the Premises by the Tenant ’s erecting or removing such signs shall be repaired promptly by the Tenant at the Tenant ’s expense . Any signs or other form of identification allowed must conform to all applicable laws, ordinances, etc. governing the same. Tenant also agrees to have any window or glass identification completely removed and cleaned at its expense promptly upon vacating the Premises.', metadata={'Landlord': 'BUBBA CENTER PARTNERSHIP', 'Lease Date': 'April 24 \\n\\n ,', 'Lease Parties': 'This OFFICE LEASE AGREEMENT (this \"Lease\") is made and entered into by and between BUBBA CENTER PARTNERSHIP (\" Landlord \"), and Truetone Lane LLC , a Delaware limited liability company (\" Tenant \").', 'Tenant': 'Truetone Lane LLC', 'id': 'v1bvgaozfkak', 'source': 'TruTone Lane 2.docx', 'structure': 'div', 'tag': '_601Signage', 'xpath': '/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT/docset:Article/docset:ARTICLEVISIGNAGE-section/docset:_601Signage-section/docset:_601Signage'}),\n",
|
" 'source_documents': [Document(page_content='6.01 Signage. Tenant may place or attach to the Premises signs (digital or otherwise) or other such identification as needed after receiving written permission from the Landlord, which permission shall not be unreasonably withheld. Any damage caused to the Premises by the Tenant’s erecting or removing such signs shall be repaired promptly by the Tenant at the Tenant’s expense. Any signs or other form of identification allowed must conform to all applicable laws, ordinances, etc. governing the same. Tenant also agrees to have any window or glass identification completely removed and cleaned at its expense promptly upon vacating the Premises. ARTICLE VII UTILITIES', metadata={'id': '1c290eea05915ba0f24c4a1ffc05d6f3', 'name': 'Sample Commercial Leases/TruTone Lane 6.pdf', 'structure': 'lim h1', 'xpath': '/dg:chunk/dg:chunk/dg:chunk[2]/dg:chunk[1]/docset:TheApprovedUse/dg:chunk[12]/dg:chunk[1]'}),\n",
|
||||||
" Document(page_content='Signage. Tenant may place or attach to the Premises signs (digital or otherwise) or other such identification as needed after receiving written permission from the Landlord , which permission shall not be unreasonably withheld. Any damage caused to the Premises by the Tenant ’s erecting or removing such signs shall be repaired promptly by the Tenant at the Tenant ’s expense . Any signs or other form of identification allowed must conform to all applicable laws, ordinances, etc. governing the same. Tenant also agrees to have any window or glass identification completely removed and cleaned at its expense promptly upon vacating the Premises. \\n\\n ARTICLE VII UTILITIES 7.01', metadata={'Landlord': 'GLORY ROAD LLC', 'Lease Date': 'April 30 , 2020', 'Lease Parties': 'This OFFICE LEASE AGREEMENT (this \"Lease\") is made and entered into by and between GLORY ROAD LLC (\" Landlord \"), and Truetone Lane LLC , a Delaware limited liability company (\" Tenant \").', 'Tenant': 'Truetone Lane LLC', 'id': 'g2fvhekmltza', 'source': 'TruTone Lane 6.pdf', 'structure': 'lim', 'tag': 'chunk', 'xpath': '/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT/docset:Article/docset:ArticleIiiUse/docset:ARTICLEIIIUSEANDCAREOFPREMISES-section/docset:ARTICLEIIIUSEANDCAREOFPREMISES/docset:AnyTime/docset:Addition/dg:chunk'}),\n",
|
" Document(page_content='6.01 Signage. Tenant may place or attach to the Premises signs (digital or otherwise) or other such identification as needed after receiving written permission from the Landlord, which permission shall not be unreasonably withheld. Any damage caused to the Premises by the Tenant’s erecting or removing such signs shall be repaired promptly by the Tenant at the Tenant’s expense. Any signs or other form of identification allowed must conform to all applicable laws, ordinances, etc. governing the same. Tenant also agrees to have any window or glass identification completely removed and cleaned at its expense promptly upon vacating the Premises. ARTICLE VII UTILITIES', metadata={'id': '1c290eea05915ba0f24c4a1ffc05d6f3', 'name': 'Sample Commercial Leases/TruTone Lane 2.pdf', 'structure': 'lim h1', 'xpath': '/dg:chunk/dg:chunk/dg:chunk[2]/dg:chunk[1]/docset:TheApprovedUse/dg:chunk[12]/dg:chunk[1]'}),\n",
|
||||||
" Document(page_content='Landlord , its agents, servants, employees, licensees, invitees, and contractors during the last year of the term of this Lease at any and all times during regular business hours, after 24 hour notice to tenant, to pass and repass on and through the Premises, or such portion thereof as may be necessary, in order that they or any of them may gain access to the Premises for the purpose of showing the Premises to potential new tenants or real estate brokers. In addition, Landlord shall be entitled to place a \"FOR RENT \" or \"FOR LEASE\" sign (not exceeding 8.5 ” x 11 ”) in the front window of the Premises during the last six months of the term of this Lease .', metadata={'Landlord': 'BIRCH STREET , LLC', 'Lease Date': 'October 15 , 2021', 'Lease Parties': 'The provisions of this rider are hereby incorporated into and made a part of the Lease dated as of October 15 , 2021 between BIRCH STREET , LLC , having an address at c/o Birch Palace , 6 Grace Avenue Suite 200 , Great Neck , New York 11021 (\" Landlord \"), and Trutone Lane LLC , having an address at 4 Pearl Street , New York , New York 10012 (\" Tenant \") of Premises known as the ground floor space and lower level space, as per floor plan annexed hereto and made a part hereof as Exhibit A (“Premises”) at 4 Pearl Street , New York , New York 10012 in the City of New York , Borough of Manhattan , to which this rider is annexed. If there is any conflict between the provisions of this rider and the remainder of this Lease , the provisions of this rider shall govern.', 'Tenant': 'Trutone Lane LLC', 'id': 'omvs4mysdk6b', 'source': 'TruTone Lane 1.docx', 'structure': 'p', 'tag': 'Landlord', 'xpath': '/docset:Rider/docset:RIDERTOLEASE-section/docset:RIDERTOLEASE/docset:FixedRent/docset:TermYearPeriod/docset:Lease/docset:_42FLandlordSAccess-section/docset:_42FLandlordSAccess/docset:LandlordsRights/docset:Landlord'}),\n",
|
" Document(page_content='Tenant may place or attach to the Premises signs (digital or otherwise) or other such identification as needed after receiving written permission from the Landlord, which permission shall not be unreasonably withheld. Any damage caused to the Premises by the Tenant’s erecting or removing such signs shall be repaired promptly by the Tenant at the Tenant’s expense. Any signs or other form of identification allowed must conform to all applicable laws, ordinances, etc. governing the same. Tenant also agrees to have any window or glass identification completely removed and cleaned at its expense promptly upon vacating the Premises.', metadata={'id': '58d268162ecc36d8633b7bc364afcb8c', 'name': 'Sample Commercial Leases/TruTone Lane 2.docx', 'structure': 'div', 'xpath': '/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT/dg:chunk/docset:ARTICLEVISIGNAGE-section/docset:ARTICLEVISIGNAGE/docset:_601Signage'}),\n",
|
||||||
" Document(page_content=\"24. SIGNS . No signage shall be placed by Tenant on any portion of the Project . However, Tenant shall be permitted to place a sign bearing its name in a location approved by Landlord near the entrance to the Premises (at Tenant's cost ) and will be furnished a single listing of its name in the Building's directory (at Landlord 's cost ), all in accordance with the criteria adopted from time to time by Landlord for the Project . Any changes or additional listings in the directory shall be furnished (subject to availability of space) for the then Building Standard charge .\", metadata={'Landlord': 'Perry & Blair LLC', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'dsyfhh4vpeyf', 'source': 'Shorebucks LLC_CO.pdf', 'structure': 'div', 'tag': 'SIGNS', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:GrossRentCreditTheRentCredit-section/docset:GrossRentCreditTheRentCredit/docset:ThisLease-section/docset:ThisLease/docset:Guaranty-section/docset:Guaranty[2]/docset:TheTransfer/docset:TheTerms/docset:Indemnification/docset:INDEMNIFICATION-section/docset:INDEMNIFICATION/docset:Waiver/docset:Waiver/docset:Signs/docset:SIGNS-section/docset:SIGNS'})]}"
|
" Document(page_content='8. SIGNS:\\n Tenant shall not install signs upon the Premises without Landlord’s prior written approval, which approval shall not be unreasonably withheld or delayed, and any such signage shall be subject to any applicable governmental laws, ordinances, regulations, and other requirements. Tenant shall remove all such signs by the terminations of this Lease. Such installations and removals shall be made in such a manner as to avoid injury or defacement of the Building and other improvements, and Tenant shall repair any injury or defacement, including without limitation discoloration caused by such installations and/or removal.', metadata={'id': '6b7d88f0c979c65d5db088fc177fa81f', 'name': 'Lease Agreements/Bioplex, Inc.pdf', 'structure': 'lim h1 div', 'xpath': '/dg:chunk/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/docset:TheObligation/dg:chunk[8]/dg:chunk'})]}"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 7,
|
"execution_count": 9,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"output_type": "execute_result"
|
"output_type": "execute_result"
|
||||||
}
|
}
|
||||||
@ -212,7 +253,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## Using Docugami to Add Metadata to Chunks for High Accuracy Document QA\n",
|
"## Using Docugami Knowledge Graph for High Accuracy Document QA\n",
|
||||||
"\n",
|
"\n",
|
||||||
"One issue with large documents is that the correct answer to your question may depend on chunks that are far apart in the document. Typical chunking techniques, even with overlap, will struggle with providing the LLM sufficient context to answer such questions. With upcoming very large context LLMs, it may be possible to stuff a lot of tokens, perhaps even entire documents, inside the context but this will still hit limits at some point with very long documents, or a lot of documents.\n",
|
"One issue with large documents is that the correct answer to your question may depend on chunks that are far apart in the document. Typical chunking techniques, even with overlap, will struggle with providing the LLM sufficient context to answer such questions. With upcoming very large context LLMs, it may be possible to stuff a lot of tokens, perhaps even entire documents, inside the context but this will still hit limits at some point with very long documents, or a lot of documents.\n",
|
||||||
"\n",
|
"\n",
|
||||||
@ -221,16 +262,16 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 8,
|
"execution_count": 10,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"' 9,753 square feet.'"
|
"\" I don't know.\""
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 8,
|
"execution_count": 10,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"output_type": "execute_result"
|
"output_type": "execute_result"
|
||||||
}
|
}
|
||||||
@ -240,28 +281,21 @@
|
|||||||
"chain_response[\"result\"] # correct answer should be 13,500 sq ft"
|
"chain_response[\"result\"] # correct answer should be 13,500 sq ft"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
|
||||||
"cell_type": "markdown",
|
|
||||||
"metadata": {},
|
|
||||||
"source": [
|
|
||||||
"At first glance the answer may seem reasonable, but if you review the source chunks carefully for this answer, you will see that the chunking of the document did not end up putting the Landlord name and the rentable area in the same context, since they are far apart in the document. The retriever therefore ends up finding unrelated chunks from other documents not even related to the **DHA Group** landlord. That landlord happens to be mentioned on the first page of the file **Shorebucks LLC_NJ.pdf**, and while one of the source chunks used by the chain is indeed from that doc that contains the correct answer (**13,500**), other source chunks from different docs are included, and the answer is therefore incorrect."
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 13,
|
"execution_count": 11,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"[Document(page_content='1.1 Landlord . DHA Group , a Delaware limited liability company authorized to transact business in New Jersey .', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'md8rieecquyv', 'source': 'Shorebucks LLC_NJ.pdf', 'structure': 'div', 'tag': 'DhaGroup', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:TheTerms/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:DhaGroup/docset:DhaGroup/docset:DhaGroup/docset:Landlord-section/docset:DhaGroup'}),\n",
|
"[Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:CatalystGroup/dg:chunk[6]/dg:chunk'}),\n",
|
||||||
" Document(page_content='WITNESSES: LANDLORD: DHA Group , a Delaware limited liability company', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'md8rieecquyv', 'source': 'Shorebucks LLC_NJ.pdf', 'structure': 'p', 'tag': 'DhaGroup', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:GrossRentCreditTheRentCredit-section/docset:GrossRentCreditTheRentCredit/docset:Guaranty-section/docset:Guaranty[2]/docset:SIGNATURESONNEXTPAGE-section/docset:INWITNESSWHEREOF-section/docset:INWITNESSWHEREOF/docset:Behalf/docset:Witnesses/xhtml:table/xhtml:tbody/xhtml:tr[3]/xhtml:td[2]/docset:DhaGroup'}),\n",
|
" Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:MenloGroup/dg:chunk[6]/dg:chunk'}),\n",
|
||||||
" Document(page_content=\"1.16 Landlord 's Notice Address . DHA Group , Suite 1010 , 111 Bauer Dr , Oakland , New Jersey , 07436 , with a copy to the Building Management Office at the Project , Attention: On - Site Property Manager .\", metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'md8rieecquyv', 'source': 'Shorebucks LLC_NJ.pdf', 'structure': 'div', 'tag': 'LandlordsNoticeAddress', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:GrossRentCreditTheRentCredit-section/docset:GrossRentCreditTheRentCredit/docset:Period/docset:ApplicableSalesTax/docset:PercentageRent/docset:PercentageRent/docset:NoticeAddress[2]/docset:LandlordsNoticeAddress-section/docset:LandlordsNoticeAddress[2]'}),\n",
|
" Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_FL.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:Florida-section/docset:Florida/docset:Shorebucks/dg:chunk[5]/dg:chunk'}),\n",
|
||||||
" Document(page_content='1.6 Rentable Area of the Premises. 9,753 square feet . This square footage figure includes an add-on factor for Common Areas in the Building and has been agreed upon by the parties as final and correct and is not subject to challenge or dispute by either party.', metadata={'Landlord': 'Perry & Blair LLC', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'dsyfhh4vpeyf', 'source': 'Shorebucks LLC_CO.pdf', 'structure': 'div', 'tag': 'RentableAreaofthePremises', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:TheTerms/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:PerryBlair/docset:PerryBlair/docset:Premises[2]/docset:RentableAreaofthePremises-section/docset:RentableAreaofthePremises'})]"
|
" Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_TX.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:LandmarkLlc/dg:chunk[6]/dg:chunk'})]"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 13,
|
"execution_count": 11,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"output_type": "execute_result"
|
"output_type": "execute_result"
|
||||||
}
|
}
|
||||||
@ -270,43 +304,42 @@
|
|||||||
"chain_response[\"source_documents\"]"
|
"chain_response[\"source_documents\"]"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"At first glance the answer may seem reasonable, but it is incorrect. If you review the source chunks for this answer carefully, you will see that the chunking of the document did not put the Landlord name and the rentable area in the same context, so the retriever returned irrelevant chunks and the answer is incorrect (it should be **13,500 sq ft**)."
|
||||||
|
]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Docugami can help here. Chunks are annotated with additional metadata created using different techniques if a user has been [using Docugami](https://help.docugami.com/home/reports). More technical approaches will be added later.\n",
|
"Docugami can help here. Chunks are annotated with additional metadata created using different techniques if a user has been [using Docugami](https://help.docugami.com/home/reports). More technical approaches will be added later.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Specifically, let's look at the additional metadata that is returned on the documents returned by docugami, in the form of some simple key/value pairs on all the text chunks:"
|
"Specifically, let's ask Docugami to return XML tags on its output, as well as additional metadata:"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 14,
|
"execution_count": 12,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"name": "stdout",
|
||||||
"text/plain": [
|
"output_type": "stream",
|
||||||
"{'xpath': '/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT/docset:LeaseParties',\n",
|
"text": [
|
||||||
" 'id': 'v1bvgaozfkak',\n",
|
"{'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': '47297e277e556f3ce8b570047304560b', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'structure': 'h1 h1 p', 'tag': 'chunk Lease', 'Lease Date': 'March 29th , 2019', 'Landlord': 'Menlo Group', 'Tenant': 'Shorebucks LLC', 'Premises Address': '1564 E Broadway Rd , Tempe , Arizona 85282', 'Term of Lease': '96 full calendar months', 'Square Feet': '16,159'}\n"
|
||||||
" 'source': 'TruTone Lane 2.docx',\n",
|
|
||||||
" 'structure': 'p',\n",
|
|
||||||
" 'tag': 'LeaseParties',\n",
|
|
||||||
" 'Lease Date': 'April 24 \\n\\n ,',\n",
|
|
||||||
" 'Landlord': 'BUBBA CENTER PARTNERSHIP',\n",
|
|
||||||
" 'Tenant': 'Truetone Lane LLC',\n",
|
|
||||||
" 'Lease Parties': 'This OFFICE LEASE AGREEMENT (this \"Lease\") is made and entered into by and between BUBBA CENTER PARTNERSHIP (\" Landlord \"), and Truetone Lane LLC , a Delaware limited liability company (\" Tenant \").'}"
|
|
||||||
]
|
]
|
||||||
},
|
|
||||||
"execution_count": 14,
|
|
||||||
"metadata": {},
|
|
||||||
"output_type": "execute_result"
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"source": [
|
"source": [
|
||||||
"loader = DocugamiLoader(docset_id=\"wh2kned25uqm\")\n",
|
"loader = DocugamiLoader(docset_id=\"zo954yqy53wp\")\n",
|
||||||
"documents = loader.load()\n",
|
"loader.include_xml_tags = (\n",
|
||||||
"documents[0].metadata"
|
" True # for additional semantics from the Docugami knowledge graph\n",
|
||||||
|
")\n",
|
||||||
|
"chunks = loader.load()\n",
|
||||||
|
"print(chunks[0].metadata)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@ -318,12 +351,22 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 15,
|
"execution_count": 13,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"!poetry run pip install --upgrade lark --quiet"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 14,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"from langchain.chains.query_constructor.schema import AttributeInfo\n",
|
"from langchain.chains.query_constructor.schema import AttributeInfo\n",
|
||||||
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
|
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
|
||||||
|
"from langchain.vectorstores.chroma import Chroma\n",
|
||||||
"\n",
|
"\n",
|
||||||
"EXCLUDE_KEYS = [\"id\", \"xpath\", \"structure\"]\n",
|
"EXCLUDE_KEYS = [\"id\", \"xpath\", \"structure\"]\n",
|
||||||
"metadata_field_info = [\n",
|
"metadata_field_info = [\n",
|
||||||
@ -332,19 +375,23 @@
|
|||||||
" description=f\"The {key} for this chunk\",\n",
|
" description=f\"The {key} for this chunk\",\n",
|
||||||
" type=\"string\",\n",
|
" type=\"string\",\n",
|
||||||
" )\n",
|
" )\n",
|
||||||
" for key in documents[0].metadata\n",
|
" for key in chunks[0].metadata\n",
|
||||||
" if key.lower() not in EXCLUDE_KEYS\n",
|
" if key.lower() not in EXCLUDE_KEYS\n",
|
||||||
"]\n",
|
"]\n",
|
||||||
"\n",
|
"\n",
|
||||||
"\n",
|
|
||||||
"document_content_description = \"Contents of this chunk\"\n",
|
"document_content_description = \"Contents of this chunk\"\n",
|
||||||
"llm = OpenAI(temperature=0)\n",
|
"llm = OpenAI(temperature=0)\n",
|
||||||
"vectordb = Chroma.from_documents(documents=documents, embedding=embedding)\n",
|
"\n",
|
||||||
|
"vectordb = Chroma.from_documents(documents=chunks, embedding=embedding)\n",
|
||||||
"retriever = SelfQueryRetriever.from_llm(\n",
|
"retriever = SelfQueryRetriever.from_llm(\n",
|
||||||
" llm, vectordb, document_content_description, metadata_field_info, verbose=True\n",
|
" llm, vectordb, document_content_description, metadata_field_info, verbose=True\n",
|
||||||
")\n",
|
")\n",
|
||||||
"qa_chain = RetrievalQA.from_chain_type(\n",
|
"qa_chain = RetrievalQA.from_chain_type(\n",
|
||||||
" llm=OpenAI(), chain_type=\"stuff\", retriever=retriever, return_source_documents=True\n",
|
" llm=OpenAI(),\n",
|
||||||
|
" chain_type=\"stuff\",\n",
|
||||||
|
" retriever=retriever,\n",
|
||||||
|
" return_source_documents=True,\n",
|
||||||
|
" verbose=True,\n",
|
||||||
")"
|
")"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@ -357,36 +404,32 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 12,
|
"execution_count": 15,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
|
||||||
"name": "stderr",
|
|
||||||
"output_type": "stream",
|
|
||||||
"text": [
|
|
||||||
"/root/Source/github/docugami.langchain/libs/langchain/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
|
|
||||||
" warnings.warn(\n"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
{
|
||||||
"name": "stdout",
|
"name": "stdout",
|
||||||
"output_type": "stream",
|
"output_type": "stream",
|
||||||
"text": [
|
"text": [
|
||||||
"query='rentable area' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Landlord', value='DHA Group') limit=None\n"
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"\u001b[1m> Entering new RetrievalQA chain...\u001b[0m\n",
|
||||||
|
"\n",
|
||||||
|
"\u001b[1m> Finished chain.\u001b[0m\n"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"{'query': 'What is rentable area for the property owned by DHA Group?',\n",
|
"{'query': 'What is rentable area for the property owned by DHA Group?',\n",
|
||||||
" 'result': ' The rentable area for the property owned by DHA Group is 13,500 square feet.',\n",
|
" 'result': ' The rentable area of the property owned by DHA Group is 13,500 square feet.',\n",
|
||||||
" 'source_documents': [Document(page_content='1.6 Rentable Area of the Premises. 13,500 square feet . This square footage figure includes an add-on factor for Common Areas in the Building and has been agreed upon by the parties as final and correct and is not subject to challenge or dispute by either party.', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'md8rieecquyv', 'source': 'Shorebucks LLC_NJ.pdf', 'structure': 'div', 'tag': 'RentableAreaofthePremises', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:TheTerms/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:DhaGroup/docset:DhaGroup/docset:Premises[2]/docset:RentableAreaofthePremises-section/docset:RentableAreaofthePremises'}),\n",
|
" 'source_documents': [Document(page_content='1.6 Rentable Area of the Premises.', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Premises Address': '111 Bauer Dr , Oakland , New Jersey , 07436', 'Square Feet': '13,500', 'Tenant': 'Shorebucks LLC', 'Term of Lease': '84 full calendar months', 'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'lim h1', 'tag': 'chunk', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:DhaGroup/dg:chunk[6]/dg:chunk'}),\n",
|
||||||
" Document(page_content='1.6 Rentable Area of the Premises. 13,500 square feet . This square footage figure includes an add-on factor for Common Areas in the Building and has been agreed upon by the parties as final and correct and is not subject to challenge or dispute by either party.', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'md8rieecquyv', 'source': 'Shorebucks LLC_NJ.pdf', 'structure': 'div', 'tag': 'RentableAreaofthePremises', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:TheTerms/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:DhaGroup/docset:DhaGroup/docset:Premises[2]/docset:RentableAreaofthePremises-section/docset:RentableAreaofthePremises'}),\n",
|
" Document(page_content='<RentableAreaofthePremises><SquareFeet>13,500 </SquareFeet>square feet. This square footage figure includes an add-on factor for Common Areas in the Building and has been agreed upon by the parties as final and correct and is not subject to challenge or dispute by either party. </RentableAreaofthePremises>', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Premises Address': '111 Bauer Dr , Oakland , New Jersey , 07436', 'Square Feet': '13,500', 'Tenant': 'Shorebucks LLC', 'Term of Lease': '84 full calendar months', 'id': '4c06903d087f5a83e486ee42cd702d31', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'div', 'tag': 'RentableAreaofthePremises', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:DhaGroup/dg:chunk[6]/docset:RentableAreaofthePremises-section/docset:RentableAreaofthePremises'}),\n",
|
||||||
" Document(page_content='1.11 Percentage Rent . (a) 55 % of Gross Revenue to Landlord until Landlord receives Percentage Rent in an amount equal to the Annual Market Rent Hurdle (as escalated); and', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'md8rieecquyv', 'source': 'Shorebucks LLC_NJ.pdf', 'structure': 'p', 'tag': 'GrossRevenue', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:GrossRentCreditTheRentCredit-section/docset:GrossRentCreditTheRentCredit/docset:Period/docset:ApplicableSalesTax/docset:PercentageRent/docset:PercentageRent/docset:PercentageRent/docset:PercentageRent-section/docset:PercentageRent[2]/docset:PercentageRent/docset:GrossRevenue[1]/docset:GrossRevenue'}),\n",
|
" Document(page_content='<TheTermAnnualMarketRent>shall mean (i) for the initial Lease Year (“Year 1”) <Money>$2,239,748.00 </Money>per year (i.e., the product of the Rentable Area of the Premises multiplied by <Money>$82.00</Money>) (the “Year 1 Market Rent Hurdle”); (ii) for the Lease Year thereafter, <Percent>one hundred three percent (103%) </Percent>of the Year 1 Market Rent Hurdle, and (iii) for each Lease Year thereafter until the termination or expiration of this Lease, the Annual Market Rent Threshold shall be <AnnualMarketRentThreshold>one hundred three percent (103%) </AnnualMarketRentThreshold>of the Annual Market Rent Threshold for the immediately prior Lease Year. </TheTermAnnualMarketRent>', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Premises Address': '111 Bauer Dr , Oakland , New Jersey , 07436', 'Square Feet': '13,500', 'Tenant': 'Shorebucks LLC', 'Term of Lease': '84 full calendar months', 'id': '6b90beeadace5d4d12b25706fb48e631', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'div', 'tag': 'TheTermAnnualMarketRent', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:GrossRentCredit-section/docset:GrossRentCredit/dg:chunk/dg:chunk/dg:chunk/dg:chunk[2]/docset:PercentageRent/dg:chunk[2]/dg:chunk[2]/docset:TenantSRevenue/dg:chunk[2]/docset:TenantSRevenue/dg:chunk[3]/docset:TheTermAnnualMarketRent-section/docset:TheTermAnnualMarketRent'}),\n",
|
||||||
" Document(page_content='1.11 Percentage Rent . (a) 55 % of Gross Revenue to Landlord until Landlord receives Percentage Rent in an amount equal to the Annual Market Rent Hurdle (as escalated); and', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Lease Parties': 'THIS OFFICE LEASE (the \"Lease\") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant . \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease .', 'Tenant': 'Shorebucks LLC', 'id': 'md8rieecquyv', 'source': 'Shorebucks LLC_NJ.pdf', 'structure': 'p', 'tag': 'GrossRevenue', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:THISOFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:GrossRentCreditTheRentCredit-section/docset:GrossRentCreditTheRentCredit/docset:Period/docset:ApplicableSalesTax/docset:PercentageRent/docset:PercentageRent/docset:PercentageRent/docset:PercentageRent-section/docset:PercentageRent[2]/docset:PercentageRent/docset:GrossRevenue[1]/docset:GrossRevenue'})]}"
|
" Document(page_content='1.11 Percentage Rent.\\n (a) <GrossRevenue><Percent>55% </Percent>of Gross Revenue to Landlord until Landlord receives Percentage Rent in an amount equal to the Annual Market Rent Hurdle (as escalated); and </GrossRevenue>', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March 29th , 2019', 'Premises Address': '111 Bauer Dr , Oakland , New Jersey , 07436', 'Square Feet': '13,500', 'Tenant': 'Shorebucks LLC', 'Term of Lease': '84 full calendar months', 'id': 'c8bb9cbedf65a578d9db3f25f519dd3d', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'lim h1 lim p', 'tag': 'chunk GrossRevenue', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/docset:GrossRentCredit-section/docset:GrossRentCredit/dg:chunk/dg:chunk/dg:chunk/docset:PercentageRent/dg:chunk[1]/dg:chunk[1]'})]}"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 12,
|
"execution_count": 15,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"output_type": "execute_result"
|
"output_type": "execute_result"
|
||||||
}
|
}
|
||||||
@ -403,6 +446,198 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"This time the answer is correct, since the self-querying retriever created a filter on the landlord attribute of the metadata, correctly filtering to the document that is specifically about the DHA Group landlord. The resulting source chunks are all relevant to this landlord, and this improves answer accuracy even though the landlord is not directly mentioned in the specific chunk that contains the correct answer."
|
"This time the answer is correct, since the self-querying retriever created a filter on the landlord attribute of the metadata, correctly filtering to the document that is specifically about the DHA Group landlord. The resulting source chunks are all relevant to this landlord, and this improves answer accuracy even though the landlord is not directly mentioned in the specific chunk that contains the correct answer."
|
||||||
]
|
]
|
||||||
|
},
|
||||||
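The effect of the metadata filter can be illustrated without any vector store. The sketch below is not how `SelfQueryRetriever` is implemented; it only mimics the equality comparison on the `Landlord` attribute with plain dictionaries. The chunk records and the `filter_by_metadata` helper are hypothetical, with values copied from the metadata printed in this notebook.

```python
from typing import Dict, List

# Hypothetical, trimmed-down chunks; values copied from the notebook output.
chunks: List[Dict] = [
    {
        "page_content": "1.6 Rentable Area of the Premises.",
        "metadata": {"Landlord": "DHA Group", "Square Feet": "13,500"},
    },
    {
        "page_content": "1.6 Rentable Area of the Premises.",
        "metadata": {"Landlord": "Menlo Group", "Square Feet": "16,159"},
    },
]


def filter_by_metadata(items: List[Dict], attribute: str, value: str) -> List[Dict]:
    """Keep only the chunks whose metadata attribute equals the given value."""
    return [c for c in items if c["metadata"].get(attribute) == value]


# Equivalent in spirit to an "eq" comparison on attribute "Landlord".
matches = filter_by_metadata(chunks, "Landlord", "DHA Group")
print(matches[0]["metadata"]["Square Feet"])
```

Both chunks have identical text, so similarity alone cannot tell them apart; only the metadata filter selects the DHA Group lease.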
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Advanced Topic: Small-to-Big Retrieval with Document Knowledge Graph Hierarchy"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"Documents are inherently semi-structured, and the DocugamiLoader is able to navigate the semantic and structural contours of the document to provide parent chunk references on the chunks it returns. This is useful, for example, with the [MultiVector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) for [small-to-big](https://www.youtube.com/watch?v=ihSiRrOUwmg) retrieval.\n",
|
||||||
|
"\n",
|
||||||
|
"To get parent chunk references, you can set `loader.parent_hierarchy_levels` to a non-zero value."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 16,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"from typing import Dict, List\n",
|
||||||
|
"\n",
|
||||||
|
"from langchain.document_loaders import DocugamiLoader\n",
|
||||||
|
"from langchain.schema.document import Document\n",
|
||||||
|
"\n",
|
||||||
|
"loader = DocugamiLoader(docset_id=\"zo954yqy53wp\")\n",
|
||||||
|
"loader.include_xml_tags = (\n",
|
||||||
|
" True # for additional semantics from the Docugami knowledge graph\n",
|
||||||
|
")\n",
|
||||||
|
"loader.parent_hierarchy_levels = 3 # for expanded context\n",
|
||||||
|
"loader.max_text_length = (\n",
|
||||||
|
" 1024 * 8\n",
|
||||||
|
") # 8K chars are roughly 2K tokens (ref: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)\n",
|
||||||
|
"loader.include_project_metadata_in_doc_metadata = (\n",
|
||||||
|
" False # Not filtering on vector metadata, so remove to lighten the vectors\n",
|
||||||
|
")\n",
|
||||||
|
"chunks: List[Document] = loader.load()\n",
|
||||||
|
"\n",
|
||||||
|
"# build separate maps of parent and child chunks\n",
|
||||||
|
"parents_by_id: Dict[str, Document] = {}\n",
|
||||||
|
"children_by_id: Dict[str, Document] = {}\n",
|
||||||
|
"for chunk in chunks:\n",
|
||||||
|
" chunk_id = chunk.metadata.get(\"id\")\n",
|
||||||
|
" parent_chunk_id = chunk.metadata.get(loader.parent_id_key)\n",
|
||||||
|
" if not parent_chunk_id:\n",
|
||||||
|
" # parent chunk\n",
|
||||||
|
" parents_by_id[chunk_id] = chunk\n",
|
||||||
|
" else:\n",
|
||||||
|
" # child chunk\n",
|
||||||
|
" children_by_id[chunk_id] = chunk"
|
||||||
|
]
|
||||||
|
},
|
||||||
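The parent/child split in the cell above can be sketched on plain dictionaries, which makes the logic easy to check without a Docugami account. This is an illustration only: the `"doc_id"` key is an assumption inferred from the child chunk metadata printed in this notebook (the loader exposes the real key as `loader.parent_id_key`), and `split_parents_and_children` and `expand_to_parent` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

# Assumption: the parent reference lives under "doc_id", as the printed
# child chunk metadata suggests. The loader exposes it as parent_id_key.
PARENT_ID_KEY = "doc_id"

# Hypothetical, trimmed-down metadata records (ids copied from the notebook output).
chunk_metadata: List[Dict[str, str]] = [
    {"id": "7df09fbfc65bb8377054808aac2d16fd", "tag": "chunk Lease chunk TheTerms"},
    {
        "id": "47297e277e556f3ce8b570047304560b",
        "tag": "chunk Lease",
        PARENT_ID_KEY: "7df09fbfc65bb8377054808aac2d16fd",
    },
]


def split_parents_and_children(
    records: List[Dict[str, str]],
) -> Tuple[Dict[str, Dict[str, str]], Dict[str, Dict[str, str]]]:
    """Records without a parent reference are parents; the rest are children."""
    parents: Dict[str, Dict[str, str]] = {}
    children: Dict[str, Dict[str, str]] = {}
    for record in records:
        target = children if record.get(PARENT_ID_KEY) else parents
        target[record["id"]] = record
    return parents, children


def expand_to_parent(child: Dict[str, str], parents: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Small-to-big step: swap a matched child for its larger parent chunk."""
    return parents[child[PARENT_ID_KEY]]


parents, children = split_parents_and_children(chunk_metadata)
child = children["47297e277e556f3ce8b570047304560b"]
print(expand_to_parent(child, parents)["tag"])
```

The small child chunks are what gets embedded and searched; the lookup in `expand_to_parent` is what returns the wider context to the LLM.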
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 17,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"PARENT CHUNK 7df09fbfc65bb8377054808aac2d16fd: page_content='OFFICE LEASE\\n THIS OFFICE LEASE\\n <Lease>(the \"Lease\") is made and entered into as of <LeaseDate>March 29th, 2019</LeaseDate>, by and between Landlord and Tenant. \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease. </Lease>\\nW I T N E S S E T H\\n <TheTerms> Subject to and on the terms and conditions of this Lease, Landlord leases to Tenant and Tenant hires from Landlord the Premises. </TheTerms>\\n1. BASIC LEASE INFORMATION AND DEFINED TERMS.\\nThe key business terms of this Lease and the defined terms used in this Lease are as follows:' metadata={'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': '7df09fbfc65bb8377054808aac2d16fd', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'h1 h1 p h1 p lim h1 p', 'tag': 'chunk Lease chunk TheTerms'}\n",
|
||||||
|
"CHUNK 47297e277e556f3ce8b570047304560b: page_content='OFFICE LEASE\\n THIS OFFICE LEASE\\n <Lease>(the \"Lease\") is made and entered into as of <LeaseDate>March 29th, 2019</LeaseDate>, by and between Landlord and Tenant. \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease. </Lease>' metadata={'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': '47297e277e556f3ce8b570047304560b', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'h1 h1 p', 'tag': 'chunk Lease', 'doc_id': '7df09fbfc65bb8377054808aac2d16fd'}\n",
|
||||||
|
"PARENT CHUNK bb84925da3bed22c30ea1bdc173ff54f: page_content='OFFICE LEASE\\n THIS OFFICE LEASE\\n <Lease>(the \"Lease\") is made and entered into as of <LeaseDate>January 8th, 2018</LeaseDate>, by and between Landlord and Tenant. \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease. </Lease>\\nW I T N E S S E T H\\n <TheTerms> Subject to and on the terms and conditions of this Lease, Landlord leases to Tenant and Tenant hires from Landlord the Premises. </TheTerms>\\n1. BASIC LEASE INFORMATION AND DEFINED TERMS.\\nThe key business terms of this Lease and the defined terms used in this Lease are as follows:\\n1.1 Landlord.\\n <Landlord>Catalyst Group LLC </Landlord>' metadata={'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': 'bb84925da3bed22c30ea1bdc173ff54f', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'h1 h1 p h1 p lim h1 p lim h1 div', 'tag': 'chunk Lease chunk TheTerms chunk Landlord'}\n",
|
||||||
|
"CHUNK 2f1746cbd546d1d61a9250c50de7a7fa: page_content='W I T N E S S E T H\\n <TheTerms> Subject to and on the terms and conditions of this Lease, Landlord leases to Tenant and Tenant hires from Landlord the Premises. </TheTerms>' metadata={'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/dg:chunk', 'id': '2f1746cbd546d1d61a9250c50de7a7fa', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'h1 p', 'tag': 'chunk TheTerms', 'doc_id': 'bb84925da3bed22c30ea1bdc173ff54f'}\n",
|
||||||
|
"PARENT CHUNK 0b0d765b6e504a6ba54fa76b203e62ec: page_content='OFFICE LEASE\\n THIS OFFICE LEASE\\n <Lease>(the \"Lease\") is made and entered into as of <LeaseDate>January 8th, 2018</LeaseDate>, by and between Landlord and Tenant. \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease. </Lease>\\nW I T N E S S E T H\\n <TheTerms> Subject to and on the terms and conditions of this Lease, Landlord leases to Tenant and Tenant hires from Landlord the Premises. </TheTerms>\\n1. BASIC LEASE INFORMATION AND DEFINED TERMS.\\nThe key business terms of this Lease and the defined terms used in this Lease are as follows:\\n1.1 Landlord.\\n <Landlord>Catalyst Group LLC </Landlord>\\n1.2 Tenant.\\n <Tenant>Shorebucks LLC </Tenant>' metadata={'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': '0b0d765b6e504a6ba54fa76b203e62ec', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'h1 h1 p h1 p lim h1 p lim h1 div lim h1 div', 'tag': 'chunk Lease chunk TheTerms chunk Landlord chunk Tenant'}\n",
|
||||||
|
"CHUNK b362dfe776ec5a7a66451a8c7c220b59: page_content='1. BASIC LEASE INFORMATION AND DEFINED TERMS.' metadata={'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/dg:chunk', 'id': 'b362dfe776ec5a7a66451a8c7c220b59', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'lim h1', 'tag': 'chunk', 'doc_id': '0b0d765b6e504a6ba54fa76b203e62ec'}\n",
|
||||||
|
"PARENT CHUNK c942010baaf76aa4d4657769492f6edb: page_content='OFFICE LEASE\\n THIS OFFICE LEASE\\n <Lease>(the \"Lease\") is made and entered into as of <LeaseDate>January 8th, 2018</LeaseDate>, by and between Landlord and Tenant. \"Date of this Lease\" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease. </Lease>\\nW I T N E S S E T H\\n <TheTerms> Subject to and on the terms and conditions of this Lease, Landlord leases to Tenant and Tenant hires from Landlord the Premises. </TheTerms>\\n1. BASIC LEASE INFORMATION AND DEFINED TERMS.\\nThe key business terms of this Lease and the defined terms used in this Lease are as follows:\\n1.1 Landlord.\\n <Landlord>Catalyst Group LLC </Landlord>\\n1.2 Tenant.\\n <Tenant>Shorebucks LLC </Tenant>\\n1.3 Building.\\n <Building>The building containing the Premises located at <PremisesAddress><PremisesStreetAddress><MainStreet>600 </MainStreet><StreetName>Main Street</StreetName></PremisesStreetAddress>, <City>Bellevue</City>, <State>WA</State>, <Premises>98004</Premises></PremisesAddress>. The Building is located within the Project. </Building>' metadata={'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': 'c942010baaf76aa4d4657769492f6edb', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'h1 h1 p h1 p lim h1 p lim h1 div lim h1 div lim h1 div', 'tag': 'chunk Lease chunk TheTerms chunk Landlord chunk Tenant chunk Building'}\n",
|
||||||
|
"CHUNK a95971d693b7aa0f6640df1fbd18c2ba: page_content='The key business terms of this Lease and the defined terms used in this Lease are as follows:' metadata={'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/dg:chunk', 'id': 'a95971d693b7aa0f6640df1fbd18c2ba', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'p', 'tag': 'chunk', 'doc_id': 'c942010baaf76aa4d4657769492f6edb'}\n",
|
||||||
|
"PARENT CHUNK f34b649cde7fc4ae156849a56d690495: page_content='W I T N E S S E T H\\n <TheTerms> Subject to and on the terms and conditions of this Lease, Landlord leases to Tenant and Tenant hires from Landlord the Premises. </TheTerms>\\n1. BASIC LEASE INFORMATION AND DEFINED TERMS.\\n<BASICLEASEINFORMATIONANDDEFINEDTERMS>The key business terms of this Lease and the defined terms used in this Lease are as follows: </BASICLEASEINFORMATIONANDDEFINEDTERMS>\\n1.1 Landlord.\\n <Landlord><Landlord>Menlo Group</Landlord>, a <USState>Delaware </USState>limited liability company authorized to transact business in <USState>Arizona</USState>. </Landlord>\\n1.2 Tenant.\\n <Tenant>Shorebucks LLC </Tenant>\\n1.3 Building.\\n <Building>The building containing the Premises located at <PremisesAddress><PremisesStreetAddress><Premises>1564 </Premises><Premises>E Broadway Rd</Premises></PremisesStreetAddress>, <City>Tempe</City>, <USState>Arizona </USState><Premises>85282</Premises></PremisesAddress>. The Building is located within the Project. </Building>\\n1.4 Project.\\n <Project>The parcel of land and the buildings and improvements located on such land known as Shorebucks Office <ShorebucksOfficeAddress><ShorebucksOfficeStreetAddress><ShorebucksOffice>6 </ShorebucksOffice><ShorebucksOffice6>located at <Number>1564 </Number>E Broadway Rd</ShorebucksOffice6></ShorebucksOfficeStreetAddress>, <City>Tempe</City>, <USState>Arizona </USState><Number>85282</Number></ShorebucksOfficeAddress>. The Project is legally described in EXHIBIT \"A\" to this Lease. </Project>' metadata={'xpath': '/dg:chunk/docset:WITNESSETH-section/dg:chunk', 'id': 'f34b649cde7fc4ae156849a56d690495', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.docx', 'source': 'Sample Commercial Leases/Shorebucks LLC_AZ.docx', 'structure': 'h1 p lim h1 div lim h1 div lim h1 div lim h1 div lim h1 div', 'tag': 'chunk TheTerms BASICLEASEINFORMATIONANDDEFINEDTERMS chunk Landlord chunk Tenant chunk Building chunk Project'}\n",
|
||||||
|
"CHUNK 21b4d9517f7ccdc0e3a028ce5043a2a0: page_content='1.1 Landlord.\\n <Landlord><Landlord>Menlo Group</Landlord>, a <USState>Delaware </USState>limited liability company authorized to transact business in <USState>Arizona</USState>. </Landlord>' metadata={'xpath': '/dg:chunk/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk[1]/dg:chunk[1]/dg:chunk/dg:chunk[2]/dg:chunk', 'id': '21b4d9517f7ccdc0e3a028ce5043a2a0', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.docx', 'source': 'Sample Commercial Leases/Shorebucks LLC_AZ.docx', 'structure': 'lim h1 div', 'tag': 'chunk Landlord', 'doc_id': 'f34b649cde7fc4ae156849a56d690495'}\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# Explore some of the parent chunk relationships\n",
|
||||||
|
"for id, chunk in list(children_by_id.items())[:5]:\n",
|
||||||
|
" parent_chunk_id = chunk.metadata.get(loader.parent_id_key)\n",
|
||||||
|
" if parent_chunk_id:\n",
|
||||||
|
" # child chunks have the parent chunk id set\n",
|
||||||
|
" print(f\"PARENT CHUNK {parent_chunk_id}: {parents_by_id[parent_chunk_id]}\")\n",
|
||||||
|
" print(f\"CHUNK {id}: {chunk}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 18,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"from langchain.embeddings import OpenAIEmbeddings\n",
|
||||||
|
"from langchain.retrievers.multi_vector import MultiVectorRetriever, SearchType\n",
|
||||||
|
"from langchain.storage import InMemoryStore\n",
|
||||||
|
"from langchain.vectorstores.chroma import Chroma\n",
|
||||||
|
"\n",
|
||||||
|
"# The vectorstore to use to index the child chunks\n",
|
||||||
|
"vectorstore = Chroma(collection_name=\"big2small\", embedding_function=OpenAIEmbeddings())\n",
|
||||||
|
"\n",
|
||||||
|
"# The storage layer for the parent documents\n",
|
||||||
|
"store = InMemoryStore()\n",
|
||||||
|
"\n",
|
||||||
|
"# The retriever (empty to start)\n",
|
||||||
|
"retriever = MultiVectorRetriever(\n",
|
||||||
|
" vectorstore=vectorstore,\n",
|
||||||
|
" docstore=store,\n",
|
||||||
|
" search_type=SearchType.mmr, # use max marginal relevance search\n",
|
||||||
|
" search_kwargs={\"k\": 2},\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"# Add child chunks to vector store\n",
|
||||||
|
"retriever.vectorstore.add_documents(list(children_by_id.values()))\n",
|
||||||
|
"\n",
|
||||||
|
"# Add parent chunks to docstore\n",
|
||||||
|
"retriever.docstore.mset(list(parents_by_id.items()))"
|
||||||
|
]
|
||||||
|
},
|
||||||
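To see why max marginal relevance (MMR) helps when a docset contains near-duplicate chunks, here is a minimal pure-Python sketch of the selection rule. It is not LangChain's implementation; the `mmr` function, its `lambda_mult` parameter name, and the toy two-dimensional vectors are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: indices 0 and 1 are exact duplicates, index 2 is different.
query = [0.8, 0.6]
candidates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(mmr(query, candidates, k=2))
```

A plain top-2 similarity search on these vectors would return the two duplicates (indices 0 and 1), while the MMR rule picks index 0 and then the dissimilar index 2.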
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"24. SIGNS.\n",
|
||||||
|
" <SIGNS>No signage shall be placed by Tenant on any portion of the Project. However, Tenant shall be permitted to place a sign bearing its name in a location approved by Landlord near the entrance to the Premises (at Tenant's cost) and will be furnished a single listing of its name in the Building's directory (at Landlord's cost), all in accordance with the criteria adopted <Frequency>from time to time </Frequency>by Landlord for the Project. Any changes or additional listings in the directory shall be furnished (subject to availability of space) for the then Building Standard charge. </SIGNS>\n",
|
||||||
|
"43090337ed2409e0da24ee07e2adbe94\n",
|
||||||
|
"<TheExterior> Tenant agrees that all signs, awnings, protective gates, security devices and other installations visible from the exterior of the Premises shall be subject to Landlord's prior written approval, shall be subject to the prior approval of the <Org>Landmarks </Org><Landmarks>Preservation Commission </Landmarks>of the City of <USState>New <Org>York</Org></USState>, if required, and shall not interfere with or block either of the adjacent stores, provided, however, that Landlord shall not unreasonably withhold consent for signs that Tenant desires to install. Tenant agrees that any permitted signs, awnings, protective gates, security devices, and other installations shall be installed at Tenant’s sole cost and expense professionally prepared and dignified and subject to Landlord's prior written approval, which shall not be unreasonably withheld, delayed or conditioned, and subject to such reasonable rules and restrictions as Landlord <Frequency>from time to time </Frequency>may impose. Tenant shall submit to Landlord drawings of the proposed signs and other installations, showing the size, color, illumination and general appearance thereof, together with a statement of the manner in which the same are to be affixed to the Premises. Tenant shall not commence the installation of the proposed signs and other installations unless and until Landlord shall have approved the same in writing. . Tenant shall not install any neon sign. The aforesaid signs shall be used solely for the purpose of identifying Tenant's business. No changes shall be made in the signs and other installations without first obtaining Landlord's prior written consent thereto, which consent shall not be unreasonably withheld, delayed or conditioned. 
Tenant shall, at its own cost and expense, obtain and exhibit to Landlord such permits or certificates of approval as Tenant may be required to obtain from any and all City, State and other authorities having jurisdiction covering the erection, installation, maintenance or use of said signs or other installations, and Tenant shall maintain the said signs and other installations together with any appurtenances thereto in good order and condition and to the satisfaction of the Landlord and in accordance with any and all orders, regulations, requirements and rules of any public authorities having jurisdiction thereover. Landlord consents to Tenant’s Initial Signage described in annexed Exhibit D. </TheExterior>\n",
|
||||||
|
"54ddfc3e47f41af7e747b2bc439ea96b\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# Query vector store directly, should return chunks\n",
|
||||||
|
"found_chunks = vectorstore.similarity_search(\n",
|
||||||
|
" \"what signs does Birch Street allow on their property?\", k=2\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"for chunk in found_chunks:\n",
|
||||||
|
" print(chunk.page_content)\n",
|
||||||
|
" print(chunk.metadata[loader.parent_id_key])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 20,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"21. SERVICES AND UTILITIES.\n",
|
||||||
|
" <SERVICESANDUTILITIES>Landlord shall have no obligation to provide any utilities or services to the Premises other than passenger elevator service to the Premises. Tenant shall be solely responsible for and shall promptly pay all charges for water, electricity, or any other utility used or consumed in the Premises, including all costs associated with separately metering for the Premises. Tenant shall be responsible for repairs and maintenance to exit lighting, emergency lighting, and fire extinguishers for the Premises. Tenant is responsible for interior janitorial, pest control, and waste removal services. Landlord may at any time change the electrical utility provider for the Building. Tenant’s use of electrical, HVAC, or other services furnished by Landlord shall not exceed, either in voltage, rated capacity, use, or overall load, that which Landlord deems to be standard for the Building. In no event shall Landlord be liable for damages resulting from the failure to furnish any service, and any interruption or failure shall in no manner entitle Tenant to any remedies including abatement of Rent. If at any time during the Lease Term the Project has any type of card access system for the Parking Areas or the Building, Tenant shall purchase access cards for all occupants of the Premises from Landlord at a Building Standard charge and shall comply with Building Standard terms relating to access to the Parking Areas and the Building. </SERVICESANDUTILITIES>\n",
|
||||||
|
"22. SECURITY DEPOSIT.\n",
|
||||||
|
" <SECURITYDEPOSIT>The Security Deposit shall be held by Landlord as security for Tenant's full and faithful performance of this Lease including the payment of Rent. Tenant grants Landlord a security interest in the Security Deposit. The Security Deposit may be commingled with other funds of Landlord and Landlord shall have no liability for payment of any interest on the Security Deposit. Landlord may apply the Security Deposit to the extent required to cure any default by Tenant. If Landlord so applies the Security Deposit, Tenant shall deliver to Landlord the amount necessary to replenish the Security Deposit to its original sum within <Deliver>five days </Deliver>after notice from Landlord. The Security Deposit shall not be deemed an advance payment of Rent or a measure of damages for any default by Tenant, nor shall it be a defense to any action that Landlord may bring against Tenant. </SECURITYDEPOSIT>\n",
|
||||||
|
"23. GOVERNMENTAL REGULATIONS.\n",
|
||||||
|
" <GOVERNMENTALREGULATIONS>Tenant, at Tenant's sole cost and expense, shall promptly comply (and shall cause all subtenants and licensees to comply) with all laws, codes, and ordinances of governmental authorities, including the Americans with Disabilities Act of <AmericanswithDisabilitiesActDate>1990 </AmericanswithDisabilitiesActDate>as amended (the \"ADA\"), and all recorded covenants and restrictions affecting the Project, pertaining to Tenant, its conduct of business, and its use and occupancy of the Premises, including the performance of any work to the Common Areas required because of Tenant's specific use (as opposed to general office use) of the Premises or Alterations to the Premises made by Tenant. </GOVERNMENTALREGULATIONS>\n",
|
||||||
|
"24. SIGNS.\n",
|
||||||
|
" <SIGNS>No signage shall be placed by Tenant on any portion of the Project. However, Tenant shall be permitted to place a sign bearing its name in a location approved by Landlord near the entrance to the Premises (at Tenant's cost) and will be furnished a single listing of its name in the Building's directory (at Landlord's cost), all in accordance with the criteria adopted <Frequency>from time to time </Frequency>by Landlord for the Project. Any changes or additional listings in the directory shall be furnished (subject to availability of space) for the then Building Standard charge. </SIGNS>\n",
|
||||||
|
"25. BROKER.\n",
|
||||||
|
" <BROKER>Landlord and Tenant each represent and warrant that they have neither consulted nor negotiated with any broker or finder regarding the Premises, except the Landlord's Broker and Tenant's Broker. Tenant shall indemnify, defend, and hold Landlord harmless from and against any claims for commissions from any real estate broker other than Landlord's Broker and Tenant's Broker with whom Tenant has dealt in connection with this Lease. Landlord shall indemnify, defend, and hold Tenant harmless from and against payment of any leasing commission due Landlord's Broker and Tenant's Broker in connection with this Lease and any claims for commissions from any real estate broker other than Landlord's Broker and Tenant's Broker with whom Landlord has dealt in connection with this Lease. The terms of this article shall survive the expiration or earlier termination of this Lease. </BROKER>\n",
|
||||||
|
"26. END OF TERM.\n",
|
||||||
|
" <ENDOFTERM>Tenant shall surrender the Premises to Landlord at the expiration or sooner termination of this Lease or Tenant's right of possession in good order and condition, broom-clean, except for reasonable wear and tear. All Alterations made by Landlord or Tenant to the Premises shall become Landlord's property on the expiration or sooner termination of the Lease Term. On the expiration or sooner termination of the Lease Term, Tenant, at its expense, shall remove from the Premises all of Tenant's personal property, all computer and telecommunications wiring, and all Alterations that Landlord designates by notice to Tenant. Tenant shall also repair any damage to the Premises caused by the removal. Any items of Tenant's property that shall remain in the Premises after the expiration or sooner termination of the Lease Term, may, at the option of Landlord and without notice, be deemed to have been abandoned, and in that case, those items may be retained by Landlord as its property to be disposed of by Landlord, without accountability or notice to Tenant or any other party, in the manner Landlord shall determine, at Tenant's expense. </ENDOFTERM>\n",
|
||||||
|
"27. ATTORNEYS' FEES.\n",
|
||||||
|
" <ATTORNEYSFEES>Except as otherwise provided in this Lease, the prevailing party in any litigation or other dispute resolution proceeding, including arbitration, arising out of or in any manner based on or relating to this Lease, including tort actions and actions for injunctive, declaratory, and provisional relief, shall be entitled to recover from the losing party actual attorneys' fees and costs, including fees for litigating the entitlement to or amount of fees or costs owed under this provision, and fees in connection with bankruptcy, appellate, or collection proceedings. No person or entity other than Landlord or Tenant has any right to recover fees under this paragraph. In addition, if Landlord becomes a party to any suit or proceeding affecting the Premises or involving this Lease or Tenant's interest under this Lease, other than a suit between Landlord and Tenant, or if Landlord engages counsel to collect any of the amounts owed under this Lease, or to enforce performance of any of the agreements, conditions, covenants, provisions, or stipulations of this Lease, without commencing litigation, then the costs, expenses, and reasonable attorneys' fees and disbursements incurred by Landlord shall be paid to Landlord by Tenant. </ATTORNEYSFEES>\n",
|
||||||
|
"43090337ed2409e0da24ee07e2adbe94\n",
|
||||||
|
"<TenantsSoleCost> Tenant, at Tenant's sole cost and expense, shall be responsible for the removal and disposal of all of garbage, waste, and refuse from the Premises on a <Frequency>daily </Frequency>basis. Tenant shall cause all garbage, waste and refuse to be stored within the Premises until <Stored>thirty (30) minutes </Stored>before closing, except that Tenant shall be permitted, to the extent permitted by law, to place garbage outside the Premises after the time specified in the immediately preceding sentence for pick up prior to <PickUp>6:00 A.M. </PickUp>next following. Garbage shall be placed at the edge of the sidewalk in front of the Premises at the location furthest from he main entrance to the Building or such other location in front of the Building as may be specified by Landlord. </TenantsSoleCost>\n",
|
||||||
|
"<ItsSoleCost> Tenant, at its sole cost and expense, agrees to use all reasonable diligence in accordance with the best prevailing methods for the prevention and extermination of vermin, rats, and mice, mold, fungus, allergens, <Bacterium>bacteria </Bacterium>and all other similar conditions in the Premises. Tenant, at Tenant's expense, shall cause the Premises to be exterminated <Exterminated>from time to time </Exterminated>to the reasonable satisfaction of Landlord and shall employ licensed exterminating companies. Landlord shall not be responsible for any cleaning, waste removal, janitorial, or similar services for the Premises, and Tenant sha ll not be entitled to seek any abatement, setoff or credit from the Landlord in the event any conditions described in this Article are found to exist in the Premises. </ItsSoleCost>\n",
|
||||||
|
"42B. Sidewalk Use and Maintenance\n",
|
||||||
|
"<TheSidewalk> Tenant shall, at its sole cost and expense, keep the sidewalk in front of the Premises 18 inches into the street from the curb clean free of garbage, waste, refuse, excess water, snow, and ice and Tenant shall pay, as additional rent, any fine, cost, or expense caused by Tenant's failure to do so. In the event Tenant operates a sidewalk café, Tenant shall, at its sole cost and expense, maintain, repair, and replace as necessary, the sidewalk in front of the Premises and the metal trapdoor leading to the basement of the Premises, if any. Tenant shall post warning signs and cones on all sides of any side door when in use and attach a safety bar across any such door at all times when open. </TheSidewalk>\n",
|
||||||
|
"<Display> In no event shall Tenant use, or permit to be used, the space adjacent to or any other space outside of the Premises, for display, sale or any other similar undertaking; except [1] in the event of a legal and licensed “street fair” type program or [<Number>2</Number>] if the local zoning, Community Board [if applicable] and other municipal laws, rules and regulations, allow for sidewalk café use and, if such I s the case, said operation shall be in strict accordance with all of the aforesaid requirements and conditions. . In no event shall Tenant use, or permit to be used, any advertising medium and/or loud speaker and/or sound amplifier and/or radio or television broadcast which may be heard outside of the Premises or which does not comply with the reasonable rules and regulations of Landlord which then will be in effect. </Display>\n",
|
||||||
|
"42C. Store Front Maintenance\n",
|
||||||
|
" <TheBulkheadAndSecurityGate> Tenant agrees to wash the storefront, including the bulkhead and security gate, from the top to the ground, monthly or more often as Landlord reasonably requests and make all repairs and replacements as and when deemed necessary by Landlord, to all windows and plate and ot her glass in or about the Premises and the security gate, if any. In case of any default by Tenant in maintaining the storefront as herein provided, Landlord may do so at its own expense and bill the cost thereof to Tenant as additional rent. </TheBulkheadAndSecurityGate>\n",
|
||||||
|
"42D. Music, Noise, and Vibration\n",
|
||||||
|
"4474c92ae7ccec9184ed2fef9f072734\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# Query retriever, should return parents (using MMR since that was set as search_type above)\n",
|
||||||
|
"retrieved_parent_docs = retriever.get_relevant_documents(\n",
|
||||||
|
" \"what signs does Birch Street allow on their property?\"\n",
|
||||||
|
")\n",
|
||||||
|
"for chunk in retrieved_parent_docs:\n",
|
||||||
|
" print(chunk.page_content)\n",
|
||||||
|
" print(chunk.metadata[\"id\"])"
|
||||||
|
]
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"metadata": {
|
"metadata": {
|
||||||
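The two notebook queries above return different things: the direct vector store search yields child chunks, while the retriever yields the parents those chunks point to. A minimal sketch of that lookup, assuming plain dicts in place of LangChain `Document` objects and a dict in place of the docstore. `parents_for_hits` is a hypothetical helper for illustration, not LangChain's implementation:

```python
from typing import Dict, List

# Assumption: mirrors the loader's default `parent_id_key` shown in this commit.
PARENT_ID_KEY = "doc_id"


def parents_for_hits(
    child_hits: List[dict],
    docstore: Dict[str, str],
    parent_id_key: str = PARENT_ID_KEY,
) -> List[str]:
    """Resolve child-chunk search hits to their parent texts.

    Parent IDs are collected in first-seen order and de-duplicated, so two
    child hits under the same parent return that parent only once.
    """
    parent_ids: List[str] = []
    for hit in child_hits:
        parent_id = hit["metadata"].get(parent_id_key)
        if parent_id is not None and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Skip IDs that are missing from the docstore instead of raising.
    return [docstore[pid] for pid in parent_ids if pid in docstore]


if __name__ == "__main__":
    hits = [
        {"page_content": "24. SIGNS.", "metadata": {"doc_id": "p1"}},
        {"page_content": "<TheExterior>...", "metadata": {"doc_id": "p2"}},
    ]
    store = {"p1": "parent text one", "p2": "parent text two"}
    for parent in parents_for_hits(hits, store):
        print(parent)
```

This is why the second notebook cell prints whole lease sections: each small hit is only used to find the ID of the larger chunk that contains it.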
|
@ -8,7 +8,7 @@
|
|||||||
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pip install lxml
|
pip install dgml-utils
|
||||||
```
|
```
|
||||||
|
|
||||||
## Document Loader
|
## Document Loader
|
||||||
|
@ -143,7 +143,7 @@
|
|||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '10e9cbc0-4ba5-4d79-a09b-c033d1ba7b01', 'source': '../../state_of_the_union.txt'})"
|
"Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '455205f7-bb7d-4c36-b442-d1d6f9f701ed', 'source': '../../state_of_the_union.txt'})"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 8,
|
"execution_count": 8,
|
||||||
@ -165,7 +165,7 @@
|
|||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": [
|
"text/plain": [
|
||||||
"9874"
|
"9875"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"execution_count": 9,
|
"execution_count": 9,
|
||||||
@ -178,6 +178,39 @@
|
|||||||
"len(retriever.get_relevant_documents(\"justice breyer\")[0].page_content)"
|
"len(retriever.get_relevant_documents(\"justice breyer\")[0].page_content)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "cdef8339-f9fa-4b3b-955f-ad9dbdf2734f",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"By default, the retriever runs a similarity search against the vector database. LangChain vector stores also support [Max Marginal Relevance](https://api.python.langchain.com/en/latest/schema/langchain.schema.vectorstore.VectorStore.html#langchain.schema.vectorstore.VectorStore.max_marginal_relevance_search) search; to use it instead, set the retriever's `search_type` property as follows:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 15,
|
||||||
|
"id": "36739460-a737-4a8e-b70f-50bf8c8eaae7",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"9875"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 15,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"from langchain.retrievers.multi_vector import SearchType\n",
|
||||||
|
"\n",
|
||||||
|
"retriever.search_type = SearchType.mmr\n",
|
||||||
|
"\n",
|
||||||
|
"len(retriever.get_relevant_documents(\"justice breyer\")[0].page_content)"
|
||||||
|
]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"id": "d6a7ae0d",
|
"id": "d6a7ae0d",
|
||||||
@ -576,7 +609,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.10.1"
|
"version": "3.9.16"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
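The notebook changes above switch the retriever to Max Marginal Relevance (MMR), which the commit message motivates by near-duplicate documents. A minimal sketch of the general MMR selection rule in pure Python, assuming cosine similarity and a `lambda_mult` trade-off weight. `cosine` and `mmr_select` are hypothetical names for illustration, and this is not LangChain's implementation:

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick up to k candidate indices.

    Each step scores a candidate as
        lambda_mult * relevance_to_query
        - (1 - lambda_mult) * max_similarity_to_already_selected
    so near-duplicates of earlier picks are penalized.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    # Candidates 0 and 1 are exact duplicates; candidate 2 is equally
    # relevant to the query but points in a different direction.
    candidates = [[2.0, 1.0], [2.0, 1.0], [2.0, -1.0]]
    print(mmr_select(query, candidates, k=2, lambda_mult=0.5))
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking, which is the behavior the duplicate-heavy document sets in this PR are meant to avoid.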
|
@ -1,7 +1,7 @@
|
|||||||
|
import hashlib
|
||||||
import io
|
import io
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
import re
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import Any, Dict, List, Mapping, Optional, Sequence, Union
|
from typing import Any, Dict, List, Mapping, Optional, Sequence, Union
|
||||||
|
|
||||||
@ -11,11 +11,10 @@ from langchain_core.pydantic_v1 import BaseModel, root_validator
|
|||||||
|
|
||||||
from langchain.document_loaders.base import BaseLoader
|
from langchain.document_loaders.base import BaseLoader
|
||||||
|
|
||||||
TD_NAME = "{http://www.w3.org/1999/xhtml}td"
|
|
||||||
TABLE_NAME = "{http://www.w3.org/1999/xhtml}table"
|
TABLE_NAME = "{http://www.w3.org/1999/xhtml}table"
|
||||||
|
|
||||||
XPATH_KEY = "xpath"
|
XPATH_KEY = "xpath"
|
||||||
DOCUMENT_ID_KEY = "id"
|
ID_KEY = "id"
|
||||||
DOCUMENT_SOURCE_KEY = "source"
|
DOCUMENT_SOURCE_KEY = "source"
|
||||||
DOCUMENT_NAME_KEY = "name"
|
DOCUMENT_NAME_KEY = "name"
|
||||||
STRUCTURE_KEY = "structure"
|
STRUCTURE_KEY = "structure"
|
||||||
@ -30,7 +29,7 @@ logger = logging.getLogger(__name__)
|
|||||||
class DocugamiLoader(BaseLoader, BaseModel):
|
class DocugamiLoader(BaseLoader, BaseModel):
|
||||||
"""Load from `Docugami`.
|
"""Load from `Docugami`.
|
||||||
|
|
||||||
To use, you should have the ``lxml`` python package installed.
|
To use, you should have the ``dgml-utils`` python package installed.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
api: str = DEFAULT_API_ENDPOINT
|
api: str = DEFAULT_API_ENDPOINT
|
||||||
@ -38,14 +37,43 @@ class DocugamiLoader(BaseLoader, BaseModel):
|
|||||||
|
|
||||||
access_token: Optional[str] = os.environ.get("DOCUGAMI_API_KEY")
|
access_token: Optional[str] = os.environ.get("DOCUGAMI_API_KEY")
|
||||||
"""The Docugami API access token to use."""
|
"""The Docugami API access token to use."""
|
||||||
|
|
||||||
|
max_text_length = 4096
|
||||||
|
"""Max length of chunk text returned."""
|
||||||
|
|
||||||
|
min_text_length: int = 32
|
||||||
|
"""Threshold under which chunks are appended to the next chunk to avoid over-chunking."""
|
||||||
|
|
||||||
|
max_metadata_length = 512
|
||||||
|
"""Max length of metadata text returned."""
|
||||||
|
|
||||||
|
include_xml_tags: bool = False
|
||||||
|
"""Set to True to include XML tags in chunk output text."""
|
||||||
|
|
||||||
|
parent_hierarchy_levels: int = 0
|
||||||
|
"""Set appropriately to get parent chunks using the chunk hierarchy."""
|
||||||
|
|
||||||
|
parent_id_key: str = "doc_id"
|
||||||
|
"""Metadata key for parent doc ID."""
|
||||||
|
|
||||||
|
sub_chunk_tables: bool = False
|
||||||
|
"""Set to True to return sub-chunks within tables."""
|
||||||
|
|
||||||
|
whitespace_normalize_text: bool = True
|
||||||
|
"""Set to False if you want the full whitespace formatting in the original
|
||||||
|
XML doc, including indentation."""
|
||||||
|
|
||||||
docset_id: Optional[str]
|
docset_id: Optional[str]
|
||||||
"""The Docugami API docset ID to use."""
|
"""The Docugami API docset ID to use."""
|
||||||
|
|
||||||
document_ids: Optional[Sequence[str]]
|
document_ids: Optional[Sequence[str]]
|
||||||
"""The Docugami API document IDs to use."""
|
"""The Docugami API document IDs to use."""
|
||||||
|
|
||||||
file_paths: Optional[Sequence[Union[Path, str]]]
|
file_paths: Optional[Sequence[Union[Path, str]]]
|
||||||
"""The local file paths to use."""
|
"""The local file paths to use."""
|
||||||
min_chunk_size: int = 32 # appended to the next chunk to avoid over-chunking
|
|
||||||
"""The minimum chunk size to use when parsing DGML. Defaults to 32."""
|
include_project_metadata_in_doc_metadata: bool = True
|
||||||
|
"""Set to True if you want to include the project metadata in the doc metadata."""
|
||||||
|
|
||||||
@root_validator
|
@root_validator
|
||||||
def validate_local_or_remote(cls, values: Dict[str, Any]) -> Dict[str, Any]:
|
def validate_local_or_remote(cls, values: Dict[str, Any]) -> Dict[str, Any]:
|
||||||
@ -69,7 +97,10 @@ class DocugamiLoader(BaseLoader, BaseModel):
|
|||||||
return values
|
return values
|
||||||
|
|
||||||
def _parse_dgml(
|
def _parse_dgml(
|
||||||
self, document: Mapping, content: bytes, doc_metadata: Optional[Mapping] = None
|
self,
|
||||||
|
content: bytes,
|
||||||
|
document_name: Optional[str] = None,
|
||||||
|
additional_doc_metadata: Optional[Mapping] = None,
|
||||||
) -> List[Document]:
|
) -> List[Document]:
|
||||||
"""Parse a single DGML document into a list of Documents."""
|
"""Parse a single DGML document into a list of Documents."""
|
||||||
try:
|
try:
|
||||||
@ -80,108 +111,65 @@ class DocugamiLoader(BaseLoader, BaseModel):
|
|||||||
"Please install it with `pip install lxml`."
|
"Please install it with `pip install lxml`."
|
||||||
)
|
)
|
||||||
|
|
||||||
# helpers
|
try:
|
||||||
def _xpath_qname_for_chunk(chunk: Any) -> str:
|
from dgml_utils.models import Chunk
|
||||||
"""Get the xpath qname for a chunk."""
|
from dgml_utils.segmentation import get_chunks
|
||||||
qname = f"{chunk.prefix}:{chunk.tag.split('}')[-1]}"
|
except ImportError:
|
||||||
|
raise ImportError(
|
||||||
parent = chunk.getparent()
|
"Could not import from dgml-utils python package. "
|
||||||
if parent is not None:
|
"Please install it with `pip install dgml-utils`."
|
||||||
doppelgangers = [x for x in parent if x.tag == chunk.tag]
|
|
||||||
if len(doppelgangers) > 1:
|
|
||||||
idx_of_self = doppelgangers.index(chunk)
|
|
||||||
qname = f"{qname}[{idx_of_self + 1}]"
|
|
||||||
|
|
||||||
return qname
|
|
||||||
|
|
||||||
def _xpath_for_chunk(chunk: Any) -> str:
|
|
||||||
"""Get the xpath for a chunk."""
|
|
||||||
ancestor_chain = chunk.xpath("ancestor-or-self::*")
|
|
||||||
return "/" + "/".join(_xpath_qname_for_chunk(x) for x in ancestor_chain)
|
|
||||||
|
|
||||||
def _structure_value(node: Any) -> str:
|
|
||||||
"""Get the structure value for a node."""
|
|
||||||
structure = (
|
|
||||||
"table"
|
|
||||||
if node.tag == TABLE_NAME
|
|
||||||
else node.attrib["structure"]
|
|
||||||
if "structure" in node.attrib
|
|
||||||
else None
|
|
||||||
)
|
)
|
||||||
return structure
|
|
||||||
|
|
||||||
def _is_structural(node: Any) -> bool:
|
def _build_framework_chunk(dg_chunk: Chunk) -> Document:
|
||||||
"""Check if a node is structural."""
|
# Stable IDs for chunks with the same text.
|
||||||
return _structure_value(node) is not None
|
_hashed_id = hashlib.md5(dg_chunk.text.encode()).hexdigest()
|
||||||
|
|
||||||
def _is_heading(node: Any) -> bool:
|
|
||||||
"""Check if a node is a heading."""
|
|
||||||
structure = _structure_value(node)
|
|
||||||
return structure is not None and structure.lower().startswith("h")
|
|
||||||
|
|
||||||
def _get_text(node: Any) -> str:
|
|
||||||
"""Get the text of a node."""
|
|
||||||
return " ".join(node.itertext()).strip()
|
|
||||||
|
|
||||||
def _has_structural_descendant(node: Any) -> bool:
|
|
||||||
"""Check if a node has a structural descendant."""
|
|
||||||
for child in node:
|
|
||||||
if _is_structural(child) or _has_structural_descendant(child):
|
|
||||||
return True
|
|
||||||
return False
|
|
||||||
|
|
||||||
def _leaf_structural_nodes(node: Any) -> List:
|
|
||||||
"""Get the leaf structural nodes of a node."""
|
|
||||||
if _is_structural(node) and not _has_structural_descendant(node):
|
|
||||||
return [node]
|
|
||||||
else:
|
|
||||||
leaf_nodes = []
|
|
||||||
for child in node:
|
|
||||||
leaf_nodes.extend(_leaf_structural_nodes(child))
|
|
||||||
return leaf_nodes
|
|
||||||
|
|
||||||
def _create_doc(node: Any, text: str) -> Document:
|
|
||||||
"""Create a Document from a node and text."""
|
|
||||||
metadata = {
|
metadata = {
|
||||||
XPATH_KEY: _xpath_for_chunk(node),
|
XPATH_KEY: dg_chunk.xpath,
|
||||||
DOCUMENT_ID_KEY: document[DOCUMENT_ID_KEY],
|
ID_KEY: _hashed_id,
|
||||||
DOCUMENT_NAME_KEY: document[DOCUMENT_NAME_KEY],
|
DOCUMENT_NAME_KEY: document_name,
|
||||||
DOCUMENT_SOURCE_KEY: document[DOCUMENT_NAME_KEY],
|
DOCUMENT_SOURCE_KEY: document_name,
|
||||||
STRUCTURE_KEY: node.attrib.get("structure", ""),
|
STRUCTURE_KEY: dg_chunk.structure,
|
||||||
TAG_KEY: re.sub(r"\{.*\}", "", node.tag),
|
TAG_KEY: dg_chunk.tag,
|
||||||
}
|
}
|
||||||
|
|
||||||
if doc_metadata:
|
text = dg_chunk.text
|
||||||
metadata.update(doc_metadata)
|
if additional_doc_metadata:
|
||||||
|
if self.include_project_metadata_in_doc_metadata:
|
||||||
|
metadata.update(additional_doc_metadata)
|
||||||
|
|
||||||
return Document(
|
return Document(
|
||||||
page_content=text,
|
page_content=text[: self.max_text_length],
|
||||||
metadata=metadata,
|
metadata=metadata,
|
||||||
)
|
)
|
||||||
|
|
||||||
# parse the tree and return chunks
|
# Parse the tree and return chunks
|
||||||
tree = etree.parse(io.BytesIO(content))
|
tree = etree.parse(io.BytesIO(content))
|
||||||
root = tree.getroot()
|
root = tree.getroot()
|
||||||
|
|
||||||
chunks: List[Document] = []
|
dg_chunks = get_chunks(
|
||||||
prev_small_chunk_text = None
|
root,
|
||||||
for node in _leaf_structural_nodes(root):
|
min_text_length=self.min_text_length,
|
||||||
text = _get_text(node)
|
max_text_length=self.max_text_length,
|
||||||
if prev_small_chunk_text:
|
whitespace_normalize_text=self.whitespace_normalize_text,
|
||||||
text = prev_small_chunk_text + " " + text
|
sub_chunk_tables=self.sub_chunk_tables,
|
||||||
prev_small_chunk_text = None
|
include_xml_tags=self.include_xml_tags,
|
||||||
|
parent_hierarchy_levels=self.parent_hierarchy_levels,
|
||||||
|
)
|
||||||
|
|
||||||
if _is_heading(node) or len(text) < self.min_chunk_size:
|
framework_chunks: Dict[str, Document] = {}
|
||||||
# Save headings or other small chunks to be appended to the next chunk
|
for dg_chunk in dg_chunks:
|
||||||
prev_small_chunk_text = text
|
framework_chunk = _build_framework_chunk(dg_chunk)
|
||||||
else:
|
chunk_id = framework_chunk.metadata.get(ID_KEY)
|
||||||
chunks.append(_create_doc(node, text))
|
if chunk_id:
|
||||||
|
framework_chunks[chunk_id] = framework_chunk
|
||||||
|
if dg_chunk.parent:
|
||||||
|
framework_parent_chunk = _build_framework_chunk(dg_chunk.parent)
|
||||||
|
parent_id = framework_parent_chunk.metadata.get(ID_KEY)
|
||||||
|
if parent_id and framework_parent_chunk.page_content:
|
||||||
|
framework_chunk.metadata[self.parent_id_key] = parent_id
|
||||||
|
framework_chunks[parent_id] = framework_parent_chunk
|
||||||
|
|
||||||
if prev_small_chunk_text and len(chunks) > 0:
|
return list(framework_chunks.values())
|
||||||
# small chunk at the end left over, just append to last chunk
|
|
||||||
chunks[-1].page_content += " " + prev_small_chunk_text
|
|
||||||
|
|
||||||
return chunks
|
|
||||||
|
|
||||||
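The rewritten `_parse_dgml` above derives each chunk ID from an MD5 hash of the chunk text and records the parent's ID under `parent_id_key`. A minimal sketch of that indexing pattern, assuming plain dicts in place of `Document` objects and `(text, parent_text)` tuples in place of `dgml-utils` chunks. `stable_id` and `index_chunks` are hypothetical names, not part of the loader:

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def stable_id(text: str) -> str:
    """Same text always yields the same ID, across runs and processes."""
    return hashlib.md5(text.encode()).hexdigest()


def index_chunks(
    pairs: List[Tuple[str, Optional[str]]],
    parent_id_key: str = "doc_id",
) -> Dict[str, dict]:
    """Index chunks and their parents by hashed ID.

    `pairs` holds (chunk_text, parent_text_or_None). Children that share a
    parent all store the same parent ID, and the parent is kept only once
    because the dict is keyed by that ID.
    """
    by_id: Dict[str, dict] = {}
    for text, parent_text in pairs:
        chunk = {"page_content": text, "metadata": {"id": stable_id(text)}}
        by_id[chunk["metadata"]["id"]] = chunk
        if parent_text:
            parent_id = stable_id(parent_text)
            chunk["metadata"][parent_id_key] = parent_id
            by_id[parent_id] = {
                "page_content": parent_text,
                "metadata": {"id": parent_id},
            }
    return by_id


if __name__ == "__main__":
    section = "24. SIGNS. No signage shall be placed by Tenant."
    index = index_chunks([
        ("24. SIGNS.", section),
        ("No signage shall be placed by Tenant.", section),
    ])
    print(len(index))  # two children plus one shared parent
```

One consequence of hashing the text is that two chunks with identical text collapse into a single entry, which may or may not be desirable depending on the document set.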
def _document_details_for_docset_id(self, docset_id: str) -> List[Dict]:
|
def _document_details_for_docset_id(self, docset_id: str) -> List[Dict]:
|
||||||
"""Gets all document details for the given docset ID"""
|
"""Gets all document details for the given docset ID"""
|
||||||
@ -229,11 +217,12 @@ class DocugamiLoader(BaseLoader, BaseModel):
|
|||||||
|
|
||||||
def _metadata_for_project(self, project: Dict) -> Dict:
|
def _metadata_for_project(self, project: Dict) -> Dict:
|
||||||
"""Gets project metadata for all files"""
|
"""Gets project metadata for all files"""
|
||||||
project_id = project.get("id")
|
project_id = project.get(ID_KEY)
|
||||||
|
|
||||||
url = f"{self.api}/projects/{project_id}/artifacts/latest"
|
url = f"{self.api}/projects/{project_id}/artifacts/latest"
|
||||||
all_artifacts = []
|
all_artifacts = []
|
||||||
|
|
||||||
|
per_file_metadata: Dict = {}
|
||||||
while url:
|
while url:
|
||||||
response = requests.request(
|
response = requests.request(
|
||||||
"GET",
|
"GET",
|
||||||
@@ -245,22 +234,24 @@ class DocugamiLoader(BaseLoader, BaseModel):
                 data = response.json()
                 all_artifacts.extend(data["artifacts"])
                 url = data.get("next", None)
+            elif response.status_code == 404:
+                # Not found is ok, just means no published projects
+                return per_file_metadata
             else:
                 raise Exception(
                     f"Failed to download {url} (status: {response.status_code})"
                 )

-        per_file_metadata = {}
         for artifact in all_artifacts:
             artifact_name = artifact.get("name")
             artifact_url = artifact.get("url")
             artifact_doc = artifact.get("document")

             if artifact_name == "report-values.xml" and artifact_url and artifact_doc:
-                doc_id = artifact_doc["id"]
+                doc_id = artifact_doc[ID_KEY]
                 metadata: Dict = {}

-                # the evaluated XML for each document is named after the project
+                # The evaluated XML for each document is named after the project
                 response = requests.request(
                     "GET",
                     f"{artifact_url}/content",
@@ -285,7 +276,7 @@ class DocugamiLoader(BaseLoader, BaseModel):
                         value = " ".join(
                             entry.xpath("./pr:Value", namespaces=ns)[0].itertext()
                         ).strip()
-                        metadata[heading] = value
+                        metadata[heading] = value[: self.max_metadata_length]
                     per_file_metadata[doc_id] = metadata
                 else:
                     raise Exception(
@@ -296,10 +287,13 @@ class DocugamiLoader(BaseLoader, BaseModel):
         return per_file_metadata

     def _load_chunks_for_document(
-        self, docset_id: str, document: Dict, doc_metadata: Optional[Dict] = None
+        self,
+        document_id: str,
+        docset_id: str,
+        document_name: Optional[str] = None,
+        additional_metadata: Optional[Mapping] = None,
     ) -> List[Document]:
         """Load chunks for a document."""
-        document_id = document["id"]
         url = f"{self.api}/docsets/{docset_id}/documents/{document_id}/dgml"

         response = requests.request(
@@ -310,7 +304,11 @@ class DocugamiLoader(BaseLoader, BaseModel):
         )

         if response.ok:
-            return self._parse_dgml(document, response.content, doc_metadata)
+            return self._parse_dgml(
+                content=response.content,
+                document_name=document_name,
+                additional_doc_metadata=additional_metadata,
+            )
         else:
             raise Exception(
                 f"Failed to download {url} (status: {response.status_code})"
@@ -321,37 +319,44 @@ class DocugamiLoader(BaseLoader, BaseModel):
         chunks: List[Document] = []

         if self.access_token and self.docset_id:
-            # remote mode
+            # Remote mode
             _document_details = self._document_details_for_docset_id(self.docset_id)
             if self.document_ids:
                 _document_details = [
-                    d for d in _document_details if d["id"] in self.document_ids
+                    d for d in _document_details if d[ID_KEY] in self.document_ids
                 ]

             _project_details = self._project_details_for_docset_id(self.docset_id)
-            combined_project_metadata = {}
-            if _project_details:
-                # if there are any projects for this docset, load project metadata
+            combined_project_metadata: Dict[str, Dict] = {}
+            if _project_details and self.include_project_metadata_in_doc_metadata:
+                # If there are any projects for this docset and the caller requested
+                # project metadata, load it.
                 for project in _project_details:
                     metadata = self._metadata_for_project(project)
-                    combined_project_metadata.update(metadata)
+                    for file_id in metadata:
+                        if file_id not in combined_project_metadata:
+                            combined_project_metadata[file_id] = metadata[file_id]
+                        else:
+                            combined_project_metadata[file_id].update(metadata[file_id])

             for doc in _document_details:
-                doc_metadata = combined_project_metadata.get(doc["id"])
+                doc_id = doc[ID_KEY]
+                doc_name = doc.get(DOCUMENT_NAME_KEY)
+                doc_metadata = combined_project_metadata.get(doc_id)
                 chunks += self._load_chunks_for_document(
-                    self.docset_id, doc, doc_metadata
+                    document_id=doc_id,
+                    docset_id=self.docset_id,
+                    document_name=doc_name,
+                    additional_metadata=doc_metadata,
                 )
         elif self.file_paths:
-            # local mode (for integration testing, or pre-downloaded XML)
+            # Local mode (for integration testing, or pre-downloaded XML)
             for path in self.file_paths:
                 path = Path(path)
                 with open(path, "rb") as file:
                     chunks += self._parse_dgml(
-                        {
-                            DOCUMENT_ID_KEY: path.name,
-                            DOCUMENT_NAME_KEY: path.name,
-                        },
-                        file.read(),
+                        content=file.read(),
+                        document_name=path.name,
                     )

         return chunks
|
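Review note on the metadata changes above: the old `combined_project_metadata.update(metadata)` replaced a file's whole metadata dict whenever a second project mentioned the same file, while the new loop merges field by field. The sketch below illustrates that difference, together with the `value[: self.max_metadata_length]` truncation. It is a standalone illustration, not the loader's code: the helper names `merge_project_metadata` and `truncate_values` and the sample data are hypothetical, and unlike the diff it copies each inner dict before merging.

```python
from typing import Dict, List


def truncate_values(metadata: Dict[str, str], limit: int) -> Dict[str, str]:
    """Cap every metadata value at `limit` characters (same slicing as value[:limit])."""
    return {heading: value[:limit] for heading, value in metadata.items()}


def merge_project_metadata(
    per_project: List[Dict[str, Dict[str, str]]]
) -> Dict[str, Dict[str, str]]:
    """Merge {file_id: {heading: value}} maps from several projects, field by field."""
    combined: Dict[str, Dict[str, str]] = {}
    for metadata in per_project:
        for file_id, fields in metadata.items():
            if file_id not in combined:
                # Copy (an addition in this sketch) so the input dicts are not mutated
                combined[file_id] = dict(fields)
            else:
                combined[file_id].update(fields)
    return combined


# Hypothetical data: two projects that both report on file "doc1"
project_a = {"doc1": {"Landlord": "Acme"}, "doc2": {"Term": "5 years"}}
project_b = {"doc1": {"Tenant": "Bolt"}}

merged = merge_project_metadata([project_a, project_b])

# The previous behaviour: a top-level update drops project_a's fields for doc1
naive: Dict[str, Dict[str, str]] = {}
for m in (project_a, project_b):
    naive.update(m)

print(merged["doc1"])  # both Landlord and Tenant survive
print(naive["doc1"])   # only Tenant survives
print(truncate_values({"Summary": "x" * 10}, 4))
```

With the field-level merge, a document that belongs to several projects keeps the metadata from all of them; later projects only win when they define the same heading.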
@@ -1,3 +1,4 @@
+from enum import Enum
 from typing import List

 from langchain_core.documents import Document
@@ -9,6 +10,15 @@ from langchain_core.vectorstores import VectorStore
 from langchain.callbacks.manager import CallbackManagerForRetrieverRun


+class SearchType(str, Enum):
+    """Enumerator of the types of search to perform."""
+
+    similarity = "similarity"
+    """Similarity search."""
+    mmr = "mmr"
+    """Maximal Marginal Relevance reranking of similarity search."""
+
+
 class MultiVectorRetriever(BaseRetriever):
     """Retrieve from a set of multiple embeddings for the same document."""

@@ -20,6 +30,8 @@ class MultiVectorRetriever(BaseRetriever):
     id_key: str = "doc_id"
     search_kwargs: dict = Field(default_factory=dict)
     """Keyword arguments to pass to the search function."""
+    search_type: SearchType = SearchType.similarity
+    """Type of search to perform (similarity / mmr)"""

     def _get_relevant_documents(
         self, query: str, *, run_manager: CallbackManagerForRetrieverRun
@@ -31,7 +43,13 @@ class MultiVectorRetriever(BaseRetriever):
         Returns:
             List of relevant documents
         """
-        sub_docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
+        if self.search_type == SearchType.mmr:
+            sub_docs = self.vectorstore.max_marginal_relevance_search(
+                query, **self.search_kwargs
+            )
+        else:
+            sub_docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
+
         # We do this to maintain the order of the ids that are returned
         ids = []
         for d in sub_docs:
|
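Review note on the retriever change above: the new `search_type` field only selects which vector store method receives the query, and `search_kwargs` are forwarded unchanged in both branches. The sketch below reproduces that dispatch in isolation so it runs without LangChain installed. `FakeVectorStore` and the free function `search` are hypothetical stand-ins; only the `SearchType` members and the two method names come from the diff.

```python
from enum import Enum
from typing import Any, List


class SearchType(str, Enum):
    """Same two members as the enum added in the diff."""

    similarity = "similarity"
    mmr = "mmr"


class FakeVectorStore:
    """Hypothetical stand-in for a vector store; tags results with the method used."""

    def similarity_search(self, query: str, **kwargs: Any) -> List[str]:
        return [f"similarity:{query}"]

    def max_marginal_relevance_search(self, query: str, **kwargs: Any) -> List[str]:
        return [f"mmr:{query}"]


def search(
    store: FakeVectorStore,
    query: str,
    search_type: SearchType = SearchType.similarity,
    **search_kwargs: Any,
) -> List[str]:
    """Pick the search method by search_type; anything but mmr means similarity."""
    if search_type == SearchType.mmr:
        return store.max_marginal_relevance_search(query, **search_kwargs)
    return store.similarity_search(query, **search_kwargs)


store = FakeVectorStore()
default_hits = search(store, "lease term")
mmr_hits = search(store, "lease term", SearchType.mmr, k=4)

print(default_hits)
print(mmr_hits)
# The str mixin lets a plain string be converted to the matching member
print(SearchType("mmr") is SearchType.mmr)
```

Because the default is `SearchType.similarity`, existing callers keep their current behaviour; MMR has to be requested explicitly, for example by passing `search_type=SearchType.mmr` when constructing the retriever.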
44
libs/langchain/poetry.lock
generated
@@ -1973,6 +1973,21 @@ files = [
 [package.dependencies]
 packaging = "*"

+[[package]]
+name = "dgml-utils"
+version = "0.3.0"
+description = "Python utilities to work with the Docugami Markup Language (DGML) format."
+optional = true
+python-versions = ">=3.8.1,<4.0"
+files = [
+    {file = "dgml_utils-0.3.0-py3-none-any.whl", hash = "sha256:0cb8f6fd7f5fa31919343266260c166aa53009b42a11a172e808fc707e1ac5ba"},
+    {file = "dgml_utils-0.3.0.tar.gz", hash = "sha256:02722e899122caedfb1e90d0be557c7e6dddf86f7f4c19d7888212efde9f78c9"},
+]
+
+[package.dependencies]
+lxml = ">=4.9.3,<5.0.0"
+tabulate = ">=0.9.0,<0.10.0"
+
 [[package]]
 name = "dill"
 version = "0.3.7"
@ -2952,7 +2967,7 @@ files = [
|
|||||||
{file = "greenlet-3.0.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:0b72b802496cccbd9b31acea72b6f87e7771ccfd7f7927437d592e5c92ed703c"},
|
{file = "greenlet-3.0.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:0b72b802496cccbd9b31acea72b6f87e7771ccfd7f7927437d592e5c92ed703c"},
|
||||||
{file = "greenlet-3.0.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:527cd90ba3d8d7ae7dceb06fda619895768a46a1b4e423bdb24c1969823b8362"},
|
{file = "greenlet-3.0.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:527cd90ba3d8d7ae7dceb06fda619895768a46a1b4e423bdb24c1969823b8362"},
|
||||||
{file = "greenlet-3.0.0-cp311-cp311-win_amd64.whl", hash = "sha256:37f60b3a42d8b5499be910d1267b24355c495064f271cfe74bf28b17b099133c"},
|
{file = "greenlet-3.0.0-cp311-cp311-win_amd64.whl", hash = "sha256:37f60b3a42d8b5499be910d1267b24355c495064f271cfe74bf28b17b099133c"},
|
||||||
{file = "greenlet-3.0.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:1482fba7fbed96ea7842b5a7fc11d61727e8be75a077e603e8ab49d24e234383"},
|
{file = "greenlet-3.0.0-cp311-universal2-macosx_10_9_universal2.whl", hash = "sha256:c3692ecf3fe754c8c0f2c95ff19626584459eab110eaab66413b1e7425cd84e9"},
|
||||||
{file = "greenlet-3.0.0-cp312-cp312-macosx_13_0_arm64.whl", hash = "sha256:be557119bf467d37a8099d91fbf11b2de5eb1fd5fc5b91598407574848dc910f"},
|
{file = "greenlet-3.0.0-cp312-cp312-macosx_13_0_arm64.whl", hash = "sha256:be557119bf467d37a8099d91fbf11b2de5eb1fd5fc5b91598407574848dc910f"},
|
||||||
{file = "greenlet-3.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:73b2f1922a39d5d59cc0e597987300df3396b148a9bd10b76a058a2f2772fc04"},
|
{file = "greenlet-3.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:73b2f1922a39d5d59cc0e597987300df3396b148a9bd10b76a058a2f2772fc04"},
|
||||||
{file = "greenlet-3.0.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d1e22c22f7826096ad503e9bb681b05b8c1f5a8138469b255eb91f26a76634f2"},
|
{file = "greenlet-3.0.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d1e22c22f7826096ad503e9bb681b05b8c1f5a8138469b255eb91f26a76634f2"},
|
||||||
@ -2962,6 +2977,7 @@ files = [
|
|||||||
{file = "greenlet-3.0.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:952256c2bc5b4ee8df8dfc54fc4de330970bf5d79253c863fb5e6761f00dda35"},
|
{file = "greenlet-3.0.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:952256c2bc5b4ee8df8dfc54fc4de330970bf5d79253c863fb5e6761f00dda35"},
|
||||||
{file = "greenlet-3.0.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:269d06fa0f9624455ce08ae0179430eea61085e3cf6457f05982b37fd2cefe17"},
|
{file = "greenlet-3.0.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:269d06fa0f9624455ce08ae0179430eea61085e3cf6457f05982b37fd2cefe17"},
|
||||||
{file = "greenlet-3.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:9adbd8ecf097e34ada8efde9b6fec4dd2a903b1e98037adf72d12993a1c80b51"},
|
{file = "greenlet-3.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:9adbd8ecf097e34ada8efde9b6fec4dd2a903b1e98037adf72d12993a1c80b51"},
|
||||||
|
{file = "greenlet-3.0.0-cp312-universal2-macosx_10_9_universal2.whl", hash = "sha256:553d6fb2324e7f4f0899e5ad2c427a4579ed4873f42124beba763f16032959af"},
|
||||||
{file = "greenlet-3.0.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c6b5ce7f40f0e2f8b88c28e6691ca6806814157ff05e794cdd161be928550f4c"},
|
{file = "greenlet-3.0.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c6b5ce7f40f0e2f8b88c28e6691ca6806814157ff05e794cdd161be928550f4c"},
|
||||||
{file = "greenlet-3.0.0-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:ecf94aa539e97a8411b5ea52fc6ccd8371be9550c4041011a091eb8b3ca1d810"},
|
{file = "greenlet-3.0.0-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:ecf94aa539e97a8411b5ea52fc6ccd8371be9550c4041011a091eb8b3ca1d810"},
|
||||||
{file = "greenlet-3.0.0-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:80dcd3c938cbcac986c5c92779db8e8ce51a89a849c135172c88ecbdc8c056b7"},
|
{file = "greenlet-3.0.0-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:80dcd3c938cbcac986c5c92779db8e8ce51a89a849c135172c88ecbdc8c056b7"},
|
||||||
@ -4651,16 +4667,6 @@ files = [
|
|||||||
{file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:5bbe06f8eeafd38e5d0a4894ffec89378b6c6a625ff57e3028921f8ff59318ac"},
|
{file = "MarkupSafe-2.1.3-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:5bbe06f8eeafd38e5d0a4894ffec89378b6c6a625ff57e3028921f8ff59318ac"},
|
||||||
{file = "MarkupSafe-2.1.3-cp311-cp311-win32.whl", hash = "sha256:dd15ff04ffd7e05ffcb7fe79f1b98041b8ea30ae9234aed2a9168b5797c3effb"},
|
{file = "MarkupSafe-2.1.3-cp311-cp311-win32.whl", hash = "sha256:dd15ff04ffd7e05ffcb7fe79f1b98041b8ea30ae9234aed2a9168b5797c3effb"},
|
||||||
{file = "MarkupSafe-2.1.3-cp311-cp311-win_amd64.whl", hash = "sha256:134da1eca9ec0ae528110ccc9e48041e0828d79f24121a1a146161103c76e686"},
|
{file = "MarkupSafe-2.1.3-cp311-cp311-win_amd64.whl", hash = "sha256:134da1eca9ec0ae528110ccc9e48041e0828d79f24121a1a146161103c76e686"},
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:f698de3fd0c4e6972b92290a45bd9b1536bffe8c6759c62471efaa8acb4c37bc"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:aa57bd9cf8ae831a362185ee444e15a93ecb2e344c8e52e4d721ea3ab6ef1823"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ffcc3f7c66b5f5b7931a5aa68fc9cecc51e685ef90282f4a82f0f5e9b704ad11"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:47d4f1c5f80fc62fdd7777d0d40a2e9dda0a05883ab11374334f6c4de38adffd"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1f67c7038d560d92149c060157d623c542173016c4babc0c1913cca0564b9939"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:9aad3c1755095ce347e26488214ef77e0485a3c34a50c5a5e2471dff60b9dd9c"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:14ff806850827afd6b07a5f32bd917fb7f45b046ba40c57abdb636674a8b559c"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8f9293864fe09b8149f0cc42ce56e3f0e54de883a9de90cd427f191c346eb2e1"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-win32.whl", hash = "sha256:715d3562f79d540f251b99ebd6d8baa547118974341db04f5ad06d5ea3eb8007"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp312-cp312-win_amd64.whl", hash = "sha256:1b8dd8c3fd14349433c79fa8abeb573a55fc0fdd769133baac1f5e07abf54aeb"},
|
|
||||||
{file = "MarkupSafe-2.1.3-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:8e254ae696c88d98da6555f5ace2279cf7cd5b3f52be2b5cf97feafe883b58d2"},
|
{file = "MarkupSafe-2.1.3-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:8e254ae696c88d98da6555f5ace2279cf7cd5b3f52be2b5cf97feafe883b58d2"},
|
||||||
{file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cb0932dc158471523c9637e807d9bfb93e06a95cbf010f1a38b98623b929ef2b"},
|
{file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cb0932dc158471523c9637e807d9bfb93e06a95cbf010f1a38b98623b929ef2b"},
|
||||||
{file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9402b03f1a1b4dc4c19845e5c749e3ab82d5078d16a2a4c2cd2df62d57bb0707"},
|
{file = "MarkupSafe-2.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9402b03f1a1b4dc4c19845e5c749e3ab82d5078d16a2a4c2cd2df62d57bb0707"},
|
||||||
@ -7797,7 +7803,6 @@ files = [
|
|||||||
{file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:69b023b2b4daa7548bcfbd4aa3da05b3a74b772db9e23b982788168117739938"},
|
{file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:69b023b2b4daa7548bcfbd4aa3da05b3a74b772db9e23b982788168117739938"},
|
||||||
{file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:81e0b275a9ecc9c0c0c07b4b90ba548307583c125f54d5b6946cfee6360c733d"},
|
{file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:81e0b275a9ecc9c0c0c07b4b90ba548307583c125f54d5b6946cfee6360c733d"},
|
||||||
{file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ba336e390cd8e4d1739f42dfe9bb83a3cc2e80f567d8805e11b46f4a943f5515"},
|
{file = "PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ba336e390cd8e4d1739f42dfe9bb83a3cc2e80f567d8805e11b46f4a943f5515"},
|
||||||
{file = "PyYAML-6.0.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:326c013efe8048858a6d312ddd31d56e468118ad4cdeda36c719bf5bb6192290"},
|
|
||||||
{file = "PyYAML-6.0.1-cp310-cp310-win32.whl", hash = "sha256:bd4af7373a854424dabd882decdc5579653d7868b8fb26dc7d0e99f823aa5924"},
|
{file = "PyYAML-6.0.1-cp310-cp310-win32.whl", hash = "sha256:bd4af7373a854424dabd882decdc5579653d7868b8fb26dc7d0e99f823aa5924"},
|
||||||
{file = "PyYAML-6.0.1-cp310-cp310-win_amd64.whl", hash = "sha256:fd1592b3fdf65fff2ad0004b5e363300ef59ced41c2e6b3a99d4089fa8c5435d"},
|
{file = "PyYAML-6.0.1-cp310-cp310-win_amd64.whl", hash = "sha256:fd1592b3fdf65fff2ad0004b5e363300ef59ced41c2e6b3a99d4089fa8c5435d"},
|
||||||
{file = "PyYAML-6.0.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:6965a7bc3cf88e5a1c3bd2e0b5c22f8d677dc88a455344035f03399034eb3007"},
|
{file = "PyYAML-6.0.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:6965a7bc3cf88e5a1c3bd2e0b5c22f8d677dc88a455344035f03399034eb3007"},
|
||||||
@ -7805,15 +7810,8 @@ files = [
|
|||||||
{file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:42f8152b8dbc4fe7d96729ec2b99c7097d656dc1213a3229ca5383f973a5ed6d"},
|
{file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:42f8152b8dbc4fe7d96729ec2b99c7097d656dc1213a3229ca5383f973a5ed6d"},
|
||||||
{file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:062582fca9fabdd2c8b54a3ef1c978d786e0f6b3a1510e0ac93ef59e0ddae2bc"},
|
{file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:062582fca9fabdd2c8b54a3ef1c978d786e0f6b3a1510e0ac93ef59e0ddae2bc"},
|
||||||
{file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d2b04aac4d386b172d5b9692e2d2da8de7bfb6c387fa4f801fbf6fb2e6ba4673"},
|
{file = "PyYAML-6.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d2b04aac4d386b172d5b9692e2d2da8de7bfb6c387fa4f801fbf6fb2e6ba4673"},
|
||||||
{file = "PyYAML-6.0.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:e7d73685e87afe9f3b36c799222440d6cf362062f78be1013661b00c5c6f678b"},
|
|
||||||
{file = "PyYAML-6.0.1-cp311-cp311-win32.whl", hash = "sha256:1635fd110e8d85d55237ab316b5b011de701ea0f29d07611174a1b42f1444741"},
|
{file = "PyYAML-6.0.1-cp311-cp311-win32.whl", hash = "sha256:1635fd110e8d85d55237ab316b5b011de701ea0f29d07611174a1b42f1444741"},
|
||||||
{file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
|
{file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
|
||||||
{file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
|
|
||||||
{file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
|
|
||||||
{file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"},
|
|
||||||
{file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"},
|
|
||||||
{file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"},
|
|
||||||
{file = "PyYAML-6.0.1-cp312-cp312-win_amd64.whl", hash = "sha256:0d3304d8c0adc42be59c5f8a4d9e3d7379e6955ad754aa9d6ab7a398b59dd1df"},
|
|
||||||
{file = "PyYAML-6.0.1-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:50550eb667afee136e9a77d6dc71ae76a44df8b3e51e41b77f6de2932bfe0f47"},
|
{file = "PyYAML-6.0.1-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:50550eb667afee136e9a77d6dc71ae76a44df8b3e51e41b77f6de2932bfe0f47"},
|
||||||
{file = "PyYAML-6.0.1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1fe35611261b29bd1de0070f0b2f47cb6ff71fa6595c077e42bd0c419fa27b98"},
|
{file = "PyYAML-6.0.1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1fe35611261b29bd1de0070f0b2f47cb6ff71fa6595c077e42bd0c419fa27b98"},
|
||||||
{file = "PyYAML-6.0.1-cp36-cp36m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:704219a11b772aea0d8ecd7058d0082713c3562b4e271b849ad7dc4a5c90c13c"},
|
{file = "PyYAML-6.0.1-cp36-cp36m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:704219a11b772aea0d8ecd7058d0082713c3562b4e271b849ad7dc4a5c90c13c"},
|
||||||
@ -7830,7 +7828,6 @@ files = [
|
|||||||
{file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a0cd17c15d3bb3fa06978b4e8958dcdc6e0174ccea823003a106c7d4d7899ac5"},
|
{file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a0cd17c15d3bb3fa06978b4e8958dcdc6e0174ccea823003a106c7d4d7899ac5"},
|
||||||
{file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:28c119d996beec18c05208a8bd78cbe4007878c6dd15091efb73a30e90539696"},
|
{file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:28c119d996beec18c05208a8bd78cbe4007878c6dd15091efb73a30e90539696"},
|
||||||
{file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7e07cbde391ba96ab58e532ff4803f79c4129397514e1413a7dc761ccd755735"},
|
{file = "PyYAML-6.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7e07cbde391ba96ab58e532ff4803f79c4129397514e1413a7dc761ccd755735"},
|
||||||
{file = "PyYAML-6.0.1-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:49a183be227561de579b4a36efbb21b3eab9651dd81b1858589f796549873dd6"},
|
|
||||||
{file = "PyYAML-6.0.1-cp38-cp38-win32.whl", hash = "sha256:184c5108a2aca3c5b3d3bf9395d50893a7ab82a38004c8f61c258d4428e80206"},
|
{file = "PyYAML-6.0.1-cp38-cp38-win32.whl", hash = "sha256:184c5108a2aca3c5b3d3bf9395d50893a7ab82a38004c8f61c258d4428e80206"},
|
||||||
{file = "PyYAML-6.0.1-cp38-cp38-win_amd64.whl", hash = "sha256:1e2722cc9fbb45d9b87631ac70924c11d3a401b2d7f410cc0e3bbf249f2dca62"},
|
{file = "PyYAML-6.0.1-cp38-cp38-win_amd64.whl", hash = "sha256:1e2722cc9fbb45d9b87631ac70924c11d3a401b2d7f410cc0e3bbf249f2dca62"},
|
||||||
{file = "PyYAML-6.0.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:9eb6caa9a297fc2c2fb8862bc5370d0303ddba53ba97e71f08023b6cd73d16a8"},
|
{file = "PyYAML-6.0.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:9eb6caa9a297fc2c2fb8862bc5370d0303ddba53ba97e71f08023b6cd73d16a8"},
|
||||||
@ -7838,7 +7835,6 @@ files = [
|
|||||||
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5773183b6446b2c99bb77e77595dd486303b4faab2b086e7b17bc6bef28865f6"},
|
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5773183b6446b2c99bb77e77595dd486303b4faab2b086e7b17bc6bef28865f6"},
|
||||||
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:b786eecbdf8499b9ca1d697215862083bd6d2a99965554781d0d8d1ad31e13a0"},
|
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:b786eecbdf8499b9ca1d697215862083bd6d2a99965554781d0d8d1ad31e13a0"},
|
||||||
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bc1bf2925a1ecd43da378f4db9e4f799775d6367bdb94671027b73b393a7c42c"},
|
{file = "PyYAML-6.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bc1bf2925a1ecd43da378f4db9e4f799775d6367bdb94671027b73b393a7c42c"},
|
||||||
{file = "PyYAML-6.0.1-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:04ac92ad1925b2cff1db0cfebffb6ffc43457495c9b3c39d3fcae417d7125dc5"},
|
|
||||||
{file = "PyYAML-6.0.1-cp39-cp39-win32.whl", hash = "sha256:faca3bdcf85b2fc05d06ff3fbc1f83e1391b3e724afa3feba7d13eeab355484c"},
|
{file = "PyYAML-6.0.1-cp39-cp39-win32.whl", hash = "sha256:faca3bdcf85b2fc05d06ff3fbc1f83e1391b3e724afa3feba7d13eeab355484c"},
|
||||||
{file = "PyYAML-6.0.1-cp39-cp39-win_amd64.whl", hash = "sha256:510c9deebc5c0225e8c96813043e62b680ba2f9c50a08d3724c7f28a747d1486"},
|
{file = "PyYAML-6.0.1-cp39-cp39-win_amd64.whl", hash = "sha256:510c9deebc5c0225e8c96813043e62b680ba2f9c50a08d3724c7f28a747d1486"},
|
||||||
{file = "PyYAML-6.0.1.tar.gz", hash = "sha256:bfdf460b1736c775f2ba9f6a92bca30bc2095067b8a9d77876d1fad6cc3b4a43"},
|
{file = "PyYAML-6.0.1.tar.gz", hash = "sha256:bfdf460b1736c775f2ba9f6a92bca30bc2095067b8a9d77876d1fad6cc3b4a43"},
|
||||||
@ -11179,14 +11175,14 @@ cffi = {version = ">=1.11", markers = "platform_python_implementation == \"PyPy\
|
|||||||
cffi = ["cffi (>=1.11)"]
|
cffi = ["cffi (>=1.11)"]
|
||||||
|
|
||||||
[extras]
|
[extras]
|
||||||
all = ["O365", "aleph-alpha-client", "amadeus", "arxiv", "atlassian-python-api", "awadb", "azure-ai-formrecognizer", "azure-ai-textanalytics", "azure-ai-vision", "azure-cognitiveservices-speech", "azure-cosmos", "azure-identity", "beautifulsoup4", "clarifai", "clickhouse-connect", "cohere", "deeplake", "docarray", "duckduckgo-search", "elasticsearch", "esprima", "faiss-cpu", "google-api-python-client", "google-auth", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jinja2", "jq", "lancedb", "langkit", "lark", "librosa", "lxml", "manifest-ml", "marqo", "momento", "nebula3-python", "neo4j", "networkx", "nlpcloud", "nltk", "nomic", "openai", "openlm", "opensearch-py", "pdfminer-six", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pymongo", "pyowm", "pypdf", "pytesseract", "python-arango", "pyvespa", "qdrant-client", "rdflib", "redis", "requests-toolbelt", "sentence-transformers", "singlestoredb", "tensorflow-text", "tigrisdb", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
|
all = ["O365", "aleph-alpha-client", "amadeus", "arxiv", "atlassian-python-api", "awadb", "azure-ai-formrecognizer", "azure-ai-textanalytics", "azure-ai-vision", "azure-cognitiveservices-speech", "azure-cosmos", "azure-identity", "beautifulsoup4", "clarifai", "clickhouse-connect", "cohere", "deeplake", "dgml-utils", "docarray", "duckduckgo-search", "elasticsearch", "esprima", "faiss-cpu", "google-api-python-client", "google-auth", "google-search-results", "gptcache", "html2text", "huggingface_hub", "jinja2", "jq", "lancedb", "langkit", "lark", "librosa", "lxml", "manifest-ml", "marqo", "momento", "nebula3-python", "neo4j", "networkx", "nlpcloud", "nltk", "nomic", "openai", "openlm", "opensearch-py", "pdfminer-six", "pexpect", "pgvector", "pinecone-client", "pinecone-text", "psycopg2-binary", "pymongo", "pyowm", "pypdf", "pytesseract", "python-arango", "pyvespa", "qdrant-client", "rdflib", "redis", "requests-toolbelt", "sentence-transformers", "singlestoredb", "tensorflow-text", "tigrisdb", "tiktoken", "torch", "transformers", "weaviate-client", "wikipedia", "wolframalpha"]
|
||||||
azure = ["azure-ai-formrecognizer", "azure-ai-textanalytics", "azure-ai-vision", "azure-cognitiveservices-speech", "azure-core", "azure-cosmos", "azure-identity", "azure-search-documents", "openai"]
|
azure = ["azure-ai-formrecognizer", "azure-ai-textanalytics", "azure-ai-vision", "azure-cognitiveservices-speech", "azure-core", "azure-cosmos", "azure-identity", "azure-search-documents", "openai"]
|
||||||
clarifai = ["clarifai"]
|
clarifai = ["clarifai"]
|
||||||
cli = ["typer"]
|
cli = ["typer"]
|
||||||
cohere = ["cohere"]
|
cohere = ["cohere"]
|
||||||
docarray = ["docarray"]
|
docarray = ["docarray"]
|
||||||
embeddings = ["sentence-transformers"]
|
embeddings = ["sentence-transformers"]
|
||||||
-extended-testing = ["aiosqlite", "aleph-alpha-client", "anthropic", "arxiv", "assemblyai", "atlassian-python-api", "beautifulsoup4", "bibtexparser", "cassio", "chardet", "dashvector", "databricks-vectorsearch", "esprima", "faiss-cpu", "feedparser", "fireworks-ai", "geopandas", "gitpython", "google-cloud-documentai", "gql", "html2text", "javelin-sdk", "jinja2", "jq", "jsonschema", "lxml", "markdownify", "motor", "msal", "mwparserfromhell", "mwxml", "newspaper3k", "numexpr", "openai", "openai", "openapi-pydantic", "pandas", "pdfminer-six", "pgvector", "psychicapi", "py-trello", "pymupdf", "pypdf", "pypdfium2", "pyspark", "rank-bm25", "rapidfuzz", "rapidocr-onnxruntime", "requests-toolbelt", "rspace_client", "scikit-learn", "sqlite-vss", "streamlit", "sympy", "telethon", "timescale-vector", "tqdm", "upstash-redis", "xata", "xmltodict"]
+extended-testing = ["aiosqlite", "aleph-alpha-client", "anthropic", "arxiv", "assemblyai", "atlassian-python-api", "beautifulsoup4", "bibtexparser", "cassio", "chardet", "dashvector", "databricks-vectorsearch", "dgml-utils", "esprima", "faiss-cpu", "feedparser", "fireworks-ai", "geopandas", "gitpython", "google-cloud-documentai", "gql", "html2text", "javelin-sdk", "jinja2", "jq", "jsonschema", "lxml", "markdownify", "motor", "msal", "mwparserfromhell", "mwxml", "newspaper3k", "numexpr", "openai", "openai", "openapi-pydantic", "pandas", "pdfminer-six", "pgvector", "psychicapi", "py-trello", "pymupdf", "pypdf", "pypdfium2", "pyspark", "rank-bm25", "rapidfuzz", "rapidocr-onnxruntime", "requests-toolbelt", "rspace_client", "scikit-learn", "sqlite-vss", "streamlit", "sympy", "telethon", "timescale-vector", "tqdm", "upstash-redis", "xata", "xmltodict"]
 javascript = ["esprima"]
 llms = ["clarifai", "cohere", "huggingface_hub", "manifest-ml", "nlpcloud", "openai", "openlm", "torch", "transformers"]
 openai = ["openai", "tiktoken"]
@@ -11196,4 +11192,4 @@ text-helpers = ["chardet"]
 [metadata]
 lock-version = "2.0"
 python-versions = ">=3.8.1,<4.0"
-content-hash = "943da392f7b9f8d3677e879ef971eb50c068e0b5658e6e01f3b2589e82fa3b71"
+content-hash = "ef4b14aed39d823f33de6bda543aadf208c7adedf75bf9db28a682fcc46ea792"
@@ -145,7 +145,7 @@ fireworks-ai = {version = "^0.6.0", optional = true, python = ">=3.9,<4.0"}
 javelin-sdk = {version = "^0.1.8", optional = true}
 msal = {version = "^1.25.0", optional = true}
 databricks-vectorsearch = {version = "^0.21", optional = true}
-
+dgml-utils = {version = "^0.3.0", optional = true}
 
 [tool.poetry.group.test.dependencies]
 # The only dependencies that should be added are
@@ -167,7 +167,6 @@ syrupy = "^4.0.2"
 requests-mock = "^1.11.0"
 langchain-core = {path = "../core", develop = true}
 
-
 [tool.poetry.group.codespell.dependencies]
 codespell = "^2.2.0"
 
@@ -314,6 +313,7 @@ all = [
 "amadeus",
 "librosa",
 "python-arango",
+"dgml-utils",
 ]
 
 cli = [
@@ -384,6 +384,7 @@ extended_testing = [
 "fireworks-ai",
 "javelin-sdk",
 "databricks-vectorsearch",
+"dgml-utils",
 ]
 
 [tool.ruff]
@@ -1,336 +1,379 @@
<?xml version="1.0" encoding="utf-8"?>
|
<?xml version="1.0" encoding="utf-8"?>
|
||||||
<docset:MUTUALNON-DISCLOSUREAGREEMENT-section xmlns:docset="http://www.docugami.com/2021/dgml/PublishTest/NDA" xmlns:addedChunks="http://www.docugami.com/2021/dgml/PublishTest/NDA/addedChunks" xmlns:dg="http://www.docugami.com/2021/dgml" xmlns:dgc="http://www.docugami.com/2021/dgml/docugami/contracts" xmlns:dgm="http://www.docugami.com/2021/dgml/docugami/medical" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml">
|
<dg:chunk cp:version="2.10.10.0.1699162341377-69.0"
|
||||||
<docset:MutualNon-disclosure structure="h1">
|
xmlns:docset="http://www.docugami.com/2021/dgml/TaqiTest20231103/NDA"
|
||||||
<dg:chunk>MUTUAL NON-DISCLOSURE AGREEMENT </dg:chunk>
|
xmlns:addedChunks="http://www.docugami.com/2021/dgml/TaqiTest20231103/NDA/addedChunks"
|
||||||
</docset:MutualNon-disclosure>
|
xmlns:dg="http://www.docugami.com/2021/dgml"
|
||||||
<docset:MUTUALNON-DISCLOSUREAGREEMENT structure="div">
|
xmlns:dgc="http://www.docugami.com/2021/dgml/docugami/contracts"
|
||||||
<docset:ThisMutualNon-disclosureAgreement>
|
xmlns:dgm="http://www.docugami.com/2021/dgml/docugami/medical"
|
||||||
<docset:Preamble structure="p">
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml"
|
||||||
This
|
xmlns:cp="http://classifyprocess.com/2018/07/">
|
||||||
<dg:chunk>Mutual Non-Disclosure Agreement </dg:chunk>(this “
|
<docset:MUTUALNON-DISCLOSUREAGREEMENT-section>
|
||||||
<dg:chunk>Agreement</dg:chunk>”) is entered into and made effective as of
|
<dg:chunk structure="h1">NON-DISCLOSURE AGREEMENT </dg:chunk>
|
||||||
<docset:EffectiveDate xsi:type="date">2/4/2018 </docset:EffectiveDate>between
|
<docset:MUTUALNON-DISCLOSUREAGREEMENT structure="div"> This Non-Disclosure Agreement
|
||||||
<dgc:Org>Docugami Inc.</dgc:Org>, a
|
("Agreement") is entered into as of <docset:EffectiveDate>November 4, 2023 </docset:EffectiveDate>("Effective
|
||||||
<dgc:USState>Delaware </dgc:USState>corporation, whose address is 150 Lake Street South, Suite 221, Kirkland,
|
Date"), by and between: </docset:MUTUALNON-DISCLOSUREAGREEMENT>
|
||||||
<dgc:USState>Washington </dgc:USState>
|
</docset:MUTUALNON-DISCLOSUREAGREEMENT-section>
|
||||||
<dg:chunk>Delaware corporation</dg:chunk>, whose address is
|
<docset:DisclosingParty-section>
|
||||||
<docset:CompanyAddress>
|
<dg:chunk structure="h1">
|
||||||
<dgc:Street>
|
|
||||||
|
Disclosing Party: </dg:chunk>
|
||||||
|
<docset:DisclosingParty structure="div"><docset:PrincipalPlaceofBusiness>Widget Corp.</docset:PrincipalPlaceofBusiness>,
|
||||||
|
a <dgc:USState>Delaware </dgc:USState>corporation with its principal place of business
|
||||||
|
at <docset:PrincipalPlaceofBusiness><docset:PrincipalPlaceofBusiness>
|
||||||
|
<docset:WidgetCorpAddress>123 </docset:WidgetCorpAddress>
|
||||||
|
<docset:PrincipalPlaceofBusiness>Innovation Drive</docset:PrincipalPlaceofBusiness>
|
||||||
|
</docset:PrincipalPlaceofBusiness>
|
||||||
|
, <docset:PrincipalPlaceofBusiness>Techville</docset:PrincipalPlaceofBusiness>, <dgc:USState>
|
||||||
|
Delaware</dgc:USState>, <docset:PrincipalPlaceofBusiness>12345 </docset:PrincipalPlaceofBusiness></docset:PrincipalPlaceofBusiness>
|
||||||
|
("<dgc:Org>
|
||||||
|
<docset:CompanyName>Widget </docset:CompanyName>
|
||||||
|
<docset:CorporateName>Corp.</docset:CorporateName>
|
||||||
|
</dgc:Org>") </docset:DisclosingParty>
|
||||||
|
</docset:DisclosingParty-section>
|
||||||
<dg:chunk>
|
<dg:chunk>
|
||||||
<dgc:Number>150 </dgc:Number>
|
<docset:ReceivingParty-section>
|
||||||
<dgc:StreetName>Lake Street South</dgc:StreetName>
|
<dg:chunk structure="h1">
|
||||||
</dg:chunk>,
|
|
||||||
<dgc:Apt>
|
Receiving Party: </dg:chunk>
|
||||||
<dg:chunk>Suite </dg:chunk>
|
<docset:ReceivingParty structure="div">
|
||||||
<docset:Suite>221</docset:Suite>
|
<dg:chunk structure="p"><docset:RecipientName>Jane Doe</docset:RecipientName>, an
|
||||||
</dgc:Apt>
|
individual residing at <docset:RecipientAddress><docset:RecipientAddress>
|
||||||
</dgc:Street>,
|
<docset:RecipientAddress>456 </docset:RecipientAddress>
|
||||||
<dgc:City>Kirkland</dgc:City>,
|
<docset:RecipientAddress>Privacy Lane</docset:RecipientAddress>
|
||||||
<dgc:State>Washington </dgc:State>
|
</docset:RecipientAddress>
|
||||||
<dgc:Number>98033</dgc:Number>
|
, <docset:RecipientAddress>Safetown</docset:RecipientAddress>, <dgc:USState>
|
||||||
</docset:CompanyAddress>, and
|
California</dgc:USState>, <docset:RecipientAddress>67890 </docset:RecipientAddress></docset:RecipientAddress>
|
||||||
<docset:Signatory>Leonarda Hosler</docset:Signatory>, an individual, whose address is
|
("Recipient") </dg:chunk>
|
||||||
<dgc:Address>
|
|
||||||
<dgc:Street>
|
|
||||||
<dgc:Number>374 </dgc:Number>
|
|
||||||
<dgc:StreetName>William S Canning Blvd</dgc:StreetName>
|
|
||||||
</dgc:Street>,
|
|
||||||
<dg:chunk>
|
<dg:chunk>
|
||||||
<dgc:City>Fall River </dgc:City>
|
<dg:chunk structure="p">
|
||||||
<dgc:State>MA </dgc:State>
|
|
||||||
</dg:chunk>
|
|
||||||
</dgc:Address>
|
|
||||||
<dgc:Number>2721</dgc:Number>
|
|
||||||
<docset:SignatoryAddress>374 William S Canning Blvd, Fall River MA 2721</docset:SignatoryAddress>.
|
|
||||||
</docset:Preamble>
|
|
||||||
<docset:Discussions structure="p">
|
|
||||||
|
|
||||||
The above named parties desire to engage in discussions regarding a potential agreement or other transaction between the parties (the “Purpose”). In connection with such discussions, it may be necessary for the parties to disclose to each other certain confidential information or materials to enable them to evaluate whether to enter into such agreement or transaction. </docset:Discussions>
|
(collectively referred to as the "Parties"). </dg:chunk>
|
||||||
<docset:Consideration>
|
|
||||||
<docset:Consideration structure="p">
|
|
||||||
|
|
||||||
In consideration of the foregoing, the parties agree as follows: </docset:Consideration>
|
|
||||||
<docset:Purposes style="list-style-type: decimal; boundingBox:{left: 266.7; top: 1175.0; width: 2012.5; height: 1858.3; page: 1;}; boundingBox:{left: 266.7; top: 245.8; width: 2012.5; height: 1737.5; page: 2;}; " structure="ol">
|
|
||||||
<docset:Purposes style="boundingBox:{left: 266.7; top: 1175.0; width: 2012.5; height: 575.0; page: 1;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 1175.0; width: 50.0; height: 50.0; page: 1;}; " structure="lim">1. </dg:chunk>
|
|
||||||
<docset:ConfidentialInformation-section>
|
|
||||||
<docset:ConfidentialInformation structure="h1">
|
|
||||||
<dg:chunk>Confidential Information</dg:chunk>.
|
|
||||||
</docset:ConfidentialInformation>
|
|
||||||
<docset:ConfidentialInformation structure="div">For purposes of this
|
|
||||||
<dg:chunk>Agreement</dg:chunk>, “
|
|
||||||
<dg:chunk>Confidential Information</dg:chunk>” means any information or materials disclosed by
|
|
||||||
<dg:chunk>
|
<dg:chunk>
|
||||||
<dg:chunk>one </dg:chunk>party
|
|
||||||
</dg:chunk>to the other party that: (i) if disclosed in writing or in the form of tangible materials, is marked “confidential” or “proprietary” at the time of such disclosure; (ii) if disclosed orally or by visual presentation, is identified as “confidential” or “proprietary” at the time of such disclosure, and is summarized in a writing sent by the disclosing party to the receiving party within
|
<docset:ConfidentialityObligations structure="ol"
|
||||||
<dgc:TimeDuration>
|
style="list-style-type: decimal; boundingBox:{left: 300.0; top: 936.0; width: 30.0; height: 1881.0; page: 1;}; boundingBox:{left: 300.0; top: 309.0; width: 30.0; height: 777.0; page: 2;}; ">
|
||||||
<dg:chunk>thirty </dg:chunk>(
|
<dg:chunk structure="li"
|
||||||
<dg:chunk>30</dg:chunk>) days
|
style="boundingBox:{left: 300.0; top: 936.0; width: 30.0; height: 45.0; page: 1;}; ">
|
||||||
</dgc:TimeDuration>after any such disclosure; or (iii) due to its nature or the circumstances of its disclosure, a person exercising reasonable business judgment would understand to be confidential or proprietary.
|
<dg:chunk structure="lim"
|
||||||
</docset:ConfidentialInformation>
|
style="boundingBox:{left: 300.0; top: 936.0; width: 48.0; height: 45.0; page: 1;}; ">
|
||||||
</docset:ConfidentialInformation-section>
|
1. </dg:chunk>
|
||||||
</docset:Purposes>
|
<docset:DefinitionofConfidentialInformation-section>
|
||||||
<docset:Obligations style="boundingBox:{left: 266.7; top: 1758.3; width: 2012.5; height: 691.7; page: 1;}; " structure="li">
|
<dg:chunk structure="h1">Definition of <dg:chunk>Confidential
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 1758.3; width: 50.0; height: 50.0; page: 1;}; " structure="lim">2. </dg:chunk>
|
Information </dg:chunk></dg:chunk>
|
||||||
<docset:ObligationsAndRestrictions-section>
|
<docset:DefinitionofConfidentialInformation structure="div">For
|
||||||
<docset:Obligations structure="h1">Obligations and
|
purposes of this Agreement, "<dg:chunk>Confidential
|
||||||
<dg:chunk>Restrictions</dg:chunk>.
|
Information</dg:chunk>" shall include all information or
|
||||||
</docset:Obligations>
|
material that has or could have commercial value or other
|
||||||
<docset:ObligationsAndRestrictions structure="div">Each party agrees: (i) to maintain the
|
utility in the business in which Disclosing Party is
|
||||||
<dg:chunk>other party's Confidential Information </dg:chunk>in strict confidence; (ii) not to disclose
|
engaged. If <dg:chunk>Confidential Information </dg:chunk>is
|
||||||
<dg:chunk>such Confidential Information </dg:chunk>to any third party; and (iii) not to use
|
in written form, the <dg:chunk>Disclosing Party </dg:chunk>shall
|
||||||
<dg:chunk>such Confidential Information </dg:chunk>for any purpose except for the Purpose. Each party may disclose the
|
label or stamp the materials with the word "Confidential" or
|
||||||
<dg:chunk>other party’s Confidential Information </dg:chunk>to its employees and consultants who have a bona fide need to know
|
some similar warning. If <dg:chunk>Confidential Information </dg:chunk>is
|
||||||
<dg:chunk>such Confidential Information </dg:chunk>for the Purpose, but solely to the extent necessary to pursue the
|
transmitted orally, the <dg:chunk>Disclosing Party </dg:chunk>shall
|
||||||
<dg:chunk>Purpose </dg:chunk>and for no other purpose; provided, that each such employee and consultant first executes a written agreement (or is otherwise already bound by a written agreement) that contains use and nondisclosure restrictions at least as protective of the
|
promptly provide writing indicating that such oral
|
||||||
<dg:chunk>other party’s Confidential Information </dg:chunk>as those set forth in this
|
communication constituted <dg:chunk>Confidential Information</dg:chunk>
|
||||||
<dg:chunk>Agreement</dg:chunk>.
|
. </docset:DefinitionofConfidentialInformation>
|
||||||
</docset:ObligationsAndRestrictions>
|
</docset:DefinitionofConfidentialInformation-section>
|
||||||
</docset:ObligationsAndRestrictions-section>
|
|
||||||
</docset:Obligations>
|
|
||||||
<docset:Exceptions style="boundingBox:{left: 266.7; top: 2458.3; width: 2012.5; height: 108.3; page: 1;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 2458.3; width: 50.0; height: 50.0; page: 1;}; " structure="lim">3. </dg:chunk>
|
|
||||||
<docset:Exceptions-section>
|
|
||||||
<docset:Exceptions structure="h1">Exceptions. </docset:Exceptions>
|
|
||||||
<docset:Exceptions structure="div">The obligations and restrictions in Section
|
|
||||||
<dg:chunk>2 </dg:chunk>will not apply to any information or materials that:
|
|
||||||
</docset:Exceptions>
|
|
||||||
</docset:Exceptions-section>
|
|
||||||
</docset:Exceptions>
|
|
||||||
<docset:TheDate style="boundingBox:{left: 266.7; top: 2575.0; width: 2012.5; height: 166.7; page: 1;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 2575.0; width: 58.3; height: 50.0; page: 1;}; " structure="lim">(i) </dg:chunk>
|
|
||||||
<docset:TheDate structure="p">were, at the date of disclosure, or have subsequently become, generally known or available to the public through no act or failure to act by the receiving party; </docset:TheDate>
|
|
||||||
</docset:TheDate>
|
|
||||||
<docset:SuchInformation style="boundingBox:{left: 266.7; top: 2750.0; width: 2012.5; height: 108.3; page: 1;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 2750.0; width: 70.8; height: 50.0; page: 1;}; " structure="lim">(ii) </dg:chunk>
|
|
||||||
<docset:TheReceivingParty structure="p">were rightfully known by the receiving party prior to receiving such information or materials from the disclosing party; </docset:TheReceivingParty>
|
|
||||||
</docset:SuchInformation>
|
|
||||||
<docset:TheReceivingParty style="boundingBox:{left: 266.7; top: 2866.7; width: 2012.5; height: 166.7; page: 1;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 2866.7; width: 87.5; height: 50.0; page: 1;}; " structure="lim">(iii) </dg:chunk>
|
|
||||||
<docset:TheReceivingParty structure="p">are rightfully acquired by the receiving party from a third party who has the right to disclose such information or materials without breach of any confidentiality obligation to the disclosing party; or </docset:TheReceivingParty>
|
|
||||||
</docset:TheReceivingParty>
|
|
||||||
<docset:TheReceivingParty style="boundingBox:{left: 266.7; top: 245.8; width: 2012.5; height: 108.3; page: 2;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 245.8; width: 83.3; height: 50.0; page: 2;}; " structure="lim">(iv) </dg:chunk>
|
|
||||||
<docset:TheReceivingParty structure="p">are independently developed by the receiving party without access to any
|
|
||||||
<dg:chunk>Confidential Information </dg:chunk>of the disclosing party.
|
|
||||||
</docset:TheReceivingParty>
|
|
||||||
</docset:TheReceivingParty>
|
|
||||||
<docset:Disclosure style="boundingBox:{left: 266.7; top: 362.5; width: 2012.5; height: 341.7; page: 2;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 362.5; width: 50.0; height: 50.0; page: 2;}; " structure="lim">4. </dg:chunk>
|
|
||||||
<docset:CompelledDisclosure-section>
|
|
||||||
<dg:chunk structure="h1">
|
|
||||||
<dg:chunk>Compelled Disclosure</dg:chunk>.
|
|
||||||
</dg:chunk>
|
</dg:chunk>
|
||||||
<docset:CompelledDisclosure structure="div">Nothing in this
|
<dg:chunk structure="li"
|
||||||
<dg:chunk>Agreement </dg:chunk>will be deemed to restrict a party from disclosing the
|
style="boundingBox:{left: 300.0; top: 1428.0; width: 30.0; height: 48.0; page: 1;}; ">
|
||||||
<dg:chunk>other party’s Confidential Information </dg:chunk>to the extent required by any order, subpoena, law, statute or regulation; provided, that the party required to make such a disclosure uses reasonable efforts to give the other party reasonable advance notice of such required disclosure in order to enable the other party to prevent or limit such disclosure.
|
<dg:chunk structure="lim"
|
||||||
</docset:CompelledDisclosure>
|
style="boundingBox:{left: 300.0; top: 1428.0; width: 48.0; height: 48.0; page: 1;}; ">
|
||||||
</docset:CompelledDisclosure-section>
|
2. </dg:chunk>
|
||||||
</docset:Disclosure>
|
<docset:ExclusionsFromConfidentialInformation-section>
|
||||||
<docset:TheCompletion style="boundingBox:{left: 266.7; top: 712.5; width: 2012.5; height: 512.5; page: 2;}; " structure="li">
|
<dg:chunk structure="h1">Exclusions from <dg:chunk>Confidential
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 712.5; width: 50.0; height: 50.0; page: 2;}; " structure="lim">5. </dg:chunk>
|
Information </dg:chunk></dg:chunk>
|
||||||
<docset:ReturnofConfidentialInformation-section>
|
<docset:ExclusionsFromConfidentialInformation structure="div">Recipient's
|
||||||
<dg:chunk structure="h1">Return of
|
obligations under this Agreement do not extend to
|
||||||
<dg:chunk>Confidential Information</dg:chunk>.
|
information that is: (a) publicly known at the time of
|
||||||
|
disclosure or subsequently becomes publicly known through no
|
||||||
|
fault of the Recipient; (b) discovered or created by the
|
||||||
|
Recipient before disclosure by <dg:chunk>Disclosing Party</dg:chunk>;
|
||||||
|
(c) learned by the Recipient through legitimate means other
|
||||||
|
than from the <dg:chunk>Disclosing Party </dg:chunk>or
|
||||||
|
Disclosing Party's representatives; or (d) is disclosed by
|
||||||
|
Recipient with Disclosing Party's prior written approval. </docset:ExclusionsFromConfidentialInformation>
|
||||||
|
</docset:ExclusionsFromConfidentialInformation-section>
|
||||||
</dg:chunk>
|
</dg:chunk>
|
||||||
<docset:ReturnofConfidentialInformation structure="div">Upon the completion or abandonment of the Purpose, and in any event upon the disclosing party’s request, the receiving party will promptly return to the disclosing party all tangible items and embodiments containing or consisting of the
|
<dg:chunk structure="li"
|
||||||
<dg:chunk>disclosing party’s Confidential Information </dg:chunk>and all copies thereof (including electronic copies), and any notes, analyses, compilations, studies, interpretations, memoranda or other documents (regardless of the form thereof) prepared by or on behalf of the receiving party that contain or are based upon the
|
style="boundingBox:{left: 300.0; top: 1866.0; width: 30.0; height: 45.0; page: 1;}; ">
|
||||||
<dg:chunk>disclosing party’s Confidential Information</dg:chunk>.
|
<dg:chunk structure="lim"
|
||||||
</docset:ReturnofConfidentialInformation>
|
style="boundingBox:{left: 300.0; top: 1866.0; width: 48.0; height: 45.0; page: 1;}; ">
|
||||||
</docset:ReturnofConfidentialInformation-section>
|
3. </dg:chunk>
|
||||||
</docset:TheCompletion>
|
<docset:ObligationsofReceivingParty-section>
|
||||||
<docset:NoObligations style="boundingBox:{left: 266.7; top: 1233.3; width: 2012.5; height: 283.3; page: 2;}; " structure="li">
|
<dg:chunk structure="h1">Obligations of Receiving Party </dg:chunk>
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 1233.3; width: 50.0; height: 50.0; page: 2;}; " structure="lim">6. </dg:chunk>
|
<docset:ObligationsofReceivingParty structure="div">Recipient
|
||||||
<docset:NoObligations-section>
|
shall hold and maintain the <dg:chunk>Confidential
|
||||||
<docset:NoObligations structure="h1">No
|
Information </dg:chunk>in strictest confidence for the sole
|
||||||
<dg:chunk>Obligations</dg:chunk>.
|
and exclusive benefit of the <dg:chunk>Disclosing Party</dg:chunk>.
|
||||||
</docset:NoObligations>
|
Recipient shall carefully restrict access to <dg:chunk>Confidential
|
||||||
<docset:NoObligations structure="div">Each party retains the right, in its sole discretion, to determine whether to disclose any
|
Information </dg:chunk>to employees, contractors, and third
|
||||||
<dg:chunk>Confidential Information </dg:chunk>to the other party. Neither party will be required to negotiate nor enter into any other agreements or arrangements with the other party, whether or not related to the Purpose.
|
parties as is reasonably required and shall require those
|
||||||
</docset:NoObligations>
|
persons to sign nondisclosure restrictions at least as
|
||||||
</docset:NoObligations-section>
|
protective as those in this Agreement. </docset:ObligationsofReceivingParty>
|
||||||
</docset:NoObligations>
|
</docset:ObligationsofReceivingParty-section>
|
||||||
<docset:TheSoleAndExclusiveProperty style="boundingBox:{left: 266.7; top: 1525.0; width: 2012.5; height: 399.0; page: 2;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 1525.0; width: 50.0; height: 50.0; page: 2;}; " structure="lim">7. </dg:chunk>
|
|
||||||
<docset:NoLicense-section>
|
|
||||||
<docset:NoLicense structure="h1">No
|
|
||||||
<dg:chunk>License</dg:chunk>.
|
|
||||||
</docset:NoLicense>
|
|
||||||
<docset:NoLicense structure="div">All
|
|
||||||
<dg:chunk>Confidential Information </dg:chunk>remains the sole and exclusive property of the disclosing party. Each party acknowledges and agrees that nothing in this
|
|
||||||
<dg:chunk>Agreement </dg:chunk>will be construed as granting any rights to the receiving party, by license or otherwise, in or to any
|
|
||||||
<dg:chunk>Confidential Information </dg:chunk>of the disclosing party, or any patent, copyright or other intellectual property or proprietary rights of the disclosing party, except as specified in this
|
|
||||||
<dg:chunk>Agreement</dg:chunk>.
|
|
||||||
</docset:NoLicense>
|
|
||||||
</docset:NoLicense-section>
|
|
||||||
</docset:TheSoleAndExclusiveProperty>
|
|
||||||
<docset:NoWarranty style="boundingBox:{left: 416.7; top: 1933.3; width: 1862.5; height: 50.0; page: 2;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 1933.3; width: 50.0; height: 50.0; page: 2;}; " structure="lim">8. </dg:chunk>
|
|
||||||
<docset:NoWarranty structure="h1">No Warranty. ALL CONFIDENTIAL
|
|
||||||
<dgm:Diagnosis>INFORMATION </dgm:Diagnosis>
|
|
||||||
<dg:chunk>CONFIDENTIAL INFORMATION </dg:chunk>IS PROVIDED
|
|
||||||
</docset:NoWarranty>
|
|
||||||
</docset:NoWarranty>
|
|
||||||
<docset:Exceptions>The obligations and restrictions in Section 2 will not apply to any information or materials that:
|
|
||||||
|
|
||||||
(i) were, at the date of disclosure, or have subsequently become, generally known or available to the public through no act or failure to act by the receiving party;
|
|
||||||
|
|
||||||
(ii) were rightfully known by the receiving party prior to receiving such information or materials from the disclosing party;
|
|
||||||
|
|
||||||
(iii) are rightfully acquired by the receiving party from a third party who has the right to disclose such information or materials without breach of any confidentiality obligation to the disclosing party; or
|
|
||||||
|
|
||||||
(iv) are independently developed by the receiving party without access to any Confidential Information of the disclosing party. </docset:Exceptions>
|
|
||||||
|
|
||||||
4. Compelled Disclosure. Nothing in this Agreement will be deemed to restrict a party from disclosing the other party’s Confidential Information to the extent required by any order, subpoena, law, statute or regulation; provided, that the party required to make such a disclosure uses reasonable efforts to give the other party reasonable advance notice of such required disclosure in order to enable the other party to prevent or limit such disclosure.
|
|
||||||
|
|
||||||
5. Return of Confidential Information. Upon the completion or abandonment of the Purpose, and in any event upon the disclosing party’s request, the receiving party will promptly return to the disclosing party all tangible items and embodiments containing or consisting of the disclosing party’s Confidential Information and all copies thereof (including electronic copies), and any notes, analyses, compilations, studies, interpretations, memoranda or other documents (regardless of the form thereof) prepared by or on behalf of the receiving party that contain or are based upon the disclosing party’s Confidential Information.
|
|
||||||
|
|
||||||
6. No Obligations. Each party retains the right, in its sole discretion, to determine whether to disclose any Confidential Information to the other party. Neither party will be required to negotiate nor enter into any other agreements or arrangements with the other party, whether or not related to the Purpose.
|
|
||||||
|
|
||||||
7. No License. All Confidential Information remains the sole and exclusive property of the disclosing party. Each party acknowledges and agrees that nothing in this Agreement will be construed as granting any rights to the receiving party, by license or otherwise, in or to any Confidential Information of the disclosing party, or any patent, copyright or other intellectual property or proprietary rights of the disclosing party, except as specified in this Agreement.
|
|
||||||
|
|
||||||
8. No Warranty. ALL CONFIDENTIAL INFORMATION IS PROVIDED
|
|
||||||
</docset:Purposes>
|
|
||||||
</docset:Consideration>
|
|
||||||
</docset:ThisMutualNon-disclosureAgreement>
|
|
||||||
<docset:Effect>
|
|
||||||
<dg:chunk structure="h1">
|
|
||||||
|
|
||||||
BY THE
|
|
||||||
<dg:chunk>DISCLOSING PARTY </dg:chunk>“AS IS”.
|
|
||||||
</dg:chunk>
|
</dg:chunk>
|
||||||
<docset:Effect structure="div">
|
<dg:chunk structure="li"
|
||||||
<docset:Effect style="list-style-type: decimal; boundingBox:{left: 266.7; top: 2050.0; width: 2012.5; height: 979.2; page: 2;}; " structure="ol">
|
style="boundingBox:{left: 300.0; top: 2244.0; width: 30.0; height: 48.0; page: 1;}; ">
|
||||||
<docset:ThisAgreement style="boundingBox:{left: 266.7; top: 2050.0; width: 2012.5; height: 166.7; page: 2;}; " structure="li">
|
<dg:chunk structure="lim"
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 2050.0; width: 50.0; height: 50.0; page: 2;}; " structure="lim">9. </dg:chunk>
|
style="boundingBox:{left: 300.0; top: 2244.0; width: 48.0; height: 48.0; page: 1;}; ">
|
||||||
<docset:Term-section>
|
4. </dg:chunk>
|
||||||
<dg:chunk structure="h1">Term. </dg:chunk>
|
<docset:TimePeriods-section>
|
||||||
<docset:Term structure="div">This
|
<dg:chunk structure="h1">Time Periods </dg:chunk>
|
||||||
<dg:chunk>Agreement </dg:chunk>will remain in effect for a period of
|
<docset:TimePeriods structure="div">The nondisclosure provisions
|
||||||
<docset:RemaininEffect>
|
of this Agreement shall survive the termination of this
|
||||||
<dg:chunk>five </dg:chunk>(
|
Agreement and Recipient's duty to hold <dg:chunk>Confidential
|
||||||
<dg:chunk>5</dg:chunk>) years
|
Information </dg:chunk>in confidence shall remain in effect
|
||||||
</docset:RemaininEffect>from the date of last disclosure of
|
until the <dg:chunk>Confidential Information </dg:chunk>no
|
||||||
<dg:chunk>Confidential Information </dg:chunk>by either party, at which time it will terminate.
|
longer qualifies as a trade secret or until <dg:chunk>Disclosing
|
||||||
</docset:Term>
|
Party </dg:chunk>sends Recipient written notice releasing
|
||||||
</docset:Term-section>
|
Recipient from this Agreement, whichever occurs first. </docset:TimePeriods>
|
||||||
</docset:ThisAgreement>
|
</docset:TimePeriods-section>
|
||||||
<docset:EquitableRelief style="boundingBox:{left: 266.7; top: 2225.0; width: 2012.5; height: 400.0; page: 2;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 2225.0; width: 79.2; height: 50.0; page: 2;}; " structure="lim">10. </dg:chunk>
|
|
||||||
<docset:EquitableRelief-section>
|
|
||||||
<docset:EquitableRelief structure="h1">
|
|
||||||
<dg:chunk>Equitable Relief</dg:chunk>.
|
|
||||||
</docset:EquitableRelief>
|
|
||||||
<docset:EquitableRelief structure="div">Each party acknowledges that the unauthorized use or disclosure of the
|
|
||||||
<dg:chunk>disclosing party’s Confidential Information </dg:chunk>may cause the disclosing party to incur irreparable harm and significant damages, the degree of which may be difficult to ascertain. Accordingly, each party agrees that the disclosing party will have the right to seek immediate equitable relief to enjoin any unauthorized use or disclosure of
|
|
||||||
<dg:chunk>its Confidential Information</dg:chunk>, in addition to any other rights and remedies that it may have at law or otherwise.
|
|
||||||
</docset:EquitableRelief>
|
|
||||||
</docset:EquitableRelief-section>
|
|
||||||
</docset:EquitableRelief>
|
|
||||||
<docset:Accordance style="boundingBox:{left: 266.7; top: 2633.3; width: 2012.5; height: 395.8; page: 2;}; " structure="li">
|
|
||||||
<dg:chunk style="boundingBox:{left: 416.7; top: 2633.3; width: 79.2; height: 50.0; page: 2;}; " structure="lim">11. </dg:chunk>
|
|
||||||
<docset:Miscellaneous-section>
|
|
||||||
<dg:chunk structure="h1">Miscellaneous. </dg:chunk>
|
|
||||||
<docset:Miscellaneous structure="div">This
|
|
||||||
<dg:chunk>Agreement </dg:chunk>will be governed and construed in accordance with the laws of the
|
|
||||||
<dg:chunk>State </dg:chunk>of
|
|
||||||
<dgc:USState>Washington</dgc:USState>, excluding its body of law controlling conflict of laws. This
|
|
||||||
<dg:chunk>Agreement </dg:chunk>is the complete and exclusive understanding and agreement between the parties regarding the subject matter of this
|
|
||||||
<dg:chunk>Agreement </dg:chunk>and supersedes all prior agreements, understandings and communications, oral or written, between the parties regarding the subject matter of this
|
|
||||||
<dg:chunk>Agreement</dg:chunk>. If any provision of this
|
|
||||||
<dg:chunk>Agreement </dg:chunk>is held invalid or unenforceable by a court of competent jurisdiction, that provision of this
|
|
||||||
<dg:chunk>Agreement </dg:chunk>will be enforced to the maximum extent permissible and the other provisions of this
|
|
||||||
<dg:chunk>Agreement </dg:chunk>will remain in full force and effect. Neither party may assign this
|
|
||||||
<dg:chunk>Agreement</dg:chunk>, in whole or in part, by operation of law or otherwise, without the other party’s prior written consent, and any attempted assignment without such consent will be void. This
|
|
||||||
<dg:chunk>Agreement </dg:chunk>may be executed in counterparts, each of which will be deemed an original, but all of which together will constitute one and the same instrument.
|
|
||||||
</docset:Miscellaneous>
|
|
||||||
</docset:Miscellaneous-section>
|
|
||||||
</docset:Accordance>
|
|
||||||
</docset:Effect>
|
|
||||||
<docset:SIGNATUREPAGEFOLLOWS-section>
|
|
||||||
<dg:chunk structure="h1">
|
|
||||||
|
|
||||||
[SIGNATURE PAGE FOLLOWS] </dg:chunk>
|
|
||||||
<docset:SIGNATUREPAGEFOLLOWS structure="div">
|
|
||||||
<dg:chunk structure="h1">
|
|
||||||
|
|
||||||
IN
|
|
||||||
<dg:chunk>WITNESS </dg:chunk>WHEREOF,
|
|
||||||
</dg:chunk>
|
</dg:chunk>
|
||||||
<docset:INWITNESSWHEREOF structure="div">
|
<dg:chunk structure="li"
|
||||||
<docset:TheParties structure="p">the parties hereto have executed this
|
style="boundingBox:{left: 300.0; top: 2565.0; width: 30.0; height: 48.0; page: 1;}; ">
|
||||||
<dg:chunk>Mutual Non-Disclosure Agreement </dg:chunk>by their duly authorized officers or representatives as of the date first set forth above.
|
<dg:chunk structure="lim"
|
||||||
</docset:TheParties>
|
style="boundingBox:{left: 300.0; top: 2565.0; width: 48.0; height: 48.0; page: 1;}; ">
|
||||||
<docset:DocugamiInc>
|
5. </dg:chunk>
|
||||||
<docset:DocugamiInc style="boundingBox:{left: 316.7; top: 529.2; width: 1958.8; height: 247.7; page: 4;}; ">
|
<docset:Relationships-section>
|
||||||
<xhtml:table style="boundingBox:{left: 316.7; top: 529.2; width: 1958.8; height: 247.7; page: 4;}; ">
|
<dg:chunk structure="h1">Relationships </dg:chunk>
|
||||||
<xhtml:tbody style="boundingBox:{left: 316.7; top: 529.2; width: 1958.8; height: 247.7; page: 4;}; ">
|
<docset:Relationships structure="div">Nothing contained in this
|
||||||
<xhtml:tr style="boundingBox:{left: 316.7; top: 529.2; width: 1958.8; height: 91.0; page: 4;}; ">
|
Agreement shall be deemed to constitute either party a
|
||||||
<xhtml:td style="boundingBox:{left: 316.7; top: 529.2; width: 768.8; height: 91.0; page: 4;}; ">
|
partner, joint venture, or employee of the other party for
|
||||||
<docset:DocugamiInc structure="h1">
|
any purpose.
|
||||||
<dgc:Org>
|
|
||||||
<dg:chunk>DOCUGAMI INC</dg:chunk>.
|
</docset:Relationships>
|
||||||
</dgc:Org>:
|
</docset:Relationships-section>
|
||||||
</docset:DocugamiInc>
|
</dg:chunk>
|
||||||
|
<dg:chunk structure="li"
|
||||||
|
style="boundingBox:{left: 300.0; top: 2772.0; width: 30.0; height: 45.0; page: 1;}; ">
|
||||||
|
<dg:chunk structure="lim"
|
||||||
|
style="boundingBox:{left: 300.0; top: 2772.0; width: 48.0; height: 45.0; page: 1;}; ">
|
||||||
|
6. </dg:chunk>
|
||||||
|
<docset:Severability-section>
|
||||||
|
<dg:chunk structure="h1">Severability </dg:chunk>
|
||||||
|
<docset:Severability structure="div">If a court finds any
|
||||||
|
provision of this Agreement invalid or unenforceable, the
|
||||||
|
remainder of this Agreement shall be interpreted so as best
|
||||||
|
to effect the intent of the parties.
|
||||||
|
|
||||||
|
</docset:Severability>
|
||||||
|
</docset:Severability-section>
|
||||||
|
</dg:chunk>
|
||||||
|
<dg:chunk structure="li"
|
||||||
|
style="boundingBox:{left: 300.0; top: 309.0; width: 30.0; height: 45.0; page: 2;}; ">
|
||||||
|
<dg:chunk structure="lim"
|
||||||
|
style="boundingBox:{left: 300.0; top: 309.0; width: 48.0; height: 45.0; page: 2;}; ">
|
||||||
|
7. </dg:chunk>
|
||||||
|
<docset:Integration-section>
|
||||||
|
<dg:chunk structure="h1">Integration </dg:chunk>
|
||||||
|
<docset:Integration structure="div">This Agreement expresses the
|
||||||
|
complete understanding of the parties with respect to the
|
||||||
|
subject matter and supersedes all prior proposals,
|
||||||
|
agreements, representations, and understandings. This
|
||||||
|
Agreement may not be amended except in writing signed by
|
||||||
|
both parties.
|
||||||
|
|
||||||
|
</docset:Integration>
|
||||||
|
</docset:Integration-section>
|
||||||
|
</dg:chunk>
|
||||||
|
<dg:chunk structure="li"
|
||||||
|
style="boundingBox:{left: 300.0; top: 573.0; width: 30.0; height: 45.0; page: 2;}; ">
|
||||||
|
<dg:chunk structure="lim"
|
||||||
|
style="boundingBox:{left: 300.0; top: 573.0; width: 48.0; height: 45.0; page: 2;}; ">
|
||||||
|
8. </dg:chunk>
|
||||||
|
<docset:Waiver-section>
|
||||||
|
<dg:chunk structure="h1">Waiver </dg:chunk>
|
||||||
|
<docset:Waiver structure="div">The failure to exercise any right
|
||||||
|
provided in this Agreement shall not be a waiver of prior or
|
||||||
|
subsequent rights.
|
||||||
|
|
||||||
|
</docset:Waiver>
|
||||||
|
</docset:Waiver-section>
|
||||||
|
</dg:chunk>
|
||||||
|
<dg:chunk structure="li"
|
||||||
|
style="boundingBox:{left: 300.0; top: 720.0; width: 30.0; height: 48.0; page: 2;}; ">
|
||||||
|
<dg:chunk structure="lim"
|
||||||
|
style="boundingBox:{left: 300.0; top: 720.0; width: 48.0; height: 48.0; page: 2;}; ">
|
||||||
|
9. </dg:chunk>
|
||||||
|
<docset:NoticeofImmunity-section>
|
||||||
|
<dg:chunk structure="h1">Notice of Immunity </dg:chunk>
|
||||||
|
<docset:NoticeofImmunity structure="div">Employee is provided
|
||||||
|
notice that an individual shall not be held criminally or
|
||||||
|
civilly liable under any federal or state trade secret law
|
||||||
|
for the disclosure of a trade secret that is made (i) in
|
||||||
|
confidence to a federal, state, or local government
|
||||||
|
official, either directly or indirectly, or to an attorney;
|
||||||
|
and (ii) solely for the purpose of reporting or
|
||||||
|
investigating a suspected violation of law.
|
||||||
|
|
||||||
|
</docset:NoticeofImmunity>
|
||||||
|
</docset:NoticeofImmunity-section>
|
||||||
|
</dg:chunk>
|
||||||
|
<dg:chunk structure="li"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1041.0; width: 30.0; height: 45.0; page: 2;}; ">
|
||||||
|
<dg:chunk structure="lim"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1041.0; width: 81.0; height: 45.0; page: 2;}; ">
|
||||||
|
10. </dg:chunk>
|
||||||
|
<dg:chunk>Table of <dg:chunk>Authorized Disclosures </dg:chunk>
|
||||||
|
|
||||||
|
</dg:chunk>
|
||||||
|
</dg:chunk>
|
||||||
|
</docset:ConfidentialityObligations>
|
||||||
|
<dg:chunk>
|
||||||
|
<docset:AuthorizedRecipients structure="p">The following table outlines
|
||||||
|
individuals who are authorized to receive <dg:chunk>Confidential
|
||||||
|
Information</dg:chunk>, their role, and the purpose of disclosure: </docset:AuthorizedRecipients>
|
||||||
|
|
||||||
|
<docset:TableofAuthorizedDisclosures>
|
||||||
|
<xhtml:table structure="table"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1272.0; width: 2040.0; height: 372.0; page: 2;}; ">
|
||||||
|
<xhtml:tbody structure="tbody"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1272.0; width: 2040.0; height: 372.0; page: 2;}; ">
|
||||||
|
<xhtml:tr structure="tr"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1272.0; width: 2040.0; height: 93.0; page: 2;}; ">
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1272.0; width: 603.0; height: 93.0; page: 2;}; ">
|
||||||
|
<dg:chunk>Authorized Individual </dg:chunk>
|
||||||
|
|
||||||
</xhtml:td>
|
</xhtml:td>
|
||||||
<xhtml:td style="boundingBox:{left: 1085.4; top: 529.2; width: 1190.0; height: 91.0; page: 4;}; ">
|
<xhtml:td structure="td"
|
||||||
<docset:DOCUGAMIINC structure="h1">
|
style="boundingBox:{left: 924.0; top: 1272.0; width: 114.0; height: 93.0; page: 2;}; ">
|
||||||
<dgc:Person>Leonarda Hosler</dgc:Person>:
|
Role
|
||||||
</docset:DOCUGAMIINC>
|
|
||||||
|
</xhtml:td>
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 1338.0; top: 1272.0; width: 1002.0; height: 93.0; page: 2;}; ">Purpose
|
||||||
|
of Disclosure
|
||||||
|
|
||||||
</xhtml:td>
|
</xhtml:td>
|
||||||
</xhtml:tr>
|
</xhtml:tr>
|
||||||
<xhtml:tr style="boundingBox:{left: 316.7; top: 620.2; width: 1958.8; height: 156.7; page: 4;}; ">
|
<xhtml:tr structure="tr"
|
||||||
<xhtml:td style="boundingBox:{left: 316.7; top: 620.2; width: 768.8; height: 156.7; page: 4;}; ">
|
style="boundingBox:{left: 300.0; top: 1365.0; width: 2040.0; height: 93.0; page: 2;}; ">
|
||||||
<docset:DOCUGAMIINCSignatuRe style="boundingBox:{left: 316.7; top: 620.2; width: 768.8; height: 156.7; page: 4;}; ">
|
<xhtml:td structure="td"
|
||||||
<dg:chunk>Signatu </dg:chunk>re:
|
style="boundingBox:{left: 300.0; top: 1365.0; width: 603.0; height: 93.0; page: 2;}; ">
|
||||||
</docset:DOCUGAMIINCSignatuRe>
|
<docset:AuthorizedIndividualJohnSmith>
|
||||||
|
<docset:Name>John Smith </docset:Name>
|
||||||
|
|
||||||
|
</docset:AuthorizedIndividualJohnSmith>
|
||||||
</xhtml:td>
|
</xhtml:td>
|
||||||
<xhtml:td style="boundingBox:{left: 1085.4; top: 620.2; width: 1190.0; height: 156.7; page: 4;}; ">
|
<xhtml:td structure="td"
|
||||||
<docset:LeonardaHosler style="boundingBox:{left: 1085.4; top: 620.2; width: 1190.0; height: 156.7; page: 4;}; ">
|
style="boundingBox:{left: 903.0; top: 1365.0; width: 435.0; height: 93.0; page: 2;}; ">
|
||||||
<dg:chunk>Signatu </dg:chunk>re:
|
<docset:JohnSmithRole>
|
||||||
</docset:LeonardaHosler>
|
<docset:ProjectManagerName>Project Manager </docset:ProjectManagerName>
|
||||||
|
|
||||||
|
</docset:JohnSmithRole>
|
||||||
|
</xhtml:td>
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 1338.0; top: 1365.0; width: 1002.0; height: 93.0; page: 2;}; ">
|
||||||
|
<docset:JohnSmithPurposeofDisclosure>
|
||||||
|
<dg:chunk structure="p">Oversee project to which
|
||||||
|
the NDA relates </dg:chunk>
|
||||||
|
|
||||||
|
</docset:JohnSmithPurposeofDisclosure>
|
||||||
|
</xhtml:td>
|
||||||
|
</xhtml:tr>
|
||||||
|
<xhtml:tr structure="tr"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1458.0; width: 2040.0; height: 93.0; page: 2;}; ">
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1458.0; width: 603.0; height: 93.0; page: 2;}; ">
|
||||||
|
<docset:AuthorizedIndividualLisaWhite>
|
||||||
|
<docset:Author>Lisa White </docset:Author>
|
||||||
|
|
||||||
|
</docset:AuthorizedIndividualLisaWhite>
|
||||||
|
</xhtml:td>
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 903.0; top: 1458.0; width: 435.0; height: 93.0; page: 2;}; ">
|
||||||
|
<docset:LisaWhiteRole>
|
||||||
|
<dg:chunk>Lead Developer </dg:chunk>
|
||||||
|
|
||||||
|
</docset:LisaWhiteRole>
|
||||||
|
</xhtml:td>
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 1338.0; top: 1458.0; width: 1002.0; height: 93.0; page: 2;}; ">
|
||||||
|
<docset:LisaWhitePurposeofDisclosure>Software
|
||||||
|
development and analysis
|
||||||
|
|
||||||
|
</docset:LisaWhitePurposeofDisclosure>
|
||||||
|
</xhtml:td>
|
||||||
|
</xhtml:tr>
|
||||||
|
<xhtml:tr structure="tr"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1551.0; width: 2040.0; height: 93.0; page: 2;}; ">
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 300.0; top: 1551.0; width: 603.0; height: 93.0; page: 2;}; ">
|
||||||
|
<docset:AuthorizedIndividualMichaelBrown>
|
||||||
|
<docset:Name>Michael Brown </docset:Name>
|
||||||
|
|
||||||
|
</docset:AuthorizedIndividualMichaelBrown>
|
||||||
|
</xhtml:td>
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 903.0; top: 1551.0; width: 435.0; height: 93.0; page: 2;}; ">
|
||||||
|
<docset:MichaelBrownRole>
|
||||||
|
<dg:chunk>Financial <docset:FinancialAnalyst>
|
||||||
|
Analyst </docset:FinancialAnalyst></dg:chunk>
|
||||||
|
|
||||||
|
</docset:MichaelBrownRole>
|
||||||
|
</xhtml:td>
|
||||||
|
<xhtml:td structure="td"
|
||||||
|
style="boundingBox:{left: 1338.0; top: 1551.0; width: 1002.0; height: 93.0; page: 2;}; ">
|
||||||
|
<docset:MichaelBrownPurposeofDisclosure>Financial
|
||||||
|
analysis and reporting </docset:MichaelBrownPurposeofDisclosure>
|
||||||
</xhtml:td>
|
</xhtml:td>
|
||||||
</xhtml:tr>
|
</xhtml:tr>
|
||||||
</xhtml:tbody>
|
</xhtml:tbody>
|
||||||
</xhtml:table>
|
</xhtml:table>
|
||||||
</docset:DocugamiInc>
|
</docset:TableofAuthorizedDisclosures>
|
||||||
<docset:JeanPaoliName>
|
</dg:chunk>
|
||||||
<docset:JeanPaoliName style="boundingBox:{left: 316.7; top: 858.3; width: 1958.8; height: 189.1; page: 4;}; ">
|
</dg:chunk>
|
||||||
<xhtml:table style="boundingBox:{left: 316.7; top: 858.3; width: 1958.8; height: 189.1; page: 4;}; ">
|
</dg:chunk>
|
||||||
<xhtml:tbody style="boundingBox:{left: 316.7; top: 858.3; width: 1958.8; height: 189.1; page: 4;}; ">
|
</docset:ReceivingParty>
|
||||||
<xhtml:tr style="boundingBox:{left: 316.7; top: 858.3; width: 1958.8; height: 91.7; page: 4;}; ">
|
</docset:ReceivingParty-section>
|
||||||
<xhtml:td style="boundingBox:{left: 316.7; top: 858.3; width: 229.2; height: 91.7; page: 4;}; ">
|
<docset:INWITNESSWHEREOF-section>
|
||||||
<dg:chunk structure="h1">Name: </dg:chunk>
|
<dg:chunk structure="h1"> IN <dg:chunk>WITNESS WHEREOF</dg:chunk>, </dg:chunk>
|
||||||
</xhtml:td>
|
<docset:INWITNESSWHEREOF structure="div">the Parties have executed this Non-Disclosure
|
||||||
<xhtml:td style="boundingBox:{left: 545.8; top: 858.3; width: 564.6; height: 91.7; page: 4;}; ">
|
Agreement as of the <dg:chunk>Effective Date </dg:chunk>first above written. </docset:INWITNESSWHEREOF>
|
||||||
<dgc:Person style="boundingBox:{left: 545.8; top: 858.3; width: 564.6; height: 91.7; page: 4;}; ">Jean Paoli </dgc:Person>
|
</docset:INWITNESSWHEREOF-section>
|
||||||
</xhtml:td>
|
</dg:chunk>
|
||||||
<xhtml:td style="boundingBox:{left: 1110.4; top: 858.3; width: 1165.0; height: 91.7; page: 4;}; ">
|
<docset:WidgetCorp-section>
|
||||||
<dg:chunk structure="h1">Name: </dg:chunk>
|
<dg:chunk structure="h1">
|
||||||
</xhtml:td>
|
|
||||||
</xhtml:tr>
|
<docset:CompanyName>Widget Corp. </docset:CompanyName>
|
||||||
<xhtml:tr style="boundingBox:{left: 316.7; top: 950.0; width: 1958.8; height: 97.4; page: 4;}; ">
|
</dg:chunk>
|
||||||
<xhtml:td style="boundingBox:{left: 316.7; top: 950.0; width: 229.2; height: 97.4; page: 4;}; ">
|
<docset:By-section structure="div">
|
||||||
<docset:NameTitle structure="h1">Title: </docset:NameTitle>
|
<dg:chunk structure="h1">
|
||||||
</xhtml:td>
|
|
||||||
<xhtml:td style="boundingBox:{left: 545.8; top: 950.0; width: 564.6; height: 97.4; page: 4;}; ">
|
By: </dg:chunk>
|
||||||
<docset:TitleJeanPaoli style="boundingBox:{left: 545.8; top: 950.0; width: 564.6; height: 97.4; page: 4;}; ">CEO </docset:TitleJeanPaoli>
|
<docset:By structure="div">_____________________________ </docset:By>
|
||||||
</xhtml:td>
|
</docset:By-section>
|
||||||
<xhtml:td style="boundingBox:{left: 1110.4; top: 950.0; width: 1165.0; height: 97.4; page: 4;}; ">
|
</docset:WidgetCorp-section>
|
||||||
<docset:Name style="boundingBox:{left: 1110.4; top: 950.0; width: 1165.0; height: 97.4; page: 4;}; ">
|
<dg:chunk structure="h1"> Name: <docset:Name>Alan Black </docset:Name></dg:chunk>
|
||||||
<docset:Title structure="h1">Title: </docset:Title>
|
<dg:chunk>
|
||||||
</docset:Name>
|
<dg:chunk structure="h1"> Title: <docset:ChiefExecutiveOfficer>Chief Executive Officer </docset:ChiefExecutiveOfficer></dg:chunk>
|
||||||
</xhtml:td>
|
<docset:Date-section structure="div">
|
||||||
</xhtml:tr>
|
<dg:chunk structure="h1">
|
||||||
</xhtml:tbody>
|
|
||||||
</xhtml:table>
|
Date: </dg:chunk>
|
||||||
</docset:JeanPaoliName>
|
<docset:Date structure="div">___________________________ </docset:Date>
|
||||||
</docset:JeanPaoliName>
|
</docset:Date-section>
|
||||||
</docset:DocugamiInc>
|
</dg:chunk>
|
||||||
</docset:INWITNESSWHEREOF>
|
<docset:Recipient-section>
|
||||||
</docset:SIGNATUREPAGEFOLLOWS>
|
<dg:chunk structure="h1">
|
||||||
</docset:SIGNATUREPAGEFOLLOWS-section>
|
|
||||||
</docset:Effect>
|
Recipient </dg:chunk>
|
||||||
</docset:Effect>
|
<docset:By-section structure="div">
|
||||||
</docset:MUTUALNON-DISCLOSUREAGREEMENT>
|
<dg:chunk structure="h1">
|
||||||
</docset:MUTUALNON-DISCLOSUREAGREEMENT-section>
|
|
||||||
|
By: </dg:chunk>
|
||||||
|
<docset:By structure="div">_____________________________ </docset:By>
|
||||||
|
</docset:By-section>
|
||||||
|
</docset:Recipient-section>
|
||||||
|
<docset:NameJaneDoe-section>
|
||||||
|
<dg:chunk structure="h1"> Name: <docset:Name>Jane Doe </docset:Name></dg:chunk>
|
||||||
|
<docset:Date-section structure="div">
|
||||||
|
<dg:chunk structure="h1">
|
||||||
|
|
||||||
|
Date: </dg:chunk>
|
||||||
|
<docset:Date structure="div">___________________________</docset:Date>
|
||||||
|
</docset:Date-section>
|
||||||
|
</docset:NameJaneDoe-section>
|
||||||
|
</dg:chunk>
|
@ -8,19 +8,18 @@ from langchain.document_loaders import DocugamiLoader
|
|||||||
DOCUGAMI_XML_PATH = Path(__file__).parent / "test_data" / "docugami-example.xml"
|
DOCUGAMI_XML_PATH = Path(__file__).parent / "test_data" / "docugami-example.xml"
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.requires("lxml")
|
@pytest.mark.requires("dgml_utils")
|
||||||
def test_docugami_loader_local() -> None:
|
def test_docugami_loader_local() -> None:
|
||||||
"""Test DocugamiLoader."""
|
"""Test DocugamiLoader."""
|
||||||
loader = DocugamiLoader(file_paths=[DOCUGAMI_XML_PATH])
|
loader = DocugamiLoader(file_paths=[DOCUGAMI_XML_PATH])
|
||||||
docs = loader.load()
|
docs = loader.load()
|
||||||
|
|
||||||
assert len(docs) == 19
|
assert len(docs) == 25
|
||||||
|
|
||||||
xpath = docs[0].metadata.get("xpath")
|
assert "/docset:DisclosingParty" in docs[1].metadata["xpath"]
|
||||||
assert str(xpath).endswith("/docset:Preamble")
|
assert "h1" in docs[1].metadata["structure"]
|
||||||
assert docs[0].metadata["structure"] == "p"
|
assert "DisclosingParty" in docs[1].metadata["tag"]
|
||||||
assert docs[0].metadata["tag"] == "Preamble"
|
assert docs[1].page_content.startswith("Disclosing")
|
||||||
assert docs[0].page_content.startswith("MUTUAL NON-DISCLOSURE AGREEMENT")
|
|
||||||
|
|
||||||
|
|
||||||
def test_docugami_initialization() -> None:
|
def test_docugami_initialization() -> None:
|
||||||
|
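The updated assertions in `test_docugami_loader_local` read the `xpath`, `structure`, and `tag` keys from each document's metadata. As a rough illustration of where values like these could come from, the sketch below walks a small DGML-like fragment (adapted from the Severability section of the test data) using only the standard library. It is not how `DocugamiLoader` or `dgml-utils` work internally: the namespace URIs, the `prefixed` and `walk` helpers, and the rule of emitting one record per element with a `structure` attribute are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, chosen only for this sketch.
NS = {
    "dg": "http://example.com/dg",
    "docset": "http://example.com/docset",
}

SAMPLE = """
<docset:Severability-section xmlns:dg="http://example.com/dg"
                             xmlns:docset="http://example.com/docset">
  <dg:chunk structure="h1">Severability </dg:chunk>
  <docset:Severability structure="div">If a court finds any
    provision of this Agreement invalid or unenforceable, the
    remainder of this Agreement shall be interpreted so as best
    to effect the intent of the parties.
  </docset:Severability>
</docset:Severability-section>
"""


def prefixed(tag):
    """Turn ElementTree's '{uri}local' form into 'prefix:local' using NS."""
    if not tag.startswith("{"):
        return tag
    uri, _, local = tag[1:].partition("}")
    for prefix, known_uri in NS.items():
        if known_uri == uri:
            return f"{prefix}:{local}"
    return local


def walk(elem, path=""):
    """Yield (xpath, tag, structure, text) for each element that has a structure attribute."""
    name = prefixed(elem.tag)
    here = f"{path}/{name}"
    structure = elem.get("structure")
    if structure:
        # Collapse the whitespace left over from pretty-printed XML.
        text = " ".join("".join(elem.itertext()).split())
        yield here, name.split(":")[-1], structure, text
    for child in elem:
        yield from walk(child, here)


records = list(walk(ET.fromstring(SAMPLE.strip())))
for xpath, tag, structure, text in records:
    print(xpath, tag, structure, text[:40], sep=" | ")
```

Each record pairs a path from the root with the element's local tag name and its `structure` attribute, which is the kind of information the test checks for with `in` and `startswith` rather than exact equality.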