# MediaWikiDump

>[MediaWiki XML Dumps](https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps) contain the content of a wiki (wiki pages with all their revisions), without the site-related data. A XML dump does not create a full backup of the wiki database, the dump does not contain user accounts, images, edit logs, etc.

This notebook covers how to load a MediaWiki XML dump file into a document format that we can use downstream.

It uses `mwxml` from `mediawiki-utilities` to read the dump and `mwparserfromhell` from `earwig` to parse MediaWiki wikicode.

Dump files can be obtained with `dumpBackup.php` or from the `Special:Statistics` page of the wiki.

In [None]:
# mediawiki-utilities supports XML schema 0.11 in unmerged branches
!pip install -qU git+https://github.com/mediawiki-utilities/python-mwtypes@updates_schema_0.11
# mediawiki-utilities mwxml has a bug, fix PR pending
!pip install -qU git+https://github.com/gdedrouas/python-mwxml@xml_format_0.11
!pip install -qU mwparserfromhell

In [3]:
from langchain.document_loaders import MWDumpLoader

In [4]:
loader = MWDumpLoader(
    file_path="example_data/testmw_pages_current.xml",
    encoding="utf8",
    # namespaces=[0, 2, 3],  # optional: load only these namespaces (all namespaces by default)
    skip_redirects=True,  # True: skip pages that only redirect to other pages
    stop_on_error=False,  # False: skip pages that cause parsing errors; True: raise on the first error
)
documents = loader.load()
print(f"You have {len(documents)} document(s) in your data ")

You have 177 document(s) in your data 
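
To make the `namespaces`, `skip_redirects`, and `stop_on_error` options more concrete, here is a minimal standard-library sketch of the kind of filtering they describe. It is not the loader's implementation: the sample XML is made up, and `load_pages`, `page_to_record`, `SAMPLE`, and `XMLNS` are hypothetical names used only for illustration. It assumes each page carries an `<ns>` element and that redirect pages carry a `<redirect>` element.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of a MediaWiki export (made-up data).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Album</title><ns>0</ns>
    <revision><text>Artist and label details</text></revision></page>
  <page><title>LP</title><ns>0</ns><redirect title="Album" />
    <revision><text>#REDIRECT [[Album]]</text></revision></page>
  <page><title>Template:Documentation</title><ns>10</ns>
    <revision><text>Template documentation</text></revision></page>
  <page><title>Broken</title><ns>0</ns></page>
</mediawiki>"""

# Prefix map so element lookups can use the export's XML namespace.
XMLNS = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}


def page_to_record(page):
    """Turn one <page> element into a dict; raise ValueError if it has no text."""
    title = page.findtext("mw:title", default="", namespaces=XMLNS)
    text = page.findtext("mw:revision/mw:text", default=None, namespaces=XMLNS)
    if text is None:
        raise ValueError(f"no revision text in page {title!r}")
    return {"page_content": text, "metadata": {"source": title}}


def load_pages(xml_text, namespaces=None, skip_redirects=False, stop_on_error=True):
    """Filter pages by namespace and redirect status, then convert them."""
    records = []
    for page in ET.fromstring(xml_text).findall("mw:page", XMLNS):
        # Redirect pages are marked with a <redirect> element.
        if skip_redirects and page.find("mw:redirect", XMLNS) is not None:
            continue
        # Keep only the requested namespace numbers, if any were given.
        if namespaces is not None and int(page.find("mw:ns", XMLNS).text) not in namespaces:
            continue
        try:
            records.append(page_to_record(page))
        except ValueError:
            if stop_on_error:
                raise
            # Otherwise skip the page that failed and keep going.
    return records


records = load_pages(SAMPLE, namespaces=[0], skip_redirects=True, stop_on_error=False)
print([r["metadata"]["source"] for r in records])
```

In this sketch the filters run before conversion, so a page that is filtered out can never trigger an error. With `stop_on_error=True`, the same call would raise on the `Broken` page instead of skipping it.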


In [7]:
documents[:5]

[Document(page_content='\t\n\t\n\tArtist\n\tReleased\n\tRecorded\n\tLength\n\tLabel\n\tProducer', metadata={'source': 'Album'}),
 Document(page_content='{| class="article-table plainlinks" style="width:100%;"\n|- style="font-size:18px;"\n! style="padding:0px;" | Template documentation\n|-\n| Note: portions of the template sample may not be visible without values provided.\n|-\n| View or edit this documentation. (About template documentation)\n|-\n| Editors can experiment in this template\'s [ sandbox] and [ test case] pages.\n|}Category:Documentation templates', metadata={'source': 'Documentation'}),
 Document(page_content='Description\nThis template is used to insert descriptions on template pages.\n\nSyntax\nAdd <noinclude></noinclude> at the end of the template page.\n\nAdd <noinclude></noinclude> to transclude an alternative page from the /doc subpage.\n\nUsage\n\nOn the Template page\nThis is the normal format when used:\n\nTEMPLATE CODE\n<includeonly>Any categories to be inserted