# wiki_dump_extractor

**Repository Path**: mirrors_Zulko/wiki_dump_extractor

## Basic Information

- **Project Name**: wiki_dump_extractor
- **Description**: Code to extract and process Wikipedia pages from an XML dump
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-03
- **Last Updated**: 2026-02-15

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Wiki dump extractor

A Python library to extract and analyze pages from a wiki dump. This library is used in particular in the [Landnotes](https://github.com/Zulko/landnotes) project to extract and analyze pages from the Wikipedia dump.

The project is hosted on [GitHub](https://github.com/zulko/wiki_dump_extractor) and the HTML documentation is available [here](https://zulko.github.io/wiki_dump_extractor/).

## Scope

Make Wikipedia dumps easier to work with:

- Extract pages from a wiki dump
- Be easy to install and run
- Be fast (can iterate over 50,000 pages/second using Avro)
- Be memory efficient
- Allow for batch processing and parallel processing

Provide utilities for page analysis (a quick sketch is shown at the end of the usage examples below):

- Date parsing
- Section extraction
- Text cleaning
- and more.

## Usage

To simply iterate over the pages in the dump:

```python
from wiki_dump_extractor import WikiDumpExtractor

dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
for page in extractor.iter_pages(limit=1000):
    print(page.title)
```

To extract the pages in batches (here we save the pages to separate CSV files):

```python
import pandas

from wiki_dump_extractor import WikiDumpExtractor

dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)
batches = extractor.iter_page_batches(batch_size=1000, limit=10)
for i, batch in enumerate(batches):
    df = pandas.DataFrame([page.to_dict() for page in batch])
    df.to_csv(f"batch_{i}.csv")
```
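Since `iter_pages` yields page objects one by one, page-analysis steps like the date parsing and text cleaning mentioned in the Scope section can simply be applied inside the loop. The snippet below is a minimal, self-contained sketch of that pattern; it uses a naive regular expression as a stand-in for the library's own date-parsing and text-cleaning helpers (see the HTML documentation for their actual names) and assumes each page exposes its raw wikitext as `page.text`.

```python
import re

from wiki_dump_extractor import WikiDumpExtractor

dump_file = "enwiki-20220301-pages-articles-multistream.xml.bz2"
extractor = WikiDumpExtractor(file_path=dump_file)

# Naive stand-in for the library's date-parsing utility: match 4-digit years.
year_pattern = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

for page in extractor.iter_pages(limit=100):
    text = page.text or ""  # assumes pages expose their raw wikitext as .text
    years = sorted(set(year_pattern.findall(text)))
    print(page.title, years[:5])
```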
### Converting the dump to Avro

There are many reasons why you might want to convert the dump to Avro. The original `xml.bz2` dump is 22 GB but very slow to read (~250 pages/second). The uncompressed dump is 107 GB and relatively fast to read (this library uses lxml, which parses thousands of pages per second), but 50% of the pages in it are empty redirect pages.

The following code converts the dump to a 28 GB Avro dump that contains only the 12 million real pages, stores the redirects in a fast LMDB database, and creates an index for quick page lookups. The operation takes ~40 minutes depending on your machine.

```python
from wiki_dump_extractor import WikiXmlDumpExtractor

file_path = "enwiki-20250201-pages-articles-multistream.xml"
extractor = WikiXmlDumpExtractor(file_path=file_path)
ignored_fields = ["timestamp", "page_id", "revision_id", "redirect_title"]
extractor.extract_pages_to_avro(
    output_file="wiki_dump.avro",
    redirects_db_path="redirects.lmdb",  # LMDB database for fast redirect lookups
    ignored_fields=ignored_fields,
)
```

Then index the pages for fast lookups:

```python
from wiki_dump_extractor import WikiAvroDumpExtractor

extractor = WikiAvroDumpExtractor(file_path="wiki_dump.avro")
extractor.index_pages(page_index_db="page_index.lmdb")
```

Later on, read the Avro file and use the redirects and the index as follows (reading all 12 million pages takes ~3-4 minutes depending on your machine):

```python
from wiki_dump_extractor import WikiAvroDumpExtractor

# Create the extractor
extractor = WikiAvroDumpExtractor(
    file_path="wiki_dump.avro",
    index_dir="page_index.lmdb",  # Use the index for faster lookups
)

# Get pages with automatic redirect resolution
pages = extractor.get_page_batch_by_title(
    ["Page Title 1", "Page Title 2"]
)
```

## Installation

```bash
pip install wiki-dump-extractor
```

Or from the source in development mode:

```bash
pip install -e .
```

To use the LLM-specific module (mostly relevant if you are working on a project like Landnotes), use:

```bash
pip install wiki-dump-extractor[llm]
```

Or locally:

```bash
pip install -e ".[llm]"
```

To install with the test dependencies, use `pip install -e ".[dev]"`, then run the tests with `pytest` in the root directory.

### Requirements for running the LLM utils

```bash
# Add the Cloud SDK distribution URI as a package source
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

# Import the Google Cloud public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -

# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk
```
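Installing the SDK alone may not be enough: assuming the LLM utilities call Google-hosted models through Google Cloud APIs (the README does not spell this out), you will likely also need credentials on the machine. A typical flow is:

```bash
# Authenticate and create application-default credentials
# (assumes the LLM utilities go through Google Cloud APIs)
gcloud init
gcloud auth application-default login
```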