# bible-corpus

**Repository Path**: mirrors_alvations/bible-corpus

## Basic Information

- **Project Name**: bible-corpus
- **Description**: A multilingual parallel corpus created from translations of the Bible.
- **Primary Language**: Unknown
- **License**: CC0-1.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-02-06
- **Last Updated**: 2026-05-23

## README

# bible-corpus
A multilingual parallel corpus created from translations of the Bible.

This is an effort to build a parallel corpus covering as many languages as possible, for use in 
a range of NLP tasks. The corpus is aligned using the Book, Chapter and Verse indices, which gives an (almost) 
sentence-level alignment: in some cases two verses in one language are translated as a single verse in another.

Following [a similar effort](http://www.umiacs.umd.edu/~resnik/parallel/bible.html) by Philip Resnik and Mari Broman 
Olsen at the University of Maryland, I have encoded the text of each language in XML files using the
[Corpus Encoding Standard](http://www.cs.vassar.edu/CES/). 
Refer to the following paper for more details about the creation of the corpus:

* [A massively parallel corpus: the Bible in 100 languages](http://link.springer.com/article/10.1007/s10579-014-9287-y),
Christos Christodoulopoulos and Mark Steedman, *Language Resources and Evaluation*, 49 (2)

[Armin Hoenen](https://www.hucompute.org/team/armin-hoenen) from the [Text Technology Lab](https://www.hucompute.org/) 
at the Goethe Universität has created tokenised versions of four languages 
(Chinese, Japanese, Thai, Vietnamese). They are included in this collection and are also available from the 
[Text Technology Lab corpora page](https://www.hucompute.org/ressourcen/corpora).

If you are looking for a quick way to generate a raw text version of each Bible, you can use the following Python snippet (set `lang` to the name of the XML file, without the `.xml` extension):
```
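import xml.etree.ElementTree as ET

lang = 'English'
# Parse from the file so the encoding declared in the XML header is honoured.
root = ET.parse(lang + '.xml').getroot()
with open(lang + '.txt', 'w', encoding='utf-8') as out:
    # Every verse is stored in a <seg> element; write one verse per line.
    for n in root.iter('seg'):
        # n.text is None for an empty <seg>; write a blank line in that case.
        out.write((n.text or '').strip() + '\n')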
```
or for a specific book:
```
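# Reuses `lang` and `root` from the snippet above.
book_id = 'b.GEN'
with open(lang + '-' + book_id + '.txt', 'w', encoding='utf-8') as out:
    # Select every <seg> nested anywhere under the chosen book <div>.
    for n in root.findall('.//div[@id="' + book_id + '"]//seg'):
        out.write((n.text or '').strip() + '\n')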
```
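
Since the corpus is aligned through the Book, Chapter and Verse indices, verse pairs for two languages can be recovered by matching verse identifiers. The following is a minimal sketch, not part of the original tooling. It assumes each `<seg>` carries an `id` attribute of the form `b.GEN.1.1` (by analogy with the `b.GEN` book id used above), and it uses two tiny made-up documents with placeholder text in place of real corpus files. The helper names `verses_by_id` and `align` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature documents with placeholder text. The real files are
# much larger, and the attribute layout here is an assumption to verify
# against the corpus itself.
SOURCE_XML = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">EN verse 1</seg>
<seg id="b.GEN.1.2" type="verse">EN verse 2</seg>
<seg id="b.GEN.1.3" type="verse">EN verse 3</seg>
</div></div></body></text></cesDoc>"""

TARGET_XML = """<cesDoc><text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">XX verse 1</seg>
<seg id="b.GEN.1.2" type="verse"></seg>
<seg id="b.GEN.1.3" type="verse">XX verse 3</seg>
</div></div></body></text></cesDoc>"""


def verses_by_id(xml_text):
    """Map each verse id to its stripped text, skipping empty verses."""
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter('seg'):
        text = (seg.text or '').strip()
        if text:
            verses[seg.get('id')] = text
    return verses


def align(source_xml, target_xml):
    """Return (id, source_text, target_text) for ids present in both inputs."""
    source = verses_by_id(source_xml)
    target = verses_by_id(target_xml)
    return [(vid, text, target[vid])
            for vid, text in source.items() if vid in target]


pairs = align(SOURCE_XML, TARGET_XML)
for vid, src, tgt in pairs:
    print(vid, src, tgt, sep='\t')
```

Verses that are missing or empty in either language are dropped from the output, which is one simple way to handle the cases where two verses in one language are merged into one in another. For real files, `ET.parse(path).getroot()` can replace `ET.fromstring`.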

Follow this link for [a collection of tools for reading/processing the corpus](https://github.com/christos-c/bible-corpus-tools).