
Deploying Kosh's Backend

Input: Data module

A Kosh data module consists of:

  1. Lexical data in XML
  2. Config file in JSON
  3. '.kosh' file

Lexical data in XML

You can add any valid XML file to Kosh. The following entry belongs to the Basque dictionary Hiztegi Batua, compiled by the Academy of the Basque Language, Euskaltzaindia. The full dictionary is also available in the Kosh Data repository.

  <entry id="13">
    <form>
      <orth>abadetasun</orth>
    </form>
    <sense n="1">
      <gramGrp>
        <pos>
          <q>iz.</q>
        </pos>
      </gramGrp>
      <def>monasterioko buruaren kargua eta egitekoa</def>
    </sense>
    <sense n="2">
      <gramGrp>
        <pos>
          <q>iz.</q>
        </pos>
      </gramGrp>
      <def>apaizgoa</def>
      <usg type="geo">
        <q>Bizk.</q>
      </usg>
    </sense>
  </entry>

JSON Configuration File

Kosh uses XPath 1.0 notation in a JSON configuration file to specify which XML nodes and subnodes to index.

Specifying XML Nodes and Subnodes

To define which XML nodes and subnodes are indexed, you create a JSON configuration file. In it, each node is identified by an XPath expression: root selects the elements that become index documents (the entries), while the expressions under fields are evaluated relative to that root to extract the values to index.
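
For instance, given the Hiztegi Batua entry shown above, the two central expressions of the configuration file (shown in full below) resolve as follows:

"root": "//entry"          every <entry> element becomes one index document
"lemma": "./form/orth"     evaluated relative to an entry; for entry 13 it yields "abadetasun"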

Indexing Arrays of Elements

Elasticsearch inherently supports indexing arrays of elements. When deploying Kosh, it is important to inform both Kosh and Elasticsearch about array indexing. To do so, enclose the respective field name in the fields property in square brackets. For example, if the field you want to index as an array is named sense_def, specify it as [sense_def] within the fields property.
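
The sample entry above contains two sense elements, so a bracketed field collects both definitions into one array:

"[sense_def]": "./sense/def"

For entry 13, this indexes sense_def as ["monasterioko buruaren kargua eta egitekoa", "apaizgoa"].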

String Indexing Options

Kosh provides two options for indexing strings, depending on your requirements (both are illustrated after this list):

  1. Saving Strings as They Are: If you want to index strings without any preprocessing, preserving their original form, you should set the "type" property to "keyword". This ensures that Elasticsearch stores the strings as exact values without applying any analysis.

  2. Preprocessing Strings: If you wish to analyze strings before indexing them, allowing Elasticsearch to process and tokenize them, you should set the "type" property to "text". By doing so, Elasticsearch will apply its default analysis process to the strings, making them searchable based on the generated tokens.
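
Applied to the sample configuration below, the two options have the following effect (a sketch of standard Elasticsearch behavior for these field types):

"lemma":     { "type": "keyword" }   stored verbatim; only the exact string "abadetasun" matches
"sense_def": { "type": "text" }      analyzed into tokens, so a search for "kargua" matches the first definition of entry 13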

Automatic Entry IDs

If your XML dictionary entries do not have unique identifiers (IDs), Kosh can automatically generate them for you. This ensures that each entry in the Elasticsearch index has a unique identifier for easy retrieval and referencing.

Indexing Process and XML Tags

By default, Kosh indexes the entire XML entry. Note, however, that the XML tags themselves are not analyzed, so you cannot search for tag names within the indexed data: for the sample entry, a query for "abadetasun" finds the entry, while a query for "orth" does not. Indexing focuses on the content within the tags, which becomes searchable according to the specified indexing options.

Configuration file (hiztegibatua_mapping.json) for the hiztegibatua dictionary:

{
  "mappings": {
    "_meta": {
      "_xpaths": {
        "id": "./@id",
        "root": "//entry",
        "fields": {
          "lemma": "./form/orth",
          "[sense_def]": "./sense/def",
          "[sense_pos]": "./sense/gramGrp/pos/q",
          "[dicteg]": "./sense/dicteg/q"
        }
      }
    },
    "properties": {
      "lemma": {
        "type": "keyword"
      },
      "sense_def": {
        "type": "text"
      },
      "sense_pos": {
        "type": "text"
      },
      "dicteg": {
        "type": "text"
      }
    }
  }
}
 

'.kosh' file

To enable Kosh to work with your data, you must create a '.kosh' file in each data module. This file informs Kosh about:

  • The designated index name for your dataset.
  • The location of your XML data.
  • The location of your configuration file.

For instance, the '.kosh' file for hiztegibatua would look like this:

[hiztegibatua]
files: ["hiztegibatua.xml"]
schema: hiztegibatua_mapping.json

Important Notice: Dataset Naming Guidelines

To ensure consistency and ease of use, adhere to the following guidelines when naming your datasets:

  • Use ASCII letters: Limit dataset names to characters from a-z and A-Z. Avoid special characters or symbols.

  • Default to lowercase: It is strongly recommended to use lowercase letters as the default case for dataset names. This helps maintain uniformity and improves readability.

  • Underscores allowed: You may include underscores (_) within dataset names to separate meaningful components, although we recommend avoiding them where possible.

Dataset names such as H_N and H_n are therefore valid, but the recommended form is lowercase without underscores, as in hn.

Multiple Files

In cases where your dictionary is split across multiple files, the '.kosh' file should list all files, as illustrated below:

[de_alcedo]
files: ["alcedo-1.tei", "alcedo-2.tei", "alcedo-3.tei", "alcedo-4.tei", "alcedo-5.tei"]
schema: de_alcedo_mapping.json

Adding Custom Metadata

Kosh allows you to add custom metadata to your data modules. This metadata is indexed along with your data, so you can also search for it. To add custom metadata, open the '.kosh' file in your data module and add the metadata as follows:

[hoenig]
files: ["hoenig.tei"]
schema: hoenig_mapping.json
title: "Wörterbuch der Kölner Mundart"
authors: ["Fritz Hönig"]
source_languages: ["ksh"]
target_languages: ["deu"]

Kosh is flexible and adapts to your specific needs. Therefore, you can append as many metadata fields as necessary for your project.

To insert multiple values into a metadata field, separate the values with commas and enclose the list in square brackets. For example:

source_languages: ["afr", "deu"]

If you want to add a single value, you can do so without brackets:

year: "1922"

Kosh Deployment

Kosh can be deployed natively on Unix-like systems or through Docker. We strongly recommend Docker for the most efficient setup and management.

Keep in mind: when deployed natively on Linux, Kosh offers data synchronization: any change to a file within a data module triggers an automatic update of the index. This feature is currently not available on macOS.

With Docker

Procedure:

  1. git clone https://github.com/cceh/kosh

  2. cd kosh

  3. In docker-compose.override.yml, specify the path to your data modules, i.e. replace ../kosh_data/hoenig:

        
    version: '2.3'
     
    services:
      elastic:
        ## Uncomment the next line when the host network should be used.
        # network_mode: host
     
        ## Uncomment the next line when deploying in production.
        # restart: always
     
      kosh:
        ## Uncomment the next line when the host network should be used.
        # network_mode: host
     
        ## Uncomment the next line when deploying in production.
        # restart: always
     
        ## volumes:
        ##   - PATH_TO_KOSH_INI_FILE:/etc/kosh.ini:ro
        ##   - PATH_TO_XML_LEXICAL_DATA_WITH_KOSH_FILES:/var/lib/kosh:ro
        volumes:
          - ./kosh.ini:/etc/kosh.ini:ro
          - ../kosh_data/hoenig:/var/lib/kosh:ro
     
        command: ['--config_file', '/etc/kosh.ini']
  4. sudo docker-compose up -d

To check the logs:

sudo docker-compose logs
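
To verify that a data module was indexed, you can query Elasticsearch directly. This assumes that Elasticsearch's default port 9200 is reachable from the host, for example when network_mode: host is enabled in docker-compose.override.yml:

curl http://localhost:9200/_cat/indices

The index name declared in your '.kosh' file, e.g. hoenig, should appear in the list.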

To stop and redeploy:

sudo docker-compose down

With pip (pyproject.toml)

  1. You must have Python >= 3.11 installed (as declared in pyproject.toml).

  2. Execute pip install <path>/<to>/<kosh>
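
After installation, Kosh is started with a configuration file, analogous to the command line in the Docker setup above, which passes --config_file to the Kosh entry point. The kosh command name below is an assumption; check the project's pyproject.toml for the actual console script:

kosh --config_file <path>/<to>/kosh.ini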

Sample datasets: Kosh Data

In the Kosh Data repository, you'll discover a variety of datasets designed for deployment with Kosh. Each dataset contains the necessary files that Kosh requires for operation:

  1. Lexical data encoded in XML
  2. A JSON configuration file
  3. A '.kosh' file

You are encouraged to review these datasets as examples; they serve as a useful guide when setting up and configuring your own Kosh data modules.