
OntoLAMA: Dataset Overview and Usage Guide


Citation

@inproceedings{he2023language,
    title={Language Model Analysis for Ontology Subsumption Inference},
    author={He, Yuan and Chen, Jiaoyan and Jimenez-Ruiz, Ernesto and Dong, Hang and Horrocks, Ian},
    booktitle={Findings of the Association for Computational Linguistics: ACL 2023},
    pages={3439--3453},
    year={2023}
}

This page provides an overview of the \(\textsf{OntoLAMA}\) datasets, how to use them, and the related probing approach introduced in the research paper.

Overview

\(\textsf{OntoLAMA}\) is a set of language model (LM) probing datasets, together with a prompt-based probing method, for ontology subsumption inference (a form of ontology completion). The work follows the "LMs-as-KBs" literature but focuses on conceptualised knowledge extracted from formalised KBs such as OWL ontologies. Specifically, the subsumption inference (SI) task is formulated in the Natural Language Inference (NLI) style: the sub-concept and the super-concept involved in a subsumption axiom are verbalised and fitted into a template to form the premise and hypothesis, respectively. The sampled axioms are verified through ontology reasoning. The SI task is further divided into Atomic SI, which involves only atomic named concepts, and Complex SI, which involves both atomic and complex concepts. Real-world ontologies of different scales and domains are used to construct OntoLAMA; in total there are four Atomic SI datasets and two Complex SI datasets.
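To make the NLI-style formulation concrete, a verified subsumption axiom such as "lung disease ⊑ disease" can be turned into a premise–hypothesis pair. The sketch below is illustrative only; the template wording is an assumption, not one of the paper's actual templates:

```python
# Simplified sketch of the NLI-style formulation; the template wording is an
# illustrative assumption, not one of the paper's actual templates.

def to_nli_pair(sub_concept: str, super_concept: str) -> tuple[str, str]:
    """Turn verbalised sub-/super-concepts into an NLI premise and hypothesis."""
    premise = f"X is {sub_concept}."       # from the verbalised sub-concept
    hypothesis = f"X is {super_concept}."  # from the verbalised super-concept
    return premise, hypothesis

# e.g. for the subsumption "lung disease ⊑ disease"
premise, hypothesis = to_nli_pair("a lung disease", "a disease")
print(premise)     # X is a lung disease.
print(hypothesis)  # X is a disease.
```

A positive (entailment) label then corresponds to the subsumption holding, and a negative label to it not holding.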

Statistics

| Source     | #NamedConcepts | #EquivAxioms | Dataset (Train/Dev/Test)            |
|:-----------|---------------:|-------------:|:------------------------------------|
| Schema.org | 894            | -            | Atomic SI: 808/404/2,830            |
| DOID       | 11,157         | -            | Atomic SI: 90,500/11,312/11,314     |
| FoodOn     | 30,995         | 2,383        | Atomic SI: 768,486/96,060/96,062    |
|            |                |              | Complex SI: 3,754/1,850/13,080      |
| GO         | 43,303         | 11,456       | Atomic SI: 772,870/96,608/96,610    |
|            |                |              | Complex SI: 72,318/9,040/9,040      |
| MNLI       | -              | -            | biMNLI: 235,622/26,180/12,906       |

Usage

The OntoLAMA datasets can be accessed in two ways: downloaded directly from Zenodo, or loaded through the Huggingface Datasets platform.

If using Huggingface, users should first install the datasets package:

pip install datasets

Then, a dataset can be accessed by:

from datasets import load_dataset
# dataset = load_dataset("krr-oxford/OntoLAMA", dataset_name)
# for example, loading the Complex SI dataset of GO
dataset = load_dataset("krr-oxford/OntoLAMA", "go-complex-SI") 

Options of dataset_name include:

  • "bimnli" (from MNLI)
  • "schemaorg-atomic-SI" (from Schema.org)
  • "doid-atomic-SI" (from DOID)
  • "foodon-atomic-SI", "foodon-complex-SI" (from FoodOn)
  • "go-atomic-SI", "go-complex-SI" (from GO)

After loading the dataset, a particular data split can be accessed by:

dataset[split_name]  # split_name = "train", "validation", or "test"

Please refer to the Huggingface page for examples of data points and explanations of data fields.

If downloading from Zenodo, users can directly download the specific .jsonl files they need.
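A downloaded split file can be read with the standard json module, one JSON object per line. The path and any field names are hypothetical here; see the Huggingface page for the actual data fields:

```python
import json

# Minimal sketch for reading a Zenodo .jsonl split file; the commented path
# below is a hypothetical example, not a guaranteed location.
def read_jsonl(path: str) -> list[dict]:
    """Load a .jsonl file: one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# examples = read_jsonl("go-complex-SI/test.jsonl")  # hypothetical path
```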

Prompt-based Probing

\(\textsf{OntoLAMA}\) adopts the prompt-based probing approach to examine an LM's knowledge. Specifically, it wraps the verbalised sub-concept and super-concept into a template with a masked position; the LM is expected to predict the masked token and determine whether there exists a subsumption relationship between the two concepts.
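The sketch below shows how such a masked prompt can be assembled from a premise–hypothesis pair. The template string and label words are illustrative assumptions; the actual ones ship with DeepOnto in si_templates.txt and label_words.jsonl:

```python
# Illustrative sketch of prompt construction for probing; the template and
# label words below are assumptions, not the files shipped with DeepOnto.
TEMPLATE = "{premise} [MASK], {hypothesis}"
LABEL_WORDS = {"Yes": "entailment", "No": "contradiction"}  # hypothetical mapping

def build_prompt(premise: str, hypothesis: str) -> str:
    """Fit a verbalised concept pair into a masked NLI-style template."""
    return TEMPLATE.format(premise=premise, hypothesis=hypothesis)

prompt = build_prompt("X is a lung disease.", "X is a disease.")
# The LM predicts the token at the [MASK] position; mapping the predicted
# token through LABEL_WORDS yields the subsumption decision.
```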

The verbalisation algorithm has been implemented as a separate ontology processing module, see verbalise ontology concepts.

To conduct probing, users can write the following code into a script, e.g., probing.py:

from openprompt.config import get_config
from deeponto.complete.ontolama import run_inference

config, args = get_config()
# you can then adjust the configuration before running the inference
config.learning_setting = "few_shot"  # or "zero_shot", "full"
config.manual_template.choice = 0  # use the first template in the template file
...

# run the subsumption inference
run_inference(config, args)

Then, run the script with the following command:

python probing.py --config_yaml config.yaml

See an example of config.yaml at DeepOnto/scripts/ontolama/config.yaml.

The template file for the SI task (two templates) is located in DeepOnto/scripts/ontolama/si_templates.txt.

The template file for the biMNLI task (two templates) is located in DeepOnto/scripts/ontolama/nli_templates.txt.

The label word file for both SI and biMNLI tasks is located in DeepOnto/scripts/ontolama/label_words.jsonl.


Last update: March 18, 2024
Created: March 10, 2023
GitHub: @Lawhy   Personal Page: yuanhe.wiki