
BERTMap

Paper

\(\textsf{BERTMap}\) is proposed in the paper: BERTMap: A BERT-based Ontology Alignment System (AAAI-2022).

@inproceedings{he2022bertmap,
    title={BERTMap: a BERT-based ontology alignment system},
    author={He, Yuan and Chen, Jiaoyan and Antonyrajah, Denvar and Horrocks, Ian},
    booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={36},
    number={5},
    pages={5684--5691},
    year={2022}
}

\(\textsf{BERTMap}\) is a BERT-based ontology matching (OM) system consisting of the following components:

  • Text semantics corpora construction from input ontologies, and optionally from input mappings and other auxiliary ontologies.
  • BERT synonym classifier training on synonym and non-synonym samples in text semantics corpora.
  • Sub-word Inverted Index construction from the tokenised class annotations for candidate selection in mapping prediction.
  • Mapping Predictor, which integrates a simple edit-distance-based string matching module and the fine-tuned BERT synonym classifier for mapping scoring. For each source ontology class, target class candidates are first narrowed down using the sub-word inverted index; string matching is applied to the "easy" mappings, and BERT matching is applied to the rest.
  • Mapping Refiner, which consists of the mapping extension and mapping repair modules. Mapping extension is an iterative process based on the locality principle. Mapping repair utilises LogMap's debugger.

\(\textsf{BERTMapLt}\) is a lightweight version of \(\textsf{BERTMap}\) without the BERT module and mapping refiner.

See the tutorial for \(\textsf{BERTMap}\) here.
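A minimal usage sketch follows; the ontology file names are hypothetical, and the import paths are assumptions based on the source layout shown on this page:

from deeponto.onto import Ontology  # assumed import path for the Ontology class
from deeponto.align.bertmap import BERTMapPipeline

# NOTE: depending on the DeepOnto version, the JVM backing the OWL API may need to be
# initialised before loading ontologies (see the tutorial linked above).
config = BERTMapPipeline.load_bertmap_config()  # default configuration, or pass a custom .yaml path
src_onto = Ontology("src_onto.owl")  # hypothetical source ontology file
tgt_onto = Ontology("tgt_onto.owl")  # hypothetical target ontology file

# Constructing the pipeline runs everything: corpora construction, BERT fine-tuning
# (for bertmap), mapping prediction, and mapping extension + repair.
BERTMapPipeline(src_onto, tgt_onto, config)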

BERTMapPipeline(src_onto, tgt_onto, config)

Class for the whole ontology alignment pipeline of \(\textsf{BERTMap}\) and \(\textsf{BERTMapLt}\) models.

Note

Parameters related to BERT training are None by default. They will be constructed for \(\textsf{BERTMap}\) and stay as None for \(\textsf{BERTMapLt}\).
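For example, a sketch using the configuration keys shown in the source below:

from deeponto.align.bertmap import BERTMapPipeline

config = BERTMapPipeline.load_bertmap_config()  # default configuration
config.model = "bertmaplt"                      # "bertmap" (default) or "bertmaplt"
# Settings under config.bert (fine-tuning epochs, batch sizes, etc.) are only used when model == "bertmap".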

Attributes:

Name Type Description
config CfgNode

The configuration for BERTMap or BERTMapLt.

name str

The name of the model, either bertmap or bertmaplt.

output_path str

The path to the output directory.

src_onto Ontology

The source ontology to be matched.

tgt_onto Ontology

The target ontology to be matched.

annotation_property_iris List[str]

The annotation property IRIs used for extracting synonyms and nonsynonyms.

src_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from src_onto according to annotation_property_iris.

tgt_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from tgt_onto according to annotation_property_iris.

known_mappings List[ReferenceMapping]

List of known mappings for constructing the cross-ontology corpus.

auxiliary_ontos List[Ontology]

List of auxiliary ontologies for constructing any auxiliary corpus.

corpora dict

A dictionary that stores the summary of built text semantics corpora and the sampled synonyms and nonsynonyms.

finetune_data dict

A dictionary that stores the training and validation splits of samples from corpora.

bert BERTSynonymClassifier

A BERT model for synonym classification and mapping prediction.

best_checkpoint str

The path to the best BERT checkpoint which will be loaded after training.

mapping_predictor MappingPredictor

The predictor function based on class annotations, used for global matching or mapping scoring.

Parameters:

Name Type Description Default
src_onto Ontology

The source ontology for alignment.

required
tgt_onto Ontology

The target ontology for alignment.

required
config CfgNode

The configuration for BERTMap or BERTMapLt.

required
Source code in src/deeponto/align/bertmap/pipeline.py
def __init__(self, src_onto: Ontology, tgt_onto: Ontology, config: CfgNode):
    """Initialise the BERTMap or BERTMapLt model.

    Args:
        src_onto (Ontology): The source ontology for alignment.
        tgt_onto (Ontology): The target ontology for alignment.
        config (CfgNode): The configuration for BERTMap or BERTMapLt.
    """
    # load the configuration and confirm model name is valid
    self.config = config
    self.name = self.config.model
    if not self.name in MODEL_OPTIONS.keys():
        raise RuntimeError(f"`model` {self.name} in the config file is not one of the supported.")

    # create the output directory, e.g., experiments/bertmap
    self.config.output_path = "." if not self.config.output_path else self.config.output_path
    self.config.output_path = os.path.abspath(self.config.output_path)
    self.output_path = os.path.join(self.config.output_path, self.name)
    create_path(self.output_path)

    # create logger and progress manager (hidden attribute) 
    self.logger = create_logger(self.name, self.output_path)
    self.enlighten_manager = enlighten.get_manager()

    # ontology
    self.src_onto = src_onto
    self.tgt_onto = tgt_onto
    self.annotation_property_iris = self.config.annotation_property_iris
    self.logger.info(f"Load the following configurations:\n{print_dict(self.config)}")
    config_path = os.path.join(self.output_path, "config.yaml")
    self.logger.info(f"Save the configuration file at {config_path}.")
    self.save_bertmap_config(self.config, config_path)

    # build the annotation thesaurus
    self.src_annotation_index, _ = self.src_onto.build_annotation_index(self.annotation_property_iris, apply_lowercasing=True)
    self.tgt_annotation_index, _ = self.tgt_onto.build_annotation_index(self.annotation_property_iris, apply_lowercasing=True)
    if (not self.src_annotation_index) or (not self.tgt_annotation_index):
        raise RuntimeError("No class annotations found in input ontologies; unable to produce alignment.")

    # provided mappings if any
    self.known_mappings = self.config.known_mappings
    if self.known_mappings:
        self.known_mappings = ReferenceMapping.read_table_mappings(self.known_mappings)

    # auxiliary ontologies if any
    self.auxiliary_ontos = self.config.auxiliary_ontos
    if self.auxiliary_ontos:
        self.auxiliary_ontos = [Ontology(ao) for ao in self.auxiliary_ontos]

    self.data_path = os.path.join(self.output_path, "data")
    # load or construct the corpora
    self.corpora_path = os.path.join(self.data_path, "text-semantics.corpora.json")
    self.corpora = self.load_text_semantics_corpora()

    # load or construct fine-tune data
    self.finetune_data_path = os.path.join(self.data_path, "fine-tune.data.json")
    self.finetune_data = self.load_finetune_data()

    # load the bert model and train
    self.bert_config = self.config.bert
    self.bert_pretrained_path = self.bert_config.pretrained_path
    self.bert_finetuned_path = os.path.join(self.output_path, "bert")
    self.bert_resume_training = self.bert_config.resume_training
    self.bert_synonym_classifier = None
    self.best_checkpoint = None
    if self.name == "bertmap":
        self.bert_synonym_classifier = self.load_bert_synonym_classifier()
        # train if the loaded classifier is not in eval mode
        if self.bert_synonym_classifier.eval_mode == False:
            self.logger.info(
                f"Data statistics:\n \
                {print_dict(self.bert_synonym_classifier.data_stat)}"
            )
            self.bert_synonym_classifier.train(self.bert_resume_training)
            # turn on eval mode after training
            self.bert_synonym_classifier.eval()
        # NOTE potential redundancy here: after training, load the best checkpoint
        self.best_checkpoint = self.load_best_checkpoint()
        if not self.best_checkpoint:
            raise RuntimeError(f"No best checkpoint found for the BERT synonym classifier model.")
        self.logger.info(f"Fine-tuning finished, found best checkpoint at {self.best_checkpoint}.")
    else:
        self.logger.info(f"No training needed; skip BERT fine-tuning.")

    # pretty progress bar tracking
    self.enlighten_status = self.enlighten_manager.status_bar(
        status_format=u'Global Matching{fill}Stage: {demo}{fill}{elapsed}',
        color='bold_underline_bright_white_on_lightslategray',
        justify=enlighten.Justify.CENTER, demo='Initializing',
        autorefresh=True, min_delta=0.5
    )

    # mapping predictions
    self.global_matching_config = self.config.global_matching

    # build ignored class index for OAEI
    self.ignored_class_index = None  
    if self.global_matching_config.for_oaei:
        self.ignored_class_index = defaultdict(lambda: False)
        for src_class_iri, src_class in self.src_onto.owl_classes.items():
            use_in_alignment = self.src_onto.get_annotations(src_class, "http://oaei.ontologymatching.org/bio-ml/ann/use_in_alignment")
            if use_in_alignment and str(use_in_alignment[0]).lower() == "false":
                self.ignored_class_index[src_class_iri] = True
        for tgt_class_iri, tgt_class in self.tgt_onto.owl_classes.items():
            use_in_alignment = self.tgt_onto.get_annotations(tgt_class, "http://oaei.ontologymatching.org/bio-ml/ann/use_in_alignment")
            if use_in_alignment and str(use_in_alignment[0]).lower() == "false":
                self.ignored_class_index[tgt_class_iri] = True

    self.mapping_predictor = MappingPredictor(
        output_path=self.output_path,
        tokenizer_path=self.bert_config.pretrained_path,
        src_annotation_index=self.src_annotation_index,
        tgt_annotation_index=self.tgt_annotation_index,
        bert_synonym_classifier=self.bert_synonym_classifier,
        num_raw_candidates=self.global_matching_config.num_raw_candidates,
        num_best_predictions=self.global_matching_config.num_best_predictions,
        batch_size_for_prediction=self.bert_config.batch_size_for_prediction,
        logger=self.logger,
        enlighten_manager=self.enlighten_manager,
        enlighten_status=self.enlighten_status,
        ignored_class_index=self.ignored_class_index,
    )
    self.mapping_refiner = None

    # if global matching is disabled (potentially used for class pair scoring)
    if self.config.global_matching.enabled:
        self.mapping_predictor.mapping_prediction()  # mapping prediction
        if self.name == "bertmap":
            self.mapping_refiner = MappingRefiner(
                output_path=self.output_path,
                src_onto=self.src_onto,
                tgt_onto=self.tgt_onto,
                mapping_predictor=self.mapping_predictor,
                mapping_extension_threshold=self.global_matching_config.mapping_extension_threshold,
                mapping_filtered_threshold=self.global_matching_config.mapping_filtered_threshold,
                logger=self.logger,
                enlighten_manager=self.enlighten_manager,
                enlighten_status=self.enlighten_status
            )
            self.mapping_refiner.mapping_extension()  # mapping extension
            self.mapping_refiner.mapping_repair()  # mapping repair
        self.enlighten_status.update(demo="Finished")  
    else:
        self.enlighten_status.update(demo="Skipped")  

    self.enlighten_status.close()

load_or_construct(data_file, data_name, construct_func, *args, **kwargs)

Load existing data or construct a new one.

An auxiliary function that checks for the existence of a data file and loads it if it exists. Otherwise, it constructs new data with the input construct_func, which is supposed to generate a local data file.

Source code in src/deeponto/align/bertmap/pipeline.py
def load_or_construct(self, data_file: str, data_name: str, construct_func: Callable, *args, **kwargs):
    """Load existing data or construct a new one.

    An auxiliary function that checks the existence of a data file and loads it if it exists.
    Otherwise, it constructs new data with the input `construct_func`, which is supposed to generate
    a local data file.
    """
    if os.path.exists(data_file):
        self.logger.info(f"Load existing {data_name} from {data_file}.")
    else:
        self.logger.info(f"Construct new {data_name} and save at {data_file}.")
        construct_func(*args, **kwargs)
    # load the data file that is supposed to be saved locally
    return load_file(data_file)

load_text_semantics_corpora()

Load or construct text semantics corpora.

See TextSemanticsCorpora.

Source code in src/deeponto/align/bertmap/pipeline.py
def load_text_semantics_corpora(self):
    """Load or construct text semantics corpora.

    See [`TextSemanticsCorpora`][deeponto.align.bertmap.text_semantics.TextSemanticsCorpora].
    """
    data_name = "text semantics corpora"

    if self.name == "bertmap":

        def construct():
            corpora = TextSemanticsCorpora(
                src_onto=self.src_onto,
                tgt_onto=self.tgt_onto,
                annotation_property_iris=self.annotation_property_iris,
                class_mappings=self.known_mappings,
                auxiliary_ontos=self.auxiliary_ontos,
            )
            self.logger.info(str(corpora))
            corpora.save(self.data_path)

        return self.load_or_construct(self.corpora_path, data_name, construct)

    self.logger.info(f"No training needed; skip the construction of {data_name}.")
    return None

load_finetune_data()

Load or construct fine-tuning data from text semantics corpora.

Steps of constructing fine-tuning data from text semantics:

  1. Mix synonym and nonsynonym data.
  2. Randomly sample 90% as training samples and 10% as validation.
Source code in src/deeponto/align/bertmap/pipeline.py
def load_finetune_data(self):
    r"""Load or construct fine-tuning data from text semantics corpora.

    Steps of constructing fine-tuning data from text semantics:

    1. Mix synonym and nonsynonym data.
    2. Randomly sample 90% as training samples and 10% as validation.
    """
    data_name = "fine-tuning data"

    if self.name == "bertmap":

        def construct():
            finetune_data = dict()
            samples = self.corpora["synonyms"] + self.corpora["nonsynonyms"]
            random.shuffle(samples)
            split_index = int(0.9 * len(samples))  # split at 90%
            finetune_data["training"] = samples[:split_index]
            finetune_data["validation"] = samples[split_index:]
            save_file(finetune_data, self.finetune_data_path)

        return self.load_or_construct(self.finetune_data_path, data_name, construct)

    self.logger.info(f"No training needed; skip the construction of {data_name}.")
    return None

load_bert_synonym_classifier()

Load the BERT model from a pre-trained or a local checkpoint.

  • If loaded from a pre-trained checkpoint, training will start from that pre-trained model (e.g., bert-uncased).
  • If loaded from a local checkpoint, the eval mode is turned on for mapping predictions.
  • If self.bert_resume_training is True, it will be loaded from the latest saved checkpoint.
Source code in src/deeponto/align/bertmap/pipeline.py
def load_bert_synonym_classifier(self):
    """Load the BERT model from a pre-trained or a local checkpoint.

    - If loaded from pre-trained, it means to start training from a pre-trained model such as `bert-uncased`.
    - If loaded from local, turn on the `eval` mode for mapping predictions.
    - If `self.bert_resume_training` is `True`, it will be loaded from the latest saved checkpoint.
    """
    checkpoint = self.load_best_checkpoint()  # load the best checkpoint or nothing
    eval_mode = True
    # if no checkpoint has been found, start training from scratch OR resume training
    # no point to load the best checkpoint if resume training (will automatically search for the latest checkpoint)
    if not checkpoint or self.bert_resume_training:
        checkpoint = self.bert_pretrained_path
        eval_mode = False  # since it is for training now

    return BERTSynonymClassifier(
        loaded_path=checkpoint,
        output_path=self.bert_finetuned_path,
        eval_mode=eval_mode,
        max_length_for_input=self.bert_config.max_length_for_input,
        num_epochs_for_training=self.bert_config.num_epochs_for_training,
        batch_size_for_training=self.bert_config.batch_size_for_training,
        batch_size_for_prediction=self.bert_config.batch_size_for_prediction,
        training_data=self.finetune_data["training"],
        validation_data=self.finetune_data["validation"],
    )

load_best_checkpoint()

Find the best checkpoint by searching for trainer states in each checkpoint file.

Source code in src/deeponto/align/bertmap/pipeline.py
def load_best_checkpoint(self) -> Optional[str]:
    """Find the best checkpoint by searching for trainer states in each checkpoint file."""
    best_checkpoint = -1

    if os.path.exists(self.bert_finetuned_path):
        for file in os.listdir(self.bert_finetuned_path):
            # load trainer states from each checkpoint file
            if file.startswith("checkpoint"):
                trainer_state = load_file(
                    os.path.join(self.bert_finetuned_path, file, "trainer_state.json")
                )
                checkpoint = int(trainer_state["best_model_checkpoint"].split("/")[-1].split("-")[-1])
                # find the latest best checkpoint
                if checkpoint > best_checkpoint:
                    best_checkpoint = checkpoint

    if best_checkpoint == -1:
        best_checkpoint = None
    else:
        best_checkpoint = os.path.join(self.bert_finetuned_path, f"checkpoint-{best_checkpoint}")

    return best_checkpoint

load_bertmap_config(config_file=None) staticmethod

Load the BERTMap configuration in .yaml. If the file is not provided, use the default configuration.

Source code in src/deeponto/align/bertmap/pipeline.py
@staticmethod
def load_bertmap_config(config_file: Optional[str] = None):
    """Load the BERTMap configuration in `.yaml`. If the file
    is not provided, use the default configuration.
    """
    if not config_file:
        config_file = DEFAULT_CONFIG_FILE
        print(f"Use the default configuration at {DEFAULT_CONFIG_FILE}.")  
    if not config_file.endswith(".yaml"):
        raise RuntimeError("Configuration file should be in `yaml` format.")
    return CfgNode(load_file(config_file))

save_bertmap_config(config, config_file) staticmethod

Save the BERTMap configuration in .yaml.

Source code in src/deeponto/align/bertmap/pipeline.py
@staticmethod
def save_bertmap_config(config: CfgNode, config_file: str):
    """Save the BERTMap configuration in `.yaml`."""
    with open(config_file, "w") as c:
        config.dump(stream=c, sort_keys=False, default_flow_style=False)

AnnotationThesaurus(onto, annotation_property_iris, apply_transitivity=False)

A thesaurus class for synonyms and non-synonyms extracted from an ontology.

Some related definitions of arguments here:

  • A synonym_group is a set of annotation phrases that are synonymous to each other;
  • The transitivity of synonyms means that if A and B are synonymous and B and C are synonymous, then A and C are synonymous. This is achieved by a connected-graph-based algorithm.
  • A synonym_pair is a pair of synonymous annotation phrases, which can be extracted from the cartesian product of a synonym_group with itself (see the short example after this list). NOTE that reflexivity and symmetry are preserved, meaning that (i) every phrase A is a synonym of itself and (ii) if (A, B) is a synonym pair then (B, A) is a synonym pair, too.
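For illustration, a minimal sketch; the labels and the direct import from the text_semantics module are assumptions:

from deeponto.align.bertmap.text_semantics import AnnotationThesaurus  # assumed import path

group = {"heart attack", "myocardial infarction"}  # a hypothetical synonym group
pairs = AnnotationThesaurus.get_synonym_pairs(group)
# The self-product keeps reflexive and symmetric pairs (order may vary), e.g.:
# ('heart attack', 'heart attack'), ('heart attack', 'myocardial infarction'),
# ('myocardial infarction', 'heart attack'), ('myocardial infarction', 'myocardial infarction')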

Attributes:

Name Type Description
onto Ontology

An ontology to construct the annotation thesaurus from.

annotation_index Dict[str, Set[str]]

An index of the class annotations with (class_iri, annotations) pairs.

annotation_property_iris List[str]

A list of annotation property IRIs used to extract the annotations.

average_number_of_annotations_per_class int

The average number of (extracted) annotations per ontology class.

apply_transitivity bool

Apply synonym transitivity to merge synonym groups or not.

synonym_groups List[Set[str]]

The list of synonym groups extracted from the ontology according to specified annotation properties.

Parameters:

Name Type Description Default
onto Ontology

The input ontology to extract annotations from.

required
annotation_property_iris List[str]

Specify which annotation properties to be used.

required
apply_transitivity bool

Apply synonym transitivity to merge synonym groups or not. Defaults to False.

False
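A small construction sketch; the ontology file, the chosen annotation property, and the import paths are assumptions for illustration:

from deeponto.onto import Ontology  # assumed import path
from deeponto.align.bertmap.text_semantics import AnnotationThesaurus  # assumed import path

onto = Ontology("example.owl")  # hypothetical ontology file
label_iri = "http://www.w3.org/2000/01/rdf-schema#label"
thesaurus = AnnotationThesaurus(onto, [label_iri], apply_transitivity=False)
print(thesaurus.info)  # summary: number of synonym groups, average annotations per class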
Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(self, onto: Ontology, annotation_property_iris: List[str], apply_transitivity: bool = False):
    r"""Initialise a thesaurus for ontology class annotations.

    Args:
        onto (Ontology): The input ontology to extract annotations from.
        annotation_property_iris (List[str]): Specify which annotation properties to be used.
        apply_transitivity (bool, optional): Apply synonym transitivity to merge synonym groups or not. Defaults to `False`.
    """

    self.onto = onto
    # build the annotation index to extract synonyms from `onto`
    # the input property iris may not exist in this ontology
    # the output property iris will be truncated to the existing ones
    index, iris = self.onto.build_annotation_index(
        annotation_property_iris=annotation_property_iris,
        entity_type="Classes",
        apply_lowercasing=True,
    )
    self.annotation_index = index
    self.annotation_property_iris = iris
    total_number_of_annotations = sum([len(v) for v in self.annotation_index.values()])
    self.average_number_of_annotations_per_class = total_number_of_annotations / len(self.annotation_index)

    # synonym groups
    self.apply_transitivity = apply_transitivity
    self.synonym_groups = list(self.annotation_index.values())
    if self.apply_transitivity:
        self.synonym_groups = self.merge_synonym_groups_by_transitivity(self.synonym_groups)

    # summary
    self.info = {
        type(self).__name__: {
            "ontology": self.onto.info[type(self.onto).__name__],
            "average_number_of_annotations_per_class": round(self.average_number_of_annotations_per_class, 3),
            "number_of_synonym_groups": len(self.synonym_groups),
        }
    }

get_synonym_pairs(synonym_group, remove_duplicates=True) staticmethod

Get synonym pairs from a synonym group through a cartesian product.

Parameters:

Name Type Description Default
synonym_group Set[str]

A set of annotation phrases that are synonymous to each other.

required

Returns:

Type Description
List[Tuple[str, str]]

A list of synonym pairs.

Source code in src/deeponto/align/bertmap/text_semantics.py
@staticmethod
def get_synonym_pairs(synonym_group: Set[str], remove_duplicates: bool = True):
    """Get synonym pairs from a synonym group through a cartesian product.

    Args:
        synonym_group (Set[str]): A set of annotation phrases that are synonymous to each other.

    Returns:
        (List[Tuple[str, str]]): A list of synonym pairs.
    """
    synonym_pairs = list(itertools.product(synonym_group, synonym_group))
    if remove_duplicates:
        return uniqify(synonym_pairs)
    else:
        return synonym_pairs

merge_synonym_groups_by_transitivity(synonym_groups) staticmethod

Merge synonym groups by transitivity.

Synonym groups that share a common annotation phrase will be merged. NOTE that for multiple ontologies, we can merge their synonym groups by first concatenating them and then using this function.

Note

In \(\textsf{BERTMap}\) experiments, we have considered this as a data augmentation approach, but it does not bring a significant performance improvement. However, if the overall number of annotations is not large enough, this could be a good option.

Parameters:

Name Type Description Default
synonym_groups List[Set[str]]

A sequence of synonym groups to be merged.

required

Returns:

Type Description
List[Set[str]]

A list of merged synonym groups.

Source code in src/deeponto/align/bertmap/text_semantics.py
@staticmethod
def merge_synonym_groups_by_transitivity(synonym_groups: List[Set[str]]):
    r"""Merge synonym groups by transitivity.

    Synonym groups that share a common annotation phrase will be merged. NOTE that for
    multiple ontologies, we can merge their synonym groups by first concatenating them
    and then using this function.

    !!! note

        In $\textsf{BERTMap}$ experiments we have considered this as a data augmentation approach
        but it does not bring a significant performance improvement. However, if the
        overall number of annotations is not large enough then this could be a good option.

    Args:
        synonym_groups (List[Set[str]]): A sequence of synonym groups to be merged.

    Returns:
        (List[Set[str]]): A list of merged synonym groups.
    """
    synonym_pairs = []
    for synonym_group in synonym_groups:
        # gather synonym pairs from the self-product of a synonym group
        synonym_pairs += AnnotationThesaurus.get_synonym_pairs(synonym_group, remove_duplicates=False)
    synonym_pairs = uniqify(synonym_pairs)
    merged_grouped_synonyms = AnnotationThesaurus.connected_annotations(synonym_pairs)
    return merged_grouped_synonyms

connected_annotations(synonym_pairs) staticmethod

Build a graph for adjacency among the class annotations (labels) such that the transitivity of synonyms is ensured.

Auxiliary function for merge_synonym_groups_by_transitivity.

Parameters:

Name Type Description Default
synonym_pairs List[Tuple[str, str]]

List of pairs of phrases that are synonymous.

required

Returns:

Type Description
List[Set[str]]

A list of synonym groups.

Source code in src/deeponto/align/bertmap/text_semantics.py
@staticmethod
def connected_annotations(synonym_pairs: List[Tuple[str, str]]):
    """Build a graph for adjacency among the class annotations (labels) such that
    the **transitivity** of synonyms is ensured.

    Auxiliary function for [`merge_synonym_groups_by_transitivity`][deeponto.align.bertmap.text_semantics.AnnotationThesaurus.merge_synonym_groups_by_transitivity].

    Args:
        synonym_pairs (List[Tuple[str, str]]): List of pairs of phrases that are synonymous.

    Returns:
        (List[Set[str]]): A list of synonym groups.
    """
    graph = nx.Graph()
    graph.add_edges_from(synonym_pairs)
    # nx.draw(G, with_labels = True)
    connected = list(nx.connected_components(graph))
    return connected

synonym_sampling(num_samples=None)

Sample synonym pairs from a list of synonym groups extracted from the input ontology.

According to the \(\textsf{BERTMap}\) paper, synonyms are defined as label pairs that belong to the same ontology class.

NOTE that this has been validated to produce the same results as the original \(\textsf{BERTMap}\) repository.

Parameters:

Name Type Description Default
num_samples int

The (maximum) number of unique samples extracted. Defaults to None.

None

Returns:

Type Description
List[Tuple[str, str]]

A list of unique synonym pair samples.

Source code in src/deeponto/align/bertmap/text_semantics.py
def synonym_sampling(self, num_samples: Optional[int] = None):
    r"""Sample synonym pairs from a list of synonym groups extracted from the input ontology.

    According to the $\textsf{BERTMap}$ paper, **synonyms** are defined as label pairs that belong
    to the same ontology class.

    NOTE this has been validated for getting the same results as in the original $\textsf{BERTMap}$ repository.

    Args:
        num_samples (int, optional): The (maximum) number of **unique** samples extracted. Defaults to `None`.

    Returns:
        (List[Tuple[str, str]]): A list of unique synonym pair samples.
    """
    synonym_pool = []
    for synonym_group in self.synonym_groups:
        # do not remove duplicates in the loop to save time
        synonym_pairs = self.get_synonym_pairs(synonym_group, remove_duplicates=False)
        synonym_pool += synonym_pairs
    # remove duplicates after the loop
    synonym_pool = uniqify(synonym_pool)

    if (not num_samples) or (num_samples >= len(synonym_pool)):
        # print("Return all synonym pairs without downsampling.")
        return synonym_pool
    else:
        return random.sample(synonym_pool, num_samples)

soft_nonsynonym_sampling(num_samples, max_iter=5)

Sample soft non-synonyms from a list of synonym groups extracted from the input ontology.

According to the \(\textsf{BERTMap}\) paper, soft non-synonyms are defined as label pairs from two different synonym groups that are randomly selected.

Parameters:

Name Type Description Default
num_samples int

The (maximum) number of unique samples extracted; this is required unlike for synonym sampling because the non-synonym pool is significantly larger (considering random combinations of different synonym groups).

required
max_iter int

The maximum number of iterations for conducting sampling. Defaults to 5.

5

Returns:

Type Description
List[Tuple[str, str]]

A list of unique (soft) non-synonym pair samples.

Source code in src/deeponto/align/bertmap/text_semantics.py
def soft_nonsynonym_sampling(self, num_samples: int, max_iter: int = 5):
    r"""Sample **soft** non-synonyms from a list of synonym groups extracted from the input ontology.

    According to the $\textsf{BERTMap}$ paper, **soft non-synonyms** are defined as label pairs
    from two *different* synonym groups that are **randomly** selected.

    Args:
        num_samples (int): The (maximum) number of **unique** samples extracted; this is
            required **unlike for synonym sampling** because the non-synonym pool is **significantly
            larger** (considering random combinations of different synonym groups).
        max_iter (int): The maximum number of iterations for conducting sampling. Defaults to `5`.

    Returns:
        (List[Tuple[str, str]]): A list of unique (soft) non-synonym pair samples.
    """
    nonsynonym_pool = []
    # randomly select disjoint synonym group pairs from all
    for _ in range(num_samples):
        left_synonym_group, right_synonym_group = tuple(random.sample(self.synonym_groups, 2))
        try:
            # randomly choose one label from a synonym group
            left_label = random.choice(list(left_synonym_group))
            right_label = random.choice(list(right_synonym_group))
            nonsynonym_pool.append((left_label, right_label))
        except:
            # skip if there are no class labels
            continue

    # DataUtils.uniqify is too slow so we should avoid operating it too often
    nonsynonym_pool = uniqify(nonsynonym_pool)

    while len(nonsynonym_pool) < num_samples and max_iter > 0:
        max_iter = max_iter - 1  # reduce the iteration to prevent exhausting loop
        nonsynonym_pool += self.soft_nonsynonym_sampling(num_samples - len(nonsynonym_pool), max_iter)
        nonsynonym_pool = uniqify(nonsynonym_pool)

    return nonsynonym_pool

weighted_random_choices_of_sibling_groups(k=1)

Randomly (weighted) select a number of sibling class groups.

The weights are computed according to the sizes of the sibling class groups.

Source code in src/deeponto/align/bertmap/text_semantics.py
def weighted_random_choices_of_sibling_groups(self, k: int = 1):
    """Randomly (weighted) select a number of sibling class groups.

    The weights are computed according to the sizes of the sibling class groups.
    """
    weights = [len(s) for s in self.onto.sibling_class_groups]
    weights = [w / sum(weights) for w in weights]  # normalised
    return random.choices(self.onto.sibling_class_groups, weights=weights, k=k)

hard_nonsynonym_sampling(num_samples, max_iter=5)

Sample hard non-synonyms from sibling classes of the input ontology.

According to the \(\textsf{BERTMap}\) paper, hard non-synonyms are defined as label pairs that belong to two disjoint ontology classes. For practical reasons, the condition is relaxed to two sibling ontology classes.

Parameters:

Name Type Description Default
num_samples int

The (maximum) number of unique samples extracted; this is required unlike for synonym sampling because the non-synonym pool is significantly larger (considering random combinations of different synonym groups).

required
max_iter int

The maximum number of iterations for conducting sampling. Defaults to 5.

5

Returns:

Type Description
List[Tuple[str, str]]

A list of unique (hard) non-synonym pair samples.

Source code in src/deeponto/align/bertmap/text_semantics.py
def hard_nonsynonym_sampling(self, num_samples: int, max_iter: int = 5):
    r"""Sample **hard** non-synonyms from sibling classes of the input ontology.

    According to the $\textsf{BERTMap}$ paper, **hard non-synonyms** are defined as label pairs
    that belong to two **disjoint** ontology classes. For practical reason, the condition
    is eased to two **sibling** ontology classes.

    Args:
        num_samples (int): The (maximum) number of **unique** samples extracted; this is
            required **unlike for synonym sampling** because the non-synonym pool is **significantly
            larger** (considering random combinations of different synonym groups).
        max_iter (int): The maximum number of iterations for conducting sampling. Defaults to `5`.

    Returns:
        (List[Tuple[str, str]]): A list of unique (hard) non-synonym pair samples.
    """
    # initialise the sibling class groups
    self.onto.sibling_class_groups

    if not self.onto.sibling_class_groups:
        warnings.warn("Skip hard negative sampling as no sibling class groups are defined.")
        return []

    # flatten the disjointness groups into all pairs of hard negatives
    nonsynonym_pool = []
    # randomly (weighted) select a number of sibling class groups with replacement
    sibling_class_groups = self.weighted_random_choices_of_sibling_groups(k=num_samples)

    for sibling_class_group in sibling_class_groups:
        # randomly select two sibling classes; no weights this time
        left_class_iri, right_class_iri = tuple(random.sample(sibling_class_group, 2))
        try:
            # randomly select a label for each of them
            left_label = random.choice(list(self.annotation_index[left_class_iri]))
            right_label = random.choice(list(self.annotation_index[right_class_iri]))
            # add the label pair to the pool
            nonsynonym_pool.append((left_label, right_label))
        except:
            # skip them if there are no class labels
            continue

    # DataUtils.uniqify is too slow so we should avoid operating it too often
    nonsynonym_pool = uniqify(nonsynonym_pool)

    while len(nonsynonym_pool) < num_samples and max_iter > 0:
        max_iter = max_iter - 1  # reduce the iteration to prevent exhausting loop
        nonsynonym_pool += self.hard_nonsynonym_sampling(num_samples - len(nonsynonym_pool), max_iter)
        nonsynonym_pool = uniqify(nonsynonym_pool)

    return nonsynonym_pool

IntraOntologyTextSemanticsCorpus(onto, annotation_property_iris, soft_negative_ratio=2, hard_negative_ratio=2)

Class for creating the intra-ontology text semantics corpus from an ontology.

As defined in the \(\textsf{BERTMap}\) paper, the intra-ontology text semantics corpus consists of synonym and non-synonym pairs extracted from the ontology class annotations.

Attributes:

Name Type Description
onto Ontology

An ontology to construct the intra-ontology text semantics corpus from.

annotation_property_iris List[str]

Specify which annotation properties to be used.

soft_negative_ratio int

The expected negative sample ratio of the soft non-synonyms to the extracted synonyms. Defaults to 2.

hard_negative_ratio int

The expected negative sample ratio of the hard non-synonyms to the extracted synonyms. Defaults to 2. However, hard non-synonyms are sometimes insufficient given an ontology's hierarchy; in that case, soft non-synonyms are used to make up the shortfall.
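A brief usage sketch; the ontology file, the chosen annotation property, the output directory, and the import paths are assumptions for illustration:

from deeponto.onto import Ontology  # assumed import path
from deeponto.align.bertmap.text_semantics import IntraOntologyTextSemanticsCorpus  # assumed import path

onto = Ontology("example.owl")  # hypothetical ontology file
label_iri = "http://www.w3.org/2000/01/rdf-schema#label"
corpus = IntraOntologyTextSemanticsCorpus(onto, [label_iri])  # default ratios: 2 soft + 2 hard negatives per synonym
corpus.save("experiments/bertmap/data")  # writes intra-onto.corpus.json with label pairs and a summary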

Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(
    self,
    onto: Ontology,
    annotation_property_iris: List[str],
    soft_negative_ratio: int = 2,
    hard_negative_ratio: int = 2,
):
    self.onto = onto
    # $\textsf{BERTMap}$ does not apply synonym transitivity
    self.thesaurus = AnnotationThesaurus(onto, annotation_property_iris, apply_transitivity=False)

    self.synonyms = self.thesaurus.synonym_sampling()
    # sample hard negatives first as they might not be enough
    num_hard = hard_negative_ratio * len(self.synonyms)
    self.hard_nonsynonyms = self.thesaurus.hard_nonsynonym_sampling(num_hard)
    # compensate the number of hard negatives as soft negatives are almost always available
    num_soft = (soft_negative_ratio + hard_negative_ratio) * len(self.synonyms) - len(self.hard_nonsynonyms)
    self.soft_nonsynonyms = self.thesaurus.soft_nonsynonym_sampling(num_soft)

    self.info = {
        type(self).__name__: {
            "num_synonyms": len(self.synonyms),
            "num_nonsynonyms": len(self.soft_nonsynonyms) + len(self.hard_nonsynonyms),
            "num_soft_nonsynonyms": len(self.soft_nonsynonyms),
            "num_hard_nonsynonyms": len(self.hard_nonsynonyms),
            "annotation_thesaurus": self.thesaurus.info["AnnotationThesaurus"],
        }
    }

save(save_path)

Save the intra-ontology corpus (a .json file for label pairs and its summary) in the specified directory.

Source code in src/deeponto/align/bertmap/text_semantics.py
def save(self, save_path: str):
    """Save the intra-ontology corpus (a `.json` file for label pairs
    and its summary) in the specified directory.
    """
    create_path(save_path)
    save_json = {
        "summary": self.info,
        "synonyms": [(pos[0], pos[1], 1) for pos in self.synonyms],
        "nonsynonyms": [(neg[0], neg[1], 0) for neg in self.soft_nonsynonyms + self.hard_nonsynonyms],
    }
    save_file(save_json, os.path.join(save_path, "intra-onto.corpus.json"))

CrossOntologyTextSemanticsCorpus(class_mappings, src_onto, tgt_onto, annotation_property_iris, negative_ratio=4)

Class for creating the cross-ontology text semantics corpus from two ontologies and provided mappings between them.

As defined in the \(\textsf{BERTMap}\) paper, the cross-ontology text semantics corpus consists of synonym and non-synonym pairs extracted from the annotations/labels of class pairs involved in the provided cross-ontology mappings.

Attributes:

Name Type Description
class_mappings List[ReferenceMapping]

A list of cross-ontology class mappings.

src_onto Ontology

The source ontology whose class IRIs are heads of the class_mappings.

tgt_onto Ontology

The target ontology whose class IRIs are tails of the class_mappings.

annotation_property_iris List[str]

A list of annotation property IRIs used to extract the annotations.

negative_ratio int

The expected negative sample ratio of the non-synonyms to the extracted synonyms. Defaults to 4. NOTE that we do not have hard non-synonyms at the cross-ontology level.

Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(
    self,
    class_mappings: List[ReferenceMapping],
    src_onto: Ontology,
    tgt_onto: Ontology,
    annotation_property_iris: List[str],
    negative_ratio: int = 4,
):
    self.class_mappings = class_mappings
    self.src_onto = src_onto
    self.tgt_onto = tgt_onto
    # build the annotation thesaurus for each ontology
    self.src_thesaurus = AnnotationThesaurus(src_onto, annotation_property_iris)
    self.tgt_thesaurus = AnnotationThesaurus(tgt_onto, annotation_property_iris)
    self.negative_ratio = negative_ratio

    self.synonyms = self.synonym_sampling_from_mappings()
    num_negative = negative_ratio * len(self.synonyms)
    self.nonsynonyms = self.nonsynonym_sampling_from_mappings(num_negative)

    self.info = {
        type(self).__name__: {
            "num_synonyms": len(self.synonyms),
            "num_nonsynonyms": len(self.nonsynonyms),
            "num_mappings": len(self.class_mappings),
            "src_annotation_thesaurus": self.src_thesaurus.info["AnnotationThesaurus"],
            "tgt_annotation_thesaurus": self.tgt_thesaurus.info["AnnotationThesaurus"],
        }
    }

save(save_path)

Save the cross-ontology corpus (a .json file for label pairs and its summary) in the specified directory.

Source code in src/deeponto/align/bertmap/text_semantics.py
def save(self, save_path: str):
    """Save the cross-ontology corpus (a `.json` file for label pairs
    and its summary) in the specified directory.
    """
    create_path(save_path)
    save_json = {
        "summary": self.info,
        "synonyms": [(pos[0], pos[1], 1) for pos in self.synonyms],
        "nonsynonyms": [(neg[0], neg[1], 0) for neg in self.nonsynonyms],
    }
    save_file(save_json, os.path.join(save_path, "cross-onto.corpus.json"))

synonym_sampling_from_mappings()

Sample synonyms from cross-ontology class mappings.

Arguments of this method are all class attributes. See CrossOntologyTextSemanticsCorpus.

According to the \(\textsf{BERTMap}\) paper, cross-ontology synonyms are defined as label pairs that belong to two matched classes. If the class \(C\) from the source ontology and the class \(D\) from the target ontology are matched according to one of the class_mappings, then the cartesian product of the labels of \(C\) and the labels of \(D\) forms cross-ontology synonyms. Note that identity synonyms in the form of \((a, a)\) are removed because they have been covered in the intra-ontology case.

Returns:

Type Description
List[Tuple[str, str]]

A list of unique synonym pair samples from ontology class mappings.

Source code in src/deeponto/align/bertmap/text_semantics.py
def synonym_sampling_from_mappings(self):
    r"""Sample synonyms from cross-ontology class mappings.

    Arguments of this method are all class attributes.
    See [`CrossOntologyTextSemanticsCorpus`][deeponto.align.bertmap.text_semantics.CrossOntologyTextSemanticsCorpus].

    According to the $\textsf{BERTMap}$ paper, **cross-ontology synonyms** are defined as label pairs
    that belong to two **matched** classes. Suppose the class $C$ from the source ontology
    and the class $D$ from the target ontology are matched according to one of the `class_mappings`,
    then the cartesian product of labels of $C$ and labels of $D$ form cross-ontology synonyms.
    Note that **identity synonyms** in the form of $(a, a)$ are removed because they have been covered
    in the intra-ontology case.

    Returns:
        (List[Tuple[str, str]]): A list of unique synonym pair samples from ontology class mappings.
    """
    synonym_pool = []

    for class_mapping in self.class_mappings:
        src_class_iri, tgt_class_iri = class_mapping.to_tuple()
        src_class_annotations = self.src_thesaurus.annotation_index[src_class_iri]
        tgt_class_annotations = self.tgt_thesaurus.annotation_index[tgt_class_iri]
        synonym_pairs = list(itertools.product(src_class_annotations, tgt_class_annotations))
        # remove the identity synonyms as they have been covered in the intra-ontology case
        synonym_pairs = [(l, r) for l, r in synonym_pairs if l != r]
        backward_synonym_pairs = [(r, l) for l, r in synonym_pairs]
        synonym_pool += synonym_pairs + backward_synonym_pairs

    synonym_pool = uniqify(synonym_pool)
    return synonym_pool

nonsynonym_sampling_from_mappings(num_samples, max_iter=5)

Sample non-synonyms from cross-ontology class mappings.

Arguments of this method are all class attributes. See CrossOntologyTextSemanticsCorpus.

According to the \(\textsf{BERTMap}\) paper, cross-ontology non-synonyms are defined as label pairs that belong to two unmatched classes. Assuming that the provided class mappings are self-contained, in the sense that they are complete for the classes involved in them, we can randomly sample two cross-ontology classes that are not matched according to the mappings and take their labels as non-synonyms. In practice, false negatives are quite unlikely since the number of incorrect mappings is much larger than the number of correct ones.

Returns:

Type Description
List[Tuple[str, str]]

A list of unique nonsynonym pair samples from ontology class mappings.

Source code in src/deeponto/align/bertmap/text_semantics.py
def nonsynonym_sampling_from_mappings(self, num_samples: int, max_iter: int = 5):
    r"""Sample non-synonyms from cross-ontology class mappings.

    Arguments of this method are all class attributes.
    See [`CrossOntologyTextSemanticsCorpus`][deeponto.align.bertmap.text_semantics.CrossOntologyTextSemanticsCorpus].

    According to the $\textsf{BERTMap}$ paper, **cross-ontology non-synonyms** are defined as label pairs
    that belong to two **unmatched** classes. Assume that the provided class mappings are self-contained
    in the sense that they are complete for the classes involved in them, then we can randomly
    sample two cross-ontology classes that are not matched according to the mappings and take
    their labels as nonsynonyms. In practice, it is quite unlikely to obtain false negatives since
    the number of incorrect mappings is much larger than the number of correct ones.

    Returns:
        (List[Tuple[str, str]]): A list of unique nonsynonym pair samples from ontology class mappings.
    """
    nonsynonym_pool = []

    # form cross-ontology synonym groups
    cross_onto_synonym_group_pair = []
    for class_mapping in self.class_mappings:
        src_class_iri, tgt_class_iri = class_mapping.to_tuple()
        src_class_annotations = self.src_thesaurus.annotation_index[src_class_iri]
        tgt_class_annotations = self.tgt_thesaurus.annotation_index[tgt_class_iri]
        # let each matched class pair's annotations form a synonym group_pair
        cross_onto_synonym_group_pair.append((src_class_annotations, tgt_class_annotations))

    # randomly select disjoint synonym group pairs from all
    for _ in range(num_samples):
        left_class_pair, right_class_pair = tuple(random.sample(cross_onto_synonym_group_pair, 2))
        try:
            # randomly choose one label from a synonym group
            left_label = random.choice(list(left_class_pair[0]))  # choosing the src side by [0]
            right_label = random.choice(list(right_class_pair[1]))  # choosing the tgt side by [1]
            nonsynonym_pool.append((left_label, right_label))
        except:
            # skip if there are no class labels
            continue

    # DataUtils.uniqify is too slow so we should avoid operating it too often
    nonsynonym_pool = uniqify(nonsynonym_pool)
    while len(nonsynonym_pool) < num_samples and max_iter > 0:
        max_iter = max_iter - 1  # reduce the iteration to prevent exhausting loop
        nonsynonym_pool += self.nonsynonym_sampling_from_mappings(num_samples - len(nonsynonym_pool), max_iter)
        nonsynonym_pool = uniqify(nonsynonym_pool)
    return nonsynonym_pool

TextSemanticsCorpora(src_onto, tgt_onto, annotation_property_iris, class_mappings=None, auxiliary_ontos=None)

Class for creating the collection of text semantics corpora.

As defined in the \(\textsf{BERTMap}\) paper, the collection of text semantics corpora contains at least two intra-ontology sub-corpora from the source and target ontologies, respectively. If some class mappings are provided, then a cross-ontology sub-corpus will be created. If some additional auxiliary ontologies are provided, the intra-ontology corpora created from them will serve as the auxiliary sub-corpora.

Attributes:

Name Type Description
src_onto Ontology

The source ontology to be matched or aligned.

tgt_onto Ontology

The target ontology to be matched or aligned.

annotation_property_iris List[str]

A list of annotation property IRIs used to extract the annotations.

class_mappings List[ReferenceMapping]

A list of cross-ontology class mappings between the source and the target ontologies. Defaults to None.

auxiliary_ontos List[Ontology]

A list of auxiliary ontologies for augmenting more synonym/non-synonym samples. Defaults to None.

Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(
    self,
    src_onto: Ontology,
    tgt_onto: Ontology,
    annotation_property_iris: List[str],
    class_mappings: Optional[List[ReferenceMapping]] = None,
    auxiliary_ontos: Optional[List[Ontology]] = None,
):
    self.synonyms = []
    self.nonsynonyms = []

    # build intra-ontology corpora
    # negative sample ratios are by default
    self.intra_src_onto_corpus = IntraOntologyTextSemanticsCorpus(src_onto, annotation_property_iris)
    self.add_samples_from_sub_corpus(self.intra_src_onto_corpus)
    self.intra_tgt_onto_corpus = IntraOntologyTextSemanticsCorpus(tgt_onto, annotation_property_iris)
    self.add_samples_from_sub_corpus(self.intra_tgt_onto_corpus)

    # build cross-ontology corpora
    self.class_mappings = class_mappings
    self.cross_onto_corpus = None
    if self.class_mappings:
        self.cross_onto_corpus = CrossOntologyTextSemanticsCorpus(
            class_mappings, src_onto, tgt_onto, annotation_property_iris
        )
        self.add_samples_from_sub_corpus(self.cross_onto_corpus)

    # build auxiliary ontology corpora (same as intra-ontology)
    self.auxiliary_ontos = auxiliary_ontos
    self.auxiliary_onto_corpora = []
    if self.auxiliary_ontos:
        for auxiliary_onto in self.auxiliary_ontos:
            self.auxiliary_onto_corpora.append(
                IntraOntologyTextSemanticsCorpus(auxiliary_onto, annotation_property_iris)
            )
    for auxiliary_onto_corpus in self.auxiliary_onto_corpora:
        self.add_samples_from_sub_corpus(auxiliary_onto_corpus)

    # DataUtils.uniqify the samples
    self.synonyms = uniqify(self.synonyms)
    self.nonsynonyms = uniqify(self.nonsynonyms)
    # remove invalid nonsynonyms
    self.nonsynonyms = list(set(self.nonsynonyms) - set(self.synonyms))

    # summary
    self.info = {
        type(self).__name__: {
            "num_synonyms": len(self.synonyms),
            "num_nonsynonyms": len(self.nonsynonyms),
            "intra_src_onto_corpus": self.intra_src_onto_corpus.info["IntraOntologyTextSemanticsCorpus"],
            "intra_tgt_onto_corpus": self.intra_tgt_onto_corpus.info["IntraOntologyTextSemanticsCorpus"],
            "cross_onto_corpus": self.cross_onto_corpus.info["CrossOntologyTextSemanticsCorpus"]
            if self.cross_onto_corpus
            else None,
            "auxiliary_onto_corpora": [
                a.info["IntraOntologyTextSemanticsCorpus"] for a in self.auxiliary_onto_corpora
            ],
        }
    }

save(save_path)

Save the overall text semantics corpora (a .json file for label pairs and its summary) in the specified directory.

Source code in src/deeponto/align/bertmap/text_semantics.py
def save(self, save_path: str):
    """Save the overall text semantics corpora (a `.json` file for label pairs
    and its summary) in the specified directory.
    """
    create_path(save_path)
    save_json = {
        "summary": self.info,
        "synonyms": [(pos[0], pos[1], 1) for pos in self.synonyms],
        "nonsynonyms": [(neg[0], neg[1], 0) for neg in self.nonsynonyms],
    }
    save_file(save_json, os.path.join(save_path, "text-semantics.corpora.json"))

add_samples_from_sub_corpus(sub_corpus)

Add synonyms and non-synonyms from each sub-corpus to the overall collection.

Source code in src/deeponto/align/bertmap/text_semantics.py
def add_samples_from_sub_corpus(
    self, sub_corpus: Union[IntraOntologyTextSemanticsCorpus, CrossOntologyTextSemanticsCorpus]
):
    """Add synonyms and non-synonyms from each sub-corpus to the overall collection."""
    self.synonyms += sub_corpus.synonyms
    if isinstance(sub_corpus, IntraOntologyTextSemanticsCorpus):
        self.nonsynonyms += sub_corpus.soft_nonsynonyms + sub_corpus.hard_nonsynonyms
    else:
        self.nonsynonyms += sub_corpus.nonsynonyms

BERTSynonymClassifier(loaded_path, output_path, eval_mode, max_length_for_input, num_epochs_for_training=None, batch_size_for_training=None, batch_size_for_prediction=None, training_data=None, validation_data=None)

Class for BERT synonym classifier.

The main scoring module of \(\textsf{BERTMap}\) consisting of a BERT model and a binary synonym classifier.

Attributes:

Name Type Description
loaded_path str

The path to the checkpoint of a pre-trained BERT model.

output_path str

The path to the output BERT model (usually fine-tuned).

eval_mode bool

Set to False if the model is loaded for training.

max_length_for_input int

The maximum length of an input sequence.

num_epochs_for_training int

The number of epochs for training a BERT model.

batch_size_for_training int

The batch size for training a BERT model.

batch_size_for_prediction int

The batch size for making predictions.

training_data Dataset

Data for training the model when eval_mode is set to False. Defaults to None.

validation_data Dataset

Data for validating the model when eval_mode is set to False. Defaults to None.

training_args TrainingArguments

Training arguments for training the model when eval_mode is set to False. Defaults to None.

trainer Trainer

The model trainer fed with training_args and data samples. Defaults to None.

softmax torch.nn.Softmax

The softmax layer used for normalising synonym scores. Defaults to None.

Source code in src/deeponto/align/bertmap/bert_classifier.py
def __init__(
    self,
    loaded_path: str,
    output_path: str,
    eval_mode: bool,
    max_length_for_input: int,
    num_epochs_for_training: Optional[float] = None,
    batch_size_for_training: Optional[int] = None,
    batch_size_for_prediction: Optional[int] = None,
    training_data: Optional[List[Tuple[str, str, int]]] = None,  # (sentence1, sentence2, label)
    validation_data: Optional[List[Tuple[str, str, int]]] = None,
):
    # Load the pretrained BERT model from the given path
    self.loaded_path = loaded_path
    print(f"Loading a BERT model from: {self.loaded_path}.")
    self.model = AutoModelForSequenceClassification.from_pretrained(
        self.loaded_path, output_hidden_states=eval_mode
    )
    self.tokenizer = Tokenizer.from_pretrained(loaded_path)

    self.output_path = output_path
    self.eval_mode = eval_mode
    self.max_length_for_input = max_length_for_input
    self.num_epochs_for_training = num_epochs_for_training
    self.batch_size_for_training = batch_size_for_training
    self.batch_size_for_prediction = batch_size_for_prediction
    self.training_data = None
    self.validation_data = None
    self.data_stat = {}
    self.training_args = None
    self.trainer = None
    self.softmax = None

    # load the pre-trained BERT model and set it to eval mode (static)
    if self.eval_mode:
        self.eval()
    # load the pre-trained BERT model for fine-tuning
    else:
        if not training_data:
            raise RuntimeError("Training data should be provided when `eval_mode` is `False`.")
        if not validation_data:
            raise RuntimeError("Validation data should be provided when `eval_mode` is `False`.")
        # load data (max_length is used for truncation)
        self.training_data = self.load_dataset(training_data, "training")
        self.validation_data = self.load_dataset(validation_data, "validation")
        self.data_stat = {
            "num_training": len(self.training_data),
            "num_validation": len(self.validation_data),
        }

        # generate training arguments
        epoch_steps = len(self.training_data) // self.batch_size_for_training  # total steps of an epoch
        if torch.cuda.device_count() > 0:
            epoch_steps = epoch_steps // torch.cuda.device_count()  # to deal with the multi-GPU case
        # keep logging steps consistent even for small batch sizes
        # report logging on every 0.02 epoch
        logging_steps = int(epoch_steps * 0.02)
        # eval on every 0.2 epoch
        eval_steps = 10 * logging_steps
        # generate the training arguments
        self.training_args = TrainingArguments(
            output_dir=self.output_path,
            num_train_epochs=self.num_epochs_for_training,
            per_device_train_batch_size=self.batch_size_for_training,
            per_device_eval_batch_size=self.batch_size_for_training,
            warmup_ratio=0.0,
            weight_decay=0.01,
            logging_steps=logging_steps,
            logging_dir=f"{self.output_path}/tensorboard",
            eval_steps=eval_steps,
            evaluation_strategy="steps",
            do_train=True,
            do_eval=True,
            save_steps=eval_steps,
            save_total_limit=2,
            load_best_model_at_end=True,
        )
        # build the trainer
        self.trainer = Trainer(
            model=self.model,
            args=self.training_args,
            train_dataset=self.training_data,
            eval_dataset=self.validation_data,
            compute_metrics=self.compute_metrics,
            tokenizer=self.tokenizer._tokenizer,
        )
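
A minimal sketch of loading the classifier for inference (eval mode); the checkpoint name and output path are placeholders, and in the full pipeline this construction is handled by BERTMapPipeline.

bert = BERTSynonymClassifier(
    loaded_path="bert-base-uncased",      # or a path to a fine-tuned checkpoint
    output_path="./bertmap_output/bert",
    eval_mode=True,                       # no training/validation data needed
    max_length_for_input=128,
    batch_size_for_prediction=32,
)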

train(resume_from_checkpoint=None)

Start training the BERT model.

Source code in src/deeponto/align/bertmap/bert_classifier.py
def train(self, resume_from_checkpoint: Optional[Union[bool, str]] = None):
    """Start training the BERT model."""
    if self.eval_mode:
        raise RuntimeError("Training cannot be started in `eval` mode.")
    self.trainer.train(resume_from_checkpoint=resume_from_checkpoint)

eval()

Switch the model to eval mode (for making predictions).

Source code in src/deeponto/align/bertmap/bert_classifier.py
def eval(self):
    """To eval mode."""
    print("The BERT model is set to eval mode for making predictions.")
    self.model.eval()
    # TODO: to implement multi-gpus for inference
    self.device = self.get_device(device_num=0)
    self.model.to(self.device)
    self.softmax = torch.nn.Softmax(dim=1).to(self.device)

predict(sent_pairs)

Run prediction pipeline for synonym classification.

Return the softmax probabilities of predicting pairs as synonyms (index=1).

Source code in src/deeponto/align/bertmap/bert_classifier.py
def predict(self, sent_pairs: List[Tuple[str, str]]):
    r"""Run prediction pipeline for synonym classification.

    Return the `softmax` probabilities of predicting pairs as synonyms (`index=1`).
    """
    inputs = self.process_inputs(sent_pairs)
    with torch.no_grad():
        return self.softmax(self.model(**inputs).logits)[:, 1]
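
A small usage sketch, assuming bert is a BERTSynonymClassifier loaded in eval mode (the scores in the comment are illustrative only).

pairs = [
    ("heart disease", "cardiac disease"),
    ("heart disease", "kidney stone"),
]
synonym_scores = bert.predict(pairs)  # 1-D tensor of synonym probabilities
print(synonym_scores)                 # e.g. tensor([0.97, 0.02]) -- illustrative values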

load_dataset(data, split)

Load the list of (annotation1, annotation2, label) samples into a datasets.Dataset.

Source code in src/deeponto/align/bertmap/bert_classifier.py
def load_dataset(self, data: List[Tuple[str, str, int]], split: str) -> Dataset:
    r"""Load the list of `(annotation1, annotation2, label)` samples into a `datasets.Dataset`."""

    def iterate():
        for sample in data:
            yield {"annotation1": sample[0], "annotation2": sample[1], "labels": sample[2]}

    dataset = Dataset.from_generator(iterate)
    # NOTE: no padding here because the Trainer class supports dynamic padding
    dataset = dataset.map(
        lambda examples: self.tokenizer._tokenizer(
            examples["annotation1"], examples["annotation2"], max_length=self.max_length_for_input, truncation=True
        ),
        batched=True,
        desc=f"Load {split} data:",
    )
    return dataset

process_inputs(sent_pairs)

Process input sentence pairs for the BERT model.

Transform the sentences into BERT input embeddings and load them into the device. This function is called only when the BERT model is about to make predictions (eval mode).

Source code in src/deeponto/align/bertmap/bert_classifier.py
def process_inputs(self, sent_pairs: List[Tuple[str, str]]):
    r"""Process input sentence pairs for the BERT model.

    Transform the sentences into BERT input embeddings and load them into the device.
    This function is called only when the BERT model is about to make predictions (`eval` mode).
    """
    return self.tokenizer._tokenizer(
        sent_pairs,
        return_tensors="pt",
        max_length=self.max_length_for_input,
        padding=True,
        truncation=True,
    ).to(self.device)

compute_metrics(pred) staticmethod

Add more evaluation metrics into the training log.

Source code in src/deeponto/align/bertmap/bert_classifier.py
@staticmethod
def compute_metrics(pred):
    """Add more evaluation metrics into the training log."""
    # TODO: currently only accuracy is added, will expect more in the future if needed
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}

get_device(device_num=0) staticmethod

Get a device (GPU or CPU) for the torch model.

Source code in src/deeponto/align/bertmap/bert_classifier.py
@staticmethod
def get_device(device_num: int = 0):
    """Get a device (GPU or CPU) for the torch model"""
    # If there's a GPU available...
    if torch.cuda.is_available():
        # Tell PyTorch to use the GPU.
        device = torch.device(f"cuda:{device_num}")
        print("There are %d GPU(s) available." % torch.cuda.device_count())
        print("We will use the GPU:", torch.cuda.get_device_name(device_num))
    # If not...
    else:
        print("No GPU available, using the CPU instead.")
        device = torch.device("cpu")
    return device

set_seed(seed_val=888) staticmethod

Set random seed for reproducible results.

Source code in src/deeponto/align/bertmap/bert_classifier.py
@staticmethod
def set_seed(seed_val: int = 888):
    """Set random seed for reproducible results."""
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

MappingPredictor(output_path, tokenizer_path, src_annotation_index, tgt_annotation_index, bert_synonym_classifier, num_raw_candidates, num_best_predictions, batch_size_for_prediction, logger, enlighten_manager, enlighten_status, ignored_class_index=None)

Class for the mapping prediction module of \(\textsf{BERTMap}\) and \(\textsf{BERTMapLt}\) models.

Attributes:

Name Type Description
tokenizer Tokenizer

The tokenizer used for constructing the inverted annotation index and candidate selection.

src_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from src_onto according to annotation_property_iris.

tgt_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from tgt_onto according to annotation_property_iris.

tgt_inverted_annotation_index InvertedIndex

The inverted index built from tgt_annotation_index used for target class candidate selection.

bert_synonym_classifier BERTSynonymClassifier

The BERT synonym classifier fine-tuned on text semantics corpora.

num_raw_candidates int

The maximum number of selected target class candidates for a source class.

num_best_predictions int

The maximum number of best-scored mappings preserved for a source class.

batch_size_for_prediction int

The batch size of class annotation pairs for computing synonym scores.

ignored_class_index dict

OAEI argument; a dictionary that stores the (class_iri, used_in_alignment) pairs.

Source code in src/deeponto/align/bertmap/mapping_prediction.py
def __init__(
    self,
    output_path: str,
    tokenizer_path: str,
    src_annotation_index: dict,
    tgt_annotation_index: dict,
    bert_synonym_classifier: Optional[BERTSynonymClassifier],
    num_raw_candidates: Optional[int],
    num_best_predictions: Optional[int],
    batch_size_for_prediction: int,
    logger: Logger,
    enlighten_manager: enlighten.Manager,
    enlighten_status: enlighten.StatusBar,
    ignored_class_index: Optional[dict] = None,
):
    self.logger = logger
    self.enlighten_manager = enlighten_manager
    self.enlighten_status = enlighten_status

    self.tokenizer = Tokenizer.from_pretrained(tokenizer_path)

    self.logger.info("Build inverted annotation index for candidate selection.")
    self.src_annotation_index = src_annotation_index
    self.tgt_annotation_index = tgt_annotation_index
    self.tgt_inverted_annotation_index = Ontology.build_inverted_annotation_index(
        tgt_annotation_index, self.tokenizer
    )
    # the fundamental judgement for whether bertmap or bertmaplt is loaded
    self.bert_synonym_classifier = bert_synonym_classifier
    self.num_raw_candidates = num_raw_candidates
    self.num_best_predictions = num_best_predictions
    self.batch_size_for_prediction = batch_size_for_prediction
    self.output_path = output_path

    # for the OAEI, adding in check for classes that are not used in alignment
    self.ignored_class_index = ignored_class_index

    self.init_class_mapping = lambda head, tail, score: EntityMapping(head, tail, "<EquivalentTo>", score)

bert_mapping_score(src_class_annotations, tgt_class_annotations)

\(\textsf{BERTMap}\)'s main mapping score module which utilises the fine-tuned BERT synonym classifier.

Compute the synonym score for each pair of src-tgt class annotations, and return the average score as the mapping score. Apply string matching before applying the BERT module to filter easy mappings (with scores \(1.0\)).

Source code in src/deeponto/align/bertmap/mapping_prediction.py
def bert_mapping_score(
    self,
    src_class_annotations: Set[str],
    tgt_class_annotations: Set[str],
):
    r"""$\textsf{BERTMap}$'s main mapping score module which utilises the fine-tuned BERT synonym
    classifier.

    Compute the **synonym score** for each pair of src-tgt class annotations, and return
    the **average** score as the mapping score. Apply string matching before applying the
    BERT module to filter easy mappings (with scores $1.0$).
    """

    if not src_class_annotations or not tgt_class_annotations:
        warnings.warn("Return zero score due to empty input class annotations...")
        return 0.0

    # apply string matching before applying the bert module
    prelim_score = self.edit_similarity_mapping_score(
        src_class_annotations,
        tgt_class_annotations,
        string_match_only=True,
    )
    if prelim_score == 1.0:
        return prelim_score
    # apply BERT classifier and define mapping score := Average(SynonymScores)
    class_annotation_pairs = list(itertools.product(src_class_annotations, tgt_class_annotations))
    synonym_scores = self.bert_synonym_classifier.predict(class_annotation_pairs)
    # only one element tensor is able to be extracted as a scalar by .item()
    return float(torch.mean(synonym_scores).item())
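
A usage sketch, assuming predictor is a configured MappingPredictor with a fine-tuned BERT synonym classifier; the annotations are illustrative.

src_annotations = {"myocardial infarction", "heart attack"}
tgt_annotations = {"heart attack", "MI"}
score = predictor.bert_mapping_score(src_annotations, tgt_annotations)
# the shared annotation "heart attack" triggers the string-matching shortcut,
# so the returned mapping score is 1.0; otherwise the score is the average
# BERT synonym probability over all annotation pairs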

edit_similarity_mapping_score(src_class_annotations, tgt_class_annotations, string_match_only=False) staticmethod

\(\textsf{BERTMap}\)'s string match module and \(\textsf{BERTMapLt}\)'s mapping prediction function.

Compute the normalised edit similarity (1 - normalised edit distance) for each pair of src-tgt class annotations, and return the maximum score as the mapping score.

Source code in src/deeponto/align/bertmap/mapping_prediction.py
@staticmethod
def edit_similarity_mapping_score(
    src_class_annotations: Set[str],
    tgt_class_annotations: Set[str],
    string_match_only: bool = False,
):
    r"""$\textsf{BERTMap}$'s string match module and $\textsf{BERTMapLt}$'s mapping prediction function.

    Compute the **normalised edit similarity** `(1 - normalised edit distance)` for each pair
    of src-tgt class annotations, and return the **maximum** score as the mapping score.
    """

    if not src_class_annotations or not tgt_class_annotations:
        warnings.warn("Return zero score due to empty input class annotations...")
        return 0.0

    # edge case when src and tgt classes have an exact match of annotation
    if len(src_class_annotations.intersection(tgt_class_annotations)) > 0:
        return 1.0
    # a shortcut to save time for $\textsf{BERTMap}$
    if string_match_only:
        return 0.0
    annotation_pairs = itertools.product(src_class_annotations, tgt_class_annotations)
    sim_scores = [levenshtein.normalized_similarity(src, tgt) for src, tgt in annotation_pairs]
    return max(sim_scores)
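
Since this is a staticmethod, it can be called without a predictor instance; a small sketch with illustrative class labels.

score = MappingPredictor.edit_similarity_mapping_score(
    {"hepatocellular carcinoma"},
    {"hepatocellular cancer"},
)
# the score is the maximum normalised edit similarity over all annotation pairs;
# identical annotations give 1.0, unrelated strings a value close to 0.0
print(score)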

mapping_prediction_for_src_class(src_class_iri)

Predict \(N\) best scored mappings for a source ontology class, where \(N\) is specified in self.num_best_predictions.

  1. Apply the string matching module to compute "easy" mappings.
  2. Return the mappings if any are found, or if there is no BERT synonym classifier (as in \(\textsf{BERTMapLt}\)).
  3. If using the BERT synonym classifier module:

    • Generate batches of class annotation pairs. Each batch contains the combinations of the source class annotations and \(M\) target candidate classes' annotations. \(M\) is determined by batch_size_for_prediction, i.e., stop adding annotations of a target class candidate to the current batch if doing so would cause the batch size to exceed the limit.
    • Compute the synonym scores for each batch and aggregate them into mapping scores; preserve the \(N\) best-scored candidates and update them with the next batch. Through this dynamic process, we eventually obtain the \(N\) best-scored mappings for a source ontology class.
Source code in src/deeponto/align/bertmap/mapping_prediction.py
def mapping_prediction_for_src_class(self, src_class_iri: str) -> List[EntityMapping]:
    r"""Predict $N$ best scored mappings for a source ontology class, where
    $N$ is specified in `self.num_best_predictions`.

    1. Apply the **string matching** module to compute "easy" mappings.
    2. Return the mappings if found any, or if there is no BERT synonym classifier
    as in $\textsf{BERTMapLt}$.
    3. If using the BERT synonym classifier module:

        - Generate batches for class annotation pairs. Each batch contains the combinations of the
        source class annotations and $M$ target candidate classes' annotations. $M$ is determined
        by `batch_size_for_prediction`, i.e., stop adding annotations of a target class candidate into
        the current batch if this operation will cause the size of current batch to exceed the limit.
        - Compute the synonym scores for each batch and aggregate them into mapping scores; preserve
        $N$ best scored candidates and update them in the next batch. By this dynamic process, we eventually
        get $N$ best scored mappings for a source ontology class.
    """

    src_class_annotations = self.src_annotation_index[src_class_iri]
    # NOTE: use the inverted annotation index for candidate selection (a previous version wrongly passed the tokenizer here)
    tgt_class_candidates = self.tgt_inverted_annotation_index.idf_select(
        list(src_class_annotations), pool_size=len(self.tgt_annotation_index.keys())
    )  # [(tgt_class_iri, idf_score)]
    # if some classes are set to be ignored, remove them from the candidates
    if self.ignored_class_index:
        tgt_class_candidates = [(iri, idf_score) for iri, idf_score in tgt_class_candidates if not self.ignored_class_index[iri]]
    # select a truncated number of candidates
    tgt_class_candidates = tgt_class_candidates[:self.num_raw_candidates]
    best_scored_mappings = []

    # for string matching: save time if already found string-matched candidates
    def string_match():
        """Compute string-matched mappings."""
        string_matched_mappings = []
        for tgt_candidate_iri, _ in tgt_class_candidates:
            tgt_candidate_annotations = self.tgt_annotation_index[tgt_candidate_iri]
            prelim_score = self.edit_similarity_mapping_score(
                src_class_annotations,
                tgt_candidate_annotations,
                string_match_only=True,
            )
            if prelim_score > 0.0:
                # if src_class_annotations.intersection(tgt_candidate_annotations):
                string_matched_mappings.append(
                    self.init_class_mapping(src_class_iri, tgt_candidate_iri, prelim_score)
                )

        return string_matched_mappings

    best_scored_mappings += string_match()
    # return string-matched mappings if found or if there is no bert module (bertmaplt)
    if best_scored_mappings or not self.bert_synonym_classifier:
        self.logger.info(f"The best scored class mappings for {src_class_iri} are\n{best_scored_mappings}")
        return best_scored_mappings

    def generate_batched_annotations(batch_size: int):
        """Generate batches of class annotations for the input source class and its
        target candidates.
        """
        batches = []
        # the `nums` parameter determines how the annotations are grouped
        current_batch = CfgNode({"annotations": [], "nums": []})
        for i, (tgt_candidate_iri, _) in enumerate(tgt_class_candidates):
            tgt_candidate_annotations = self.tgt_annotation_index[tgt_candidate_iri]
            annotation_pairs = list(itertools.product(src_class_annotations, tgt_candidate_annotations))
            current_batch.annotations += annotation_pairs
            num_annotation_pairs = len(annotation_pairs)
            current_batch.nums.append(num_annotation_pairs)
            # collect when the batch is full or for the last target class candidate
            if sum(current_batch.nums) > batch_size or i == len(tgt_class_candidates) - 1:
                batches.append(current_batch)
                current_batch = CfgNode({"annotations": [], "nums": []})
        return batches

    def bert_match():
        """Compute mappings with fine-tuned BERT synonym classifier."""
        bert_matched_mappings = []
        class_annotation_batches = generate_batched_annotations(self.batch_size_for_prediction)
        batch_base_candidate_idx = (
            0  # after each batch, the base index will be increased by # of covered target candidates
        )
        device = self.bert_synonym_classifier.device

        # initialize N prediction scores and N corresponding indices w.r.t `tgt_class_candidates`
        final_best_scores = torch.tensor([-1] * self.num_best_predictions).to(device)
        final_best_idxs = torch.tensor([-1] * self.num_best_predictions).to(device)

        for annotation_batch in class_annotation_batches:

            synonym_scores = self.bert_synonym_classifier.predict(annotation_batch.annotations)
            # aggregate synonym scores into mapping scores
            grouped_synonym_scores = torch.split(
                synonym_scores,
                split_size_or_sections=annotation_batch.nums,
            )
            mapping_scores = torch.stack([torch.mean(chunk) for chunk in grouped_synonym_scores])
            assert len(mapping_scores) == len(annotation_batch.nums)

            # preserve N best scored mappings
            # scale N in case there are fewer than N tgt candidates in this batch
            N = min(len(mapping_scores), self.num_best_predictions)
            batch_best_scores, batch_best_idxs = torch.topk(mapping_scores, k=N)
            batch_best_idxs += batch_base_candidate_idx

            # we do the substitution for every batch to prevent memory overflow
            final_best_scores, _idxs = torch.topk(
                torch.cat([batch_best_scores, final_best_scores]),
                k=self.num_best_predictions,
            )
            final_best_idxs = torch.cat([batch_best_idxs, final_best_idxs])[_idxs]

            # update the index for target candidate classes
            batch_base_candidate_idx += len(annotation_batch.nums)

        for candidate_idx, mapping_score in zip(final_best_idxs, final_best_scores):
            # ignore initial values (-1.0) for dummy mappings
            # the threshold 0.9 is for mapping extension
            if mapping_score.item() >= 0.9:
                tgt_candidate_iri = tgt_class_candidates[candidate_idx.item()][0]
                bert_matched_mappings.append(
                    self.init_class_mapping(
                        src_class_iri,
                        tgt_candidate_iri,
                        mapping_score.item(),
                    )
                )

        assert len(bert_matched_mappings) <= self.num_best_predictions
        self.logger.info(f"The best scored class mappings for {src_class_iri} are\n{bert_matched_mappings}")
        return bert_matched_mappings

    return bert_match()
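
A usage sketch, assuming predictor is a configured MappingPredictor; the class IRI is a placeholder.

mappings = predictor.mapping_prediction_for_src_class("http://example.org/onto/Class_001")
for m in mappings:
    print(m.to_tuple(with_score=True))  # (src_class_iri, tgt_class_iri, score)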

mapping_prediction()

Apply global matching for each class in the source ontology.

See mapping_prediction_for_src_class.

If this process is accidentally stopped, it can be resumed from already saved predictions. The progress bar keeps track of the number of source ontology classes that have been matched.

Source code in src/deeponto/align/bertmap/mapping_prediction.py
def mapping_prediction(self):
    r"""Apply global matching for each class in the source ontology.

    See [`mapping_prediction_for_src_class`][deeponto.align.bertmap.mapping_prediction.MappingPredictor.mapping_prediction_for_src_class].

    If this process is accidentally stopped, it can be resumed from already saved predictions. The progress
    bar keeps track of the number of source ontology classes that have been matched.
    """
    self.logger.info("Start global matching for each class in the source ontology.")

    match_dir = os.path.join(self.output_path, "match")
    try:
        mapping_index = load_file(os.path.join(match_dir, "raw_mappings.json"))
        self.logger.info("Load the existing mapping prediction file.")
    except:
        mapping_index = dict()
        create_path(match_dir)

    progress_bar = self.enlighten_manager.counter(
        total=len(self.src_annotation_index), desc="Mapping Prediction", unit="per src class"
    )
    self.enlighten_status.update(demo="Mapping Prediction")

    for i, src_class_iri in enumerate(self.src_annotation_index.keys()):
        # skip computed classes
        if src_class_iri in mapping_index.keys():
            self.logger.info(f"[Class {i}] Skip matching {src_class_iri} as already computed.")
            progress_bar.update()
            continue
        # for OAEI
        if self.ignored_class_index and self.ignored_class_index[src_class_iri]:
            self.logger.info(f"[Class {i}] Skip matching {src_class_iri} as marked as not used in alignment.")
            progress_bar.update()
            continue
        mappings = self.mapping_prediction_for_src_class(src_class_iri)
        mapping_index[src_class_iri] = [m.to_tuple(with_score=True) for m in mappings]

        if i % 100 == 0 or i == len(self.src_annotation_index) - 1:
            save_file(mapping_index, os.path.join(match_dir, "raw_mappings.json"))
            # also save a .tsv version
            mapping_in_tuples = list(itertools.chain.from_iterable(mapping_index.values()))
            mapping_df = pd.DataFrame(mapping_in_tuples, columns=["SrcEntity", "TgtEntity", "Score"])
            mapping_df.to_csv(os.path.join(match_dir, "raw_mappings.tsv"), sep="\t", index=False)
            self.logger.info("Save currently computed mappings to prevent undesirable loss.")

        progress_bar.update()

    self.logger.info("Finished mapping prediction for each class in the source ontology.")
    progress_bar.close()

MappingRefiner(output_path, src_onto, tgt_onto, mapping_predictor, mapping_extension_threshold, mapping_filtered_threshold, logger, enlighten_manager, enlighten_status)

Class for the mapping refinement module of \(\textsf{BERTMap}\).

\(\textsf{BERTMapLt}\) does not go through mapping refinement because it is designed to be "light". All the attributes of this class are supposed to be passed from BERTMapPipeline.

Attributes:

Name Type Description
src_onto Ontology

The source ontology to be matched.

tgt_onto Ontology

The target ontology to be matched.

mapping_predictor MappingPredictor

The mapping prediction module of BERTMap.

mapping_extension_threshold float

Mappings with scores \(\geq\) this value will be considered in the iterative mapping extension process.

raw_mappings List[EntityMapping]

List of raw class mappings predicted in the global matching phase.

mapping_score_dict dict

A dynamic dictionary that keeps track of mappings (with scores) that have already been computed.

mapping_filter_threshold float

Mappings with scores \(\geq\) this value will be preserved for the final mapping repair.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def __init__(
    self,
    output_path: str,
    src_onto: Ontology,
    tgt_onto: Ontology,
    mapping_predictor: MappingPredictor,
    mapping_extension_threshold: float,
    mapping_filtered_threshold: float,
    logger: Logger,
    enlighten_manager: enlighten.Manager,
    enlighten_status: enlighten.StatusBar
):
    self.output_path = output_path
    self.logger = logger
    self.enlighten_manager = enlighten_manager
    self.enlighten_status = enlighten_status

    self.src_onto = src_onto
    self.tgt_onto = tgt_onto

    # iterative mapping extension
    self.mapping_predictor = mapping_predictor
    self.mapping_extension_threshold = mapping_extension_threshold  # \kappa
    self.raw_mappings = EntityMapping.read_table_mappings(
        os.path.join(self.output_path, "match", "raw_mappings.tsv"),
        threshold=self.mapping_extension_threshold,
        relation="<EquivalentTo>",
    )
    # keep track of already scored mappings to prevent duplicated predictions
    self.mapping_score_dict = dict()
    for m in self.raw_mappings:
        src_class_iri, tgt_class_iri, score = m.to_tuple(with_score=True)
        self.mapping_score_dict[(src_class_iri, tgt_class_iri)] = score

    # the threshold for final filtering the extended mappings
    self.mapping_filtered_threshold = mapping_filtered_threshold  # \lambda

    # logmap mapping repair folder
    self.logmap_repair_path = os.path.join(self.output_path, "match", "logmap-repair")

    # paths for mapping extension and repair
    self.extended_mapping_path = os.path.join(self.output_path, "match", "extended_mappings.tsv")
    self.filtered_mapping_path = os.path.join(self.output_path, "match", "filtered_mappings.tsv")
    self.repaired_mapping_path = os.path.join(self.output_path, "match", "repaired_mappings.tsv")

mapping_extension(max_iter=10)

Iterative mapping extension based on the locality principle.

For each class pair \((c, c')\) (scored in the global matching phase) with score \(\geq \kappa\), search for plausible mappings between the parents of \(c\) and \(c'\), and between the children of \(c\) and \(c'\). This is an iterative process, as the set of newly discovered mappings can renew the frontier for searching. The process terminates if no new mappings with score \(\geq \kappa\) can be found or the limit max_iter has been reached. Note that \(\kappa\) is set to \(0.9\) by default (can be altered in the configuration file). The mapping extension progress bar keeps track of the total number of extended mappings (including the previously predicted ones).

A further filtering step preserves only mappings with score \(\geq \lambda\). In the original BERTMap paper, \(\lambda\) is determined by the validation mappings, but in practice \(\lambda\) is not a sensitive hyperparameter and validation mappings are often not available. Therefore, we manually set \(\lambda\) to \(0.9995\) by default (can be altered in the configuration file). The mapping filtering progress bar keeps track of the total number of filtered mappings (this bar is purely for logging purposes).

Parameters:

Name Type Description Default
max_iter int

The maximum number of mapping extension iterations. Defaults to 10.

10
Source code in src/deeponto/align/bertmap/mapping_refinement.py
def mapping_extension(self, max_iter: int = 10):
    r"""Iterative mapping extension based on the locality principle.

    For each class pair $(c, c')$ (scored in the global matching phase) with score 
    $\geq \kappa$, search for plausible mappings between the parents of $c$ and $c'$,
    and between the children of $c$ and $c'$. This is an iterative process, as the set of
    newly discovered mappings can renew the frontier for searching. Terminate if
    no new mappings with score $\geq \kappa$ can be found or the limit `max_iter` has 
    been reached. Note that $\kappa$ is set to $0.9$ by default (can be altered
    in the configuration file). The mapping extension progress bar keeps track of the 
    total number of extended mappings (including the previously predicted ones).

    A further filtering will be performed by only preserving mappings with score $\geq \lambda$.
    In the original BERTMap paper, $\lambda$ is determined by the validation mappings, but
    in practice $\lambda$ is not a sensitive hyperparameter and validation mappings are often
    not available. Therefore, we manually set $\lambda$ to $0.9995$ by default (can be altered
    in the configuration file). The mapping filtering progress bar keeps track of the 
    total number of filtered mappings (this bar is purely for logging purposes).

    Args:
        max_iter (int, optional): The maximum number of mapping extension iterations. Defaults to `10`.
    """

    num_iter = 0
    self.enlighten_status.update(demo="Mapping Extension")
    extension_progress_bar = self.enlighten_manager.counter(
        desc=f"Mapping Extension [Iteration #{num_iter}]", unit="mapping"
    )
    filtering_progress_bar = self.enlighten_manager.counter(
        desc=f"Mapping Filtering", unit="mapping"
    )

    if os.path.exists(self.extended_mapping_path) and os.path.exists(self.filtered_mapping_path):
        self.logger.info(
            f"Found extended and filtered mapping files at {self.extended_mapping_path}"
            + f" and {self.filtered_mapping_path}.\nPlease check file integrity; if incomplete, "
            + "delete them and re-run the program."
        )

        # for animation purposes
        extension_progress_bar.desc = f"Mapping Extension"
        for _ in EntityMapping.read_table_mappings(self.extended_mapping_path):
            extension_progress_bar.update()

        self.enlighten_status.update(demo="Mapping Filtering")
        for _ in EntityMapping.read_table_mappings(self.filtered_mapping_path):
            filtering_progress_bar.update()

        extension_progress_bar.close()
        filtering_progress_bar.close()

        return
    # initialise the frontier and expansion sets with the raw mappings
    # NOTE be careful of address pointers
    frontier = [m.to_tuple() for m in self.raw_mappings]
    expansion = [m.to_tuple(with_score=True) for m in self.raw_mappings]
    # for animation purposes
    for _ in range(len(expansion)):
        extension_progress_bar.update()

    self.logger.info(
        f"Start mapping extension for each class pair with score >= {self.mapping_extension_threshold}."
    )
    while frontier and num_iter < max_iter:
        new_mappings = []
        for src_class_iri, tgt_class_iri in frontier:
            # one hop extension makes sure new mappings are really "new"
            cur_new_mappings = self.one_hop_extend(src_class_iri, tgt_class_iri)
            extension_progress_bar.update(len(cur_new_mappings))
            new_mappings += cur_new_mappings
        # add new mappings to the expansion set
        expansion += new_mappings
        # renew frontier with the newly discovered mappings
        frontier = [(x, y) for x, y, _ in new_mappings]

        self.logger.info(f"Add {len(new_mappings)} mappings at iteration #{num_iter}.")
        num_iter += 1
        extension_progress_bar.desc = f"Mapping Extension [Iteration #{num_iter}]"

    num_extended = len(expansion) - len(self.raw_mappings)
    self.logger.info(
        f"Finished iterative mapping extension with {num_extended} new mappings and in total {len(expansion)} extended mappings."
    )

    extended_mapping_df = pd.DataFrame(expansion, columns=["SrcEntity", "TgtEntity", "Score"])
    extended_mapping_df.to_csv(self.extended_mapping_path, sep="\t", index=False)

    self.enlighten_status.update(demo="Mapping Filtering")

    filtered_expansion = [
        (src, tgt, score) for src, tgt, score in expansion if score >= self.mapping_filtered_threshold
    ]
    self.logger.info(
        f"Filtered the extended mappings by a threshold of {self.mapping_filtered_threshold}."
        + f"There are {len(filtered_expansion)} mappings left for mapping repair."
    )

    for _ in range(len(filtered_expansion)):
        filtering_progress_bar.update()

    filtered_mapping_df = pd.DataFrame(filtered_expansion, columns=["SrcEntity", "TgtEntity", "Score"])
    filtered_mapping_df.to_csv(self.filtered_mapping_path, sep="\t", index=False)

    extension_progress_bar.close()
    filtering_progress_bar.close()
    return filtered_expansion
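
A usage sketch, assuming refiner is a configured MappingRefiner; the extended and filtered mappings are also written to extended_mappings.tsv and filtered_mappings.tsv under the match sub-folder of the output path.

filtered = refiner.mapping_extension(max_iter=10)
# `filtered` is a list of (src_iri, tgt_iri, score) tuples with score >= the filtering
# threshold, or None if previously saved result files were found and reused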

one_hop_extend(src_class_iri, tgt_class_iri, pool_size=200)

Extend mappings from a scored class pair \((c, c')\) by searching from one-hop neighbors.

Search for plausible mappings between the parents of \(c\) and \(c'\), and between the children of \(c\) and \(c'\). Mappings that are not already computed (recorded in self.mapping_score_dict) and have a score \(\geq\) self.mapping_extension_threshold will be returned as new mappings.

Parameters:

Name Type Description Default
src_class_iri str

The IRI of the source ontology class \(c\).

required
tgt_class_iri str

The IRI of the target ontology class \(c'\).

required
pool_size int

The maximum number of plausible mappings to be extended. Defaults to 200.

200

Returns:

Type Description
List[EntityMapping]

A list of one-hop extended mappings.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def one_hop_extend(self, src_class_iri: str, tgt_class_iri: str, pool_size: int = 200):
    r"""Extend mappings from a scored class pair $(c, c')$ by
    searching from one-hop neighbors.

    Search for plausible mappings between the parents of $c$ and $c'$,
    and between the children of $c$ and $c'$. Mappings that are not
    already computed (recorded in `self.mapping_score_dict`) and have
    a score $\geq$ `self.mapping_extension_threshold` will be returned as
    **new** mappings.

    Args:
        src_class_iri (str): The IRI of the source ontology class $c$.
        tgt_class_iri (str): The IRI of the target ontology class $c'$.
        pool_size (int, optional): The maximum number of plausible mappings to be extended. Defaults to 200.

    Returns:
        (List[EntityMapping]): A list of one-hop extended mappings.
    """

    def get_iris(owl_objects):
        return [str(x.getIRI()) for x in owl_objects]

    src_class = self.src_onto.get_owl_object(src_class_iri)
    src_class_parent_iris = get_iris(self.src_onto.get_asserted_parents(src_class, named_only=True))
    src_class_children_iris = get_iris(self.src_onto.get_asserted_children(src_class, named_only=True))

    tgt_class = self.tgt_onto.get_owl_object(tgt_class_iri)
    tgt_class_parent_iris = get_iris(self.tgt_onto.get_asserted_parents(tgt_class, named_only=True))
    tgt_class_children_iris = get_iris(self.tgt_onto.get_asserted_children(tgt_class, named_only=True))

    # pair up parents and children, respectively; NOTE set() might not be necessary
    parent_pairs = list(set(itertools.product(src_class_parent_iris, tgt_class_parent_iris)))
    children_pairs = list(set(itertools.product(src_class_children_iris, tgt_class_children_iris)))

    candidate_pairs = parent_pairs + children_pairs
    # downsample if the number of candidates is too large
    if len(candidate_pairs) > pool_size:
        candidate_pairs = random.sample(candidate_pairs, pool_size)

    extended_mappings = []
    for src_candidate_iri, tgt_candidate_iri in candidate_pairs:

        # if already computed meaning that it is not a new mapping
        if (src_candidate_iri, tgt_candidate_iri) in self.mapping_score_dict:
            continue

        src_candidate_annotations = self.mapping_predictor.src_annotation_index[src_candidate_iri]
        tgt_candidate_annotations = self.mapping_predictor.tgt_annotation_index[tgt_candidate_iri]
        score = self.mapping_predictor.bert_mapping_score(src_candidate_annotations, tgt_candidate_annotations)
        # add to already scored collection
        self.mapping_score_dict[(src_candidate_iri, tgt_candidate_iri)] = score

        # skip mappings with low scores
        if score < self.mapping_extension_threshold:
            continue

        extended_mappings.append((src_candidate_iri, tgt_candidate_iri, score))

    self.logger.info(
        f"New mappings (in tuples) extended from {(src_class_iri, tgt_class_iri)} are:\n" + f"{extended_mappings}"
    )

    return extended_mappings

mapping_repair()

Repair the filtered mappings with LogMap's debugger.

Note

A sub-folder under match named logmap-repair contains LogMap-related intermediate files.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def mapping_repair(self):
    """Repair the filtered mappings with LogMap's debugger.

    !!! note

        A sub-folder under `match` named `logmap-repair` contains LogMap-related intermediate files.
    """

    # progress bar for animation purposes
    self.enlighten_status.update(demo="Mapping Repairing")
    repair_progress_bar = self.enlighten_manager.counter(
        desc=f"Mapping Repairing", unit="mapping"
    )

    # skip repairing if already found the file
    if os.path.exists(self.repaired_mapping_path):
        self.logger.info(
            f"Found the repaired mapping file at {self.repaired_mapping_path}."
            + "\nPlease check file integrity; if incomplete, "
            + "delete it and re-run the program."
        )
        # update progress bar for animation purposes
        for _ in EntityMapping.read_table_mappings(self.repaired_mapping_path):
            repair_progress_bar.update()
        repair_progress_bar.close()
        return 

    # start mapping repair
    self.logger.info("Repair the filtered mappings with LogMap debugger.")
    # formatting the filtered mappings
    self.logmap_repair_formatting()

    # run the LogMap repair module on the extended mappings
    run_logmap_repair(
        self.src_onto.owl_path,
        self.tgt_onto.owl_path,
        os.path.join(self.logmap_repair_path, f"filtered_mappings_for_LogMap_repair.txt"),
        self.logmap_repair_path,
        Ontology.get_max_jvm_memory()
    )

    # create table mappings from LogMap repair outputs
    with open(os.path.join(self.logmap_repair_path, "mappings_repaired_with_LogMap.tsv"), "r") as f:
        lines = f.readlines()
    with open(os.path.join(self.output_path, "match", "repaired_mappings.tsv"), "w+") as f:
        f.write("SrcEntity\tTgtEntity\tScore\n")
        for line in lines:
            src_ent_iri, tgt_ent_iri, score = line.split("\t")
            f.write(f"{src_ent_iri}\t{tgt_ent_iri}\t{score}")
            repair_progress_bar.update()

    self.logger.info("Mapping repair finished.")
    repair_progress_bar.close()

logmap_repair_formatting()

Transform the filtered mapping file into the LogMap format.

An auxiliary function of the mapping repair module, which requires the mappings to be in LogMap's input format.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def logmap_repair_formatting(self):
    """Transform the filtered mapping file into the LogMap format.

    An auxiliary function of the mapping repair module which requires mappings
    to be formatted as LogMap's input format.
    """
    # read the filtered mapping file and convert to tuples
    filtered_mappings = EntityMapping.read_table_mappings(self.filtered_mapping_path)
    filtered_mappings_in_tuples = [m.to_tuple(with_score=True) for m in filtered_mappings]

    # write the mappings into logmap format
    lines = []
    for src_class_iri, tgt_class_iri, score in filtered_mappings_in_tuples:
        lines.append(f"{src_class_iri}|{tgt_class_iri}|=|{score}|CLS\n")

    # create a path to prevent error
    create_path(self.logmap_repair_path)
    formatted_file = os.path.join(self.logmap_repair_path, f"filtered_mappings_for_LogMap_repair.txt")
    with open(formatted_file, "w") as f:
        f.writelines(lines)

    return lines
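
For reference, each line written by this function follows LogMap's pipe-delimited input format; a sketch with placeholder IRIs and score.

line = "http://example.org/src#Heart|http://example.org/tgt#Heart|=|0.9997|CLS\n"
src_iri, tgt_iri, relation, score, entity_type = line.strip().split("|")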
