
BERTMap

Paper

\(\textsf{BERTMap}\) is proposed in the paper: BERTMap: A BERT-based Ontology Alignment System (AAAI-2022).

@inproceedings{he2022bertmap,
    title={BERTMap: a BERT-based ontology alignment system},
    author={He, Yuan and Chen, Jiaoyan and Antonyrajah, Denvar and Horrocks, Ian},
    booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={36},
    number={5},
    pages={5684--5691},
    year={2022}
}

\(\textsf{BERTMap}\) is a BERT-based ontology matching (OM) system consisting of the following components:

  • Text semantics corpora construction from input ontologies, and optionally from input mappings and other auxiliary ontologies.
  • BERT synonym classifier training on synonym and non-synonym samples in text semantics corpora.
  • Sub-word Inverted Index construction from the tokenised class annotations for candidate selection in mapping prediction.
  • Mapping Predictor, which integrates a simple edit-distance-based string matching module and the fine-tuned BERT synonym classifier for mapping scoring. For each source ontology class, target class candidates are first narrowed down using the sub-word inverted index; string matching is applied to the "easy" mappings, and BERT matching is applied to the rest.
  • Mapping Refiner, which consists of the mapping extension and mapping repair modules. Mapping extension is an iterative process based on the locality principle. Mapping repair utilises LogMap's debugger.

\(\textsf{BERTMapLt}\) is a lightweight version of \(\textsf{BERTMap}\) without the BERT module and mapping refiner.

See the tutorial for \(\textsf{BERTMap}\) here.
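A minimal usage sketch follows; the ontology file names are hypothetical, and the import paths are assumptions based on the source layout shown on this page:

from deeponto.onto import Ontology  # assumed import path for the Ontology class
from deeponto.align.bertmap import BERTMapPipeline

# NOTE: depending on the DeepOnto version, the JVM backing the OWL API may need to be
# initialised before loading ontologies (see the tutorial linked above).
config = BERTMapPipeline.load_bertmap_config()  # default configuration, or pass a custom .yaml path
src_onto = Ontology("src_onto.owl")  # hypothetical source ontology file
tgt_onto = Ontology("tgt_onto.owl")  # hypothetical target ontology file

# Constructing the pipeline runs everything: corpora construction, BERT fine-tuning
# (for bertmap), mapping prediction, and mapping extension + repair.
BERTMapPipeline(src_onto, tgt_onto, config)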

BERTMapPipeline(src_onto, tgt_onto, config)

Class for the whole ontology alignment pipeline of \(\textsf{BERTMap}\) and \(\textsf{BERTMapLt}\) models.

Note

Parameters related to BERT training are None by default. They will be constructed for \(\textsf{BERTMap}\) and stay as None for \(\textsf{BERTMapLt}\).
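For example, a sketch using the configuration keys shown in the source below:

from deeponto.align.bertmap import BERTMapPipeline

config = BERTMapPipeline.load_bertmap_config()  # default configuration
config.model = "bertmaplt"                      # "bertmap" (default) or "bertmaplt"
# Settings under config.bert (fine-tuning epochs, batch sizes, etc.) are only used when model == "bertmap".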

Attributes:

Name Type Description
config CfgNode

The configuration for BERTMap or BERTMapLt.

name str

The name of the model, either bertmap or bertmaplt.

output_path str

The path to the output directory.

src_onto Ontology

The source ontology to be matched.

tgt_onto Ontology

The target ontology to be matched.

annotation_property_iris List[str]

The annotation property IRIs used for extracting synonyms and nonsynonyms.

src_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from src_onto according to annotation_property_iris.

tgt_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from tgt_onto according to annotation_property_iris.

known_mappings List[ReferenceMapping]

List of known mappings for constructing the cross-ontology corpus.

auxiliary_ontos List[Ontology]

List of auxiliary ontologies for constructing any auxiliary corpus.

corpora dict

A dictionary that stores the summary of built text semantics corpora and the sampled synonyms and nonsynonyms.

finetune_data dict

A dictionary that stores the training and validation splits of samples from corpora.

bert BERTSynonymClassifier

A BERT model for synonym classification and mapping prediction.

best_checkpoint str

The path to the best BERT checkpoint which will be loaded after training.

mapping_predictor MappingPredictor

The predictor function based on class annotations, used for global matching or mapping scoring.

Parameters:

Name Type Description Default
src_onto Ontology

The source ontology for alignment.

required
tgt_onto Ontology

The target ontology for alignment.

required
config CfgNode

The configuration for BERTMap or BERTMapLt.

required
Source code in src/deeponto/align/bertmap/pipeline.py
def __init__(self, src_onto: Ontology, tgt_onto: Ontology, config: CfgNode):
    """Initialise the BERTMap or BERTMapLt model.

    Args:
        src_onto (Ontology): The source ontology for alignment.
        tgt_onto (Ontology): The target ontology for alignment.
        config (CfgNode): The configuration for BERTMap or BERTMapLt.
    """
    # load the configuration and confirm model name is valid
    self.config = config
    self.name = self.config.model
    if not self.name in MODEL_OPTIONS.keys():
        raise RuntimeError(f"`model` {self.name} in the config file is not one of the supported.")

    # create the output directory, e.g., experiments/bertmap
    self.config.output_path = "." if not self.config.output_path else self.config.output_path
    self.config.output_path = os.path.abspath(self.config.output_path)
    self.output_path = os.path.join(self.config.output_path, self.name)
    create_path(self.output_path)

    # create logger and progress manager (hidden attribute) 
    self.logger = create_logger(self.name, self.output_path)
    self.enlighten_manager = enlighten.get_manager()

    # ontology
    self.src_onto = src_onto
    self.tgt_onto = tgt_onto
    self.annotation_property_iris = self.config.annotation_property_iris
    self.logger.info(f"Load the following configurations:\n{print_dict(self.config)}")
    config_path = os.path.join(self.output_path, "config.yaml")
    self.logger.info(f"Save the configuration file at {config_path}.")
    self.save_bertmap_config(self.config, config_path)

    # build the annotation thesaurus
    self.src_annotation_index, _ = self.src_onto.build_annotation_index(self.annotation_property_iris, apply_lowercasing=True)
    self.tgt_annotation_index, _ = self.tgt_onto.build_annotation_index(self.annotation_property_iris, apply_lowercasing=True)
    if (not self.src_annotation_index) or (not self.tgt_annotation_index):
        raise RuntimeError("No class annotations found in input ontologies; unable to produce alignment.")

    # provided mappings if any
    self.known_mappings = self.config.known_mappings
    if self.known_mappings:
        self.known_mappings = ReferenceMapping.read_table_mappings(self.known_mappings)

    # auxiliary ontologies if any
    self.auxiliary_ontos = self.config.auxiliary_ontos
    if self.auxiliary_ontos:
        self.auxiliary_ontos = [Ontology(ao) for ao in self.auxiliary_ontos]

    self.data_path = os.path.join(self.output_path, "data")
    # load or construct the corpora
    self.corpora_path = os.path.join(self.data_path, "text-semantics.corpora.json")
    self.corpora = self.load_text_semantics_corpora()

    # load or construct fine-tune data
    self.finetune_data_path = os.path.join(self.data_path, "fine-tune.data.json")
    self.finetune_data = self.load_finetune_data()

    # load the bert model and train
    self.bert_config = self.config.bert
    self.bert_pretrained_path = self.bert_config.pretrained_path
    self.bert_finetuned_path = os.path.join(self.output_path, "bert")
    self.bert_resume_training = self.bert_config.resume_training
    self.bert_synonym_classifier = None
    self.best_checkpoint = None
    if self.name == "bertmap":
        self.bert_synonym_classifier = self.load_bert_synonym_classifier()
        # train if the loaded classifier is not in eval mode
        if self.bert_synonym_classifier.eval_mode == False:
            self.logger.info(
                f"Data statistics:\n \
                {print_dict(self.bert_synonym_classifier.data_stat)}"
            )
            self.bert_synonym_classifier.train(self.bert_resume_training)
            # turn on eval mode after training
            self.bert_synonym_classifier.eval()
        # NOTE potential redundancy here: after training, load the best checkpoint
        self.best_checkpoint = self.load_best_checkpoint()
        if not self.best_checkpoint:
            raise RuntimeError(f"No best checkpoint found for the BERT synonym classifier model.")
        self.logger.info(f"Fine-tuning finished, found best checkpoint at {self.best_checkpoint}.")
    else:
        self.logger.info(f"No training needed; skip BERT fine-tuning.")

    # pretty progress bar tracking
    self.enlighten_status = self.enlighten_manager.status_bar(
        status_format=u'Global Matching{fill}Stage: {demo}{fill}{elapsed}',
        color='bold_underline_bright_white_on_lightslategray',
        justify=enlighten.Justify.CENTER, demo='Initializing',
        autorefresh=True, min_delta=0.5
    )

    # mapping predictions
    self.global_matching_config = self.config.global_matching

    # build ignored class index for OAEI
    self.ignored_class_index = None  
    if self.global_matching_config.for_oaei:
        self.ignored_class_index = defaultdict(lambda: False)
        for src_class_iri, src_class in self.src_onto.owl_classes.items():
            use_in_alignment = self.src_onto.get_annotations(src_class, "http://oaei.ontologymatching.org/bio-ml/ann/use_in_alignment")
            if use_in_alignment and str(use_in_alignment[0]).lower() == "false":
                self.ignored_class_index[src_class_iri] = True
        for tgt_class_iri, tgt_class in self.tgt_onto.owl_classes.items():
            use_in_alignment = self.tgt_onto.get_annotations(tgt_class, "http://oaei.ontologymatching.org/bio-ml/ann/use_in_alignment")
            if use_in_alignment and str(use_in_alignment[0]).lower() == "false":
                self.ignored_class_index[tgt_class_iri] = True

    self.mapping_predictor = MappingPredictor(
        output_path=self.output_path,
        tokenizer_path=self.bert_config.pretrained_path,
        src_annotation_index=self.src_annotation_index,
        tgt_annotation_index=self.tgt_annotation_index,
        bert_synonym_classifier=self.bert_synonym_classifier,
        num_raw_candidates=self.global_matching_config.num_raw_candidates,
        num_best_predictions=self.global_matching_config.num_best_predictions,
        batch_size_for_prediction=self.bert_config.batch_size_for_prediction,
        logger=self.logger,
        enlighten_manager=self.enlighten_manager,
        enlighten_status=self.enlighten_status,
        ignored_class_index=self.ignored_class_index,
    )
    self.mapping_refiner = None

    # if global matching is disabled (potentially used for class pair scoring)
    if self.config.global_matching.enabled:
        self.mapping_predictor.mapping_prediction()  # mapping prediction
        if self.name == "bertmap":
            self.mapping_refiner = MappingRefiner(
                output_path=self.output_path,
                src_onto=self.src_onto,
                tgt_onto=self.tgt_onto,
                mapping_predictor=self.mapping_predictor,
                mapping_extension_threshold=self.global_matching_config.mapping_extension_threshold,
                mapping_filtered_threshold=self.global_matching_config.mapping_filtered_threshold,
                logger=self.logger,
                enlighten_manager=self.enlighten_manager,
                enlighten_status=self.enlighten_status
            )
            self.mapping_refiner.mapping_extension()  # mapping extension
            self.mapping_refiner.mapping_repair()  # mapping repair
        self.enlighten_status.update(demo="Finished")  
    else:
        self.enlighten_status.update(demo="Skipped")  

    self.enlighten_status.close()

load_or_construct(data_file, data_name, construct_func, *args, **kwargs)

Load existing data or construct a new one.

An auxiliary function that checks for the existence of a data file and loads it if it exists. Otherwise, it constructs new data with the input construct_func, which is supposed to generate a local data file.

Source code in src/deeponto/align/bertmap/pipeline.py
def load_or_construct(self, data_file: str, data_name: str, construct_func: Callable, *args, **kwargs):
    """Load existing data or construct a new one.

    An auxiliary function that checks the existence of a data file and loads it if it exists.
    Otherwise, it constructs new data with the input `construct_func`, which is supposed to generate
    a local data file.
    """
    if os.path.exists(data_file):
        self.logger.info(f"Load existing {data_name} from {data_file}.")
    else:
        self.logger.info(f"Construct new {data_name} and save at {data_file}.")
        construct_func(*args, **kwargs)
    # load the data file that is supposed to be saved locally
    return load_file(data_file)

load_text_semantics_corpora()

Load or construct text semantics corpora.

See TextSemanticsCorpora.

Source code in src/deeponto/align/bertmap/pipeline.py
def load_text_semantics_corpora(self):
    """Load or construct text semantics corpora.

    See [`TextSemanticsCorpora`][deeponto.align.bertmap.text_semantics.TextSemanticsCorpora].
    """
    data_name = "text semantics corpora"

    if self.name == "bertmap":

        def construct():
            corpora = TextSemanticsCorpora(
                src_onto=self.src_onto,
                tgt_onto=self.tgt_onto,
                annotation_property_iris=self.annotation_property_iris,
                class_mappings=self.known_mappings,
                auxiliary_ontos=self.auxiliary_ontos,
            )
            self.logger.info(str(corpora))
            corpora.save(self.data_path)

        return self.load_or_construct(self.corpora_path, data_name, construct)

    self.logger.info(f"No training needed; skip the construction of {data_name}.")
    return None

load_finetune_data()

Load or construct fine-tuning data from text semantics corpora.

Steps of constructing fine-tuning data from text semantics:

  1. Mix synonym and nonsynonym data.
  2. Randomly sample 90% as training samples and 10% as validation.
Source code in src/deeponto/align/bertmap/pipeline.py
def load_finetune_data(self):
    r"""Load or construct fine-tuning data from text semantics corpora.

    Steps of constructing fine-tuning data from text semantics:

    1. Mix synonym and nonsynonym data.
    2. Randomly sample 90% as training samples and 10% as validation.
    """
    data_name = "fine-tuning data"

    if self.name == "bertmap":

        def construct():
            finetune_data = dict()
            samples = self.corpora["synonyms"] + self.corpora["nonsynonyms"]
            random.shuffle(samples)
            split_index = int(0.9 * len(samples))  # split at 90%
            finetune_data["training"] = samples[:split_index]
            finetune_data["validation"] = samples[split_index:]
            save_file(finetune_data, self.finetune_data_path)

        return self.load_or_construct(self.finetune_data_path, data_name, construct)

    self.logger.info(f"No training needed; skip the construction of {data_name}.")
    return None

load_bert_synonym_classifier()

Load the BERT model from a pre-trained or a local checkpoint.

  • If loaded from a pre-trained checkpoint, training will start from that pre-trained model (e.g., bert-uncased).
  • If loaded from a local checkpoint, the eval mode is turned on for mapping predictions.
  • If self.bert_resume_training is True, it will be loaded from the latest saved checkpoint.
Source code in src/deeponto/align/bertmap/pipeline.py
def load_bert_synonym_classifier(self):
    """Load the BERT model from a pre-trained or a local checkpoint.

    - If loaded from pre-trained, it means to start training from a pre-trained model such as `bert-uncased`.
    - If loaded from local, turn on the `eval` mode for mapping predictions.
    - If `self.bert_resume_training` is `True`, it will be loaded from the latest saved checkpoint.
    """
    checkpoint = self.load_best_checkpoint()  # load the best checkpoint or nothing
    eval_mode = True
    # if no checkpoint has been found, start training from scratch OR resume training
    # no point to load the best checkpoint if resume training (will automatically search for the latest checkpoint)
    if not checkpoint or self.bert_resume_training:
        checkpoint = self.bert_pretrained_path
        eval_mode = False  # since it is for training now

    return BERTSynonymClassifier(
        loaded_path=checkpoint,
        output_path=self.bert_finetuned_path,
        eval_mode=eval_mode,
        max_length_for_input=self.bert_config.max_length_for_input,
        num_epochs_for_training=self.bert_config.num_epochs_for_training,
        batch_size_for_training=self.bert_config.batch_size_for_training,
        batch_size_for_prediction=self.bert_config.batch_size_for_prediction,
        training_data=self.finetune_data["training"],
        validation_data=self.finetune_data["validation"],
    )

load_best_checkpoint()

Find the best checkpoint by searching for trainer states in each checkpoint file.

Source code in src/deeponto/align/bertmap/pipeline.py
def load_best_checkpoint(self) -> Optional[str]:
    """Find the best checkpoint by searching for trainer states in each checkpoint file."""
    best_checkpoint = -1

    if os.path.exists(self.bert_finetuned_path):
        for file in os.listdir(self.bert_finetuned_path):
            # load trainer states from each checkpoint file
            if file.startswith("checkpoint"):
                trainer_state = load_file(
                    os.path.join(self.bert_finetuned_path, file, "trainer_state.json")
                )
                checkpoint = int(trainer_state["best_model_checkpoint"].split("/")[-1].split("-")[-1])
                # find the latest best checkpoint
                if checkpoint > best_checkpoint:
                    best_checkpoint = checkpoint

    if best_checkpoint == -1:
        best_checkpoint = None
    else:
        best_checkpoint = os.path.join(self.bert_finetuned_path, f"checkpoint-{best_checkpoint}")

    return best_checkpoint

load_bertmap_config(config_file=None) staticmethod

Load the BERTMap configuration in .yaml. If the file is not provided, use the default configuration.

Source code in src/deeponto/align/bertmap/pipeline.py
@staticmethod
def load_bertmap_config(config_file: Optional[str] = None):
    """Load the BERTMap configuration in `.yaml`. If the file
    is not provided, use the default configuration.
    """
    if not config_file:
        config_file = DEFAULT_CONFIG_FILE
        print(f"Use the default configuration at {DEFAULT_CONFIG_FILE}.")  
    if not config_file.endswith(".yaml"):
        raise RuntimeError("Configuration file should be in `yaml` format.")
    return CfgNode(load_file(config_file))

save_bertmap_config(config, config_file) staticmethod

Save the BERTMap configuration in .yaml.

Source code in src/deeponto/align/bertmap/pipeline.py
@staticmethod
def save_bertmap_config(config: CfgNode, config_file: str):
    """Save the BERTMap configuration in `.yaml`."""
    with open(config_file, "w") as c:
        config.dump(stream=c, sort_keys=False, default_flow_style=False)

AnnotationThesaurus(onto, annotation_property_iris, apply_transitivity=False)

A thesaurus class for synonyms and non-synonyms extracted from an ontology.

Some related definitions of arguments here:

  • A synonym_group is a set of annotation phrases that are synonymous to each other;
  • The transitivity of synonyms means that if A and B are synonymous and B and C are synonymous, then A and C are synonymous. This is achieved by a connected-graph-based algorithm.
  • A synonym_pair is a pair of synonymous annotation phrases, which can be extracted from the cartesian product of a synonym_group with itself (see the short example after this list). NOTE that reflexivity and symmetry are preserved, meaning that (i) every phrase A is a synonym of itself and (ii) if (A, B) is a synonym pair then (B, A) is a synonym pair, too.
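For illustration, a minimal sketch; the labels and the direct import from the text_semantics module are assumptions:

from deeponto.align.bertmap.text_semantics import AnnotationThesaurus  # assumed import path

group = {"heart attack", "myocardial infarction"}  # a hypothetical synonym group
pairs = AnnotationThesaurus.get_synonym_pairs(group)
# The self-product keeps reflexive and symmetric pairs (order may vary), e.g.:
# ('heart attack', 'heart attack'), ('heart attack', 'myocardial infarction'),
# ('myocardial infarction', 'heart attack'), ('myocardial infarction', 'myocardial infarction')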

Attributes:

Name Type Description
onto Ontology

An ontology to construct the annotation thesaurus from.

annotation_index Dict[str, Set[str]]

An index of the class annotations with (class_iri, annotations) pairs.

annotation_property_iris List[str]

A list of annotation property IRIs used to extract the annotations.

average_number_of_annotations_per_class int

The average number of (extracted) annotations per ontology class.

apply_transitivity bool

Apply synonym transitivity to merge synonym groups or not.

synonym_groups List[Set[str]]

The list of synonym groups extracted from the ontology according to specified annotation properties.

Parameters:

Name Type Description Default
onto Ontology

The input ontology to extract annotations from.

required
annotation_property_iris List[str]

Specify which annotation properties to be used.

required
apply_transitivity bool

Apply synonym transitivity to merge synonym groups or not. Defaults to False.

False
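A small construction sketch; the ontology file, the chosen annotation property, and the import paths are assumptions for illustration:

from deeponto.onto import Ontology  # assumed import path
from deeponto.align.bertmap.text_semantics import AnnotationThesaurus  # assumed import path

onto = Ontology("example.owl")  # hypothetical ontology file
label_iri = "http://www.w3.org/2000/01/rdf-schema#label"
thesaurus = AnnotationThesaurus(onto, [label_iri], apply_transitivity=False)
print(thesaurus.info)  # summary: number of synonym groups, average annotations per class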
Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(self, onto: Ontology, annotation_property_iris: List[str], apply_transitivity: bool = False):
    r"""Initialise a thesaurus for ontology class annotations.

    Args:
        onto (Ontology): The input ontology to extract annotations from.
        annotation_property_iris (List[str]): Specify which annotation properties to be used.
        apply_transitivity (bool, optional): Apply synonym transitivity to merge synonym groups or not. Defaults to `False`.
    """

    self.onto = onto
    # build the annotation index to extract synonyms from `onto`
    # the input property iris may not exist in this ontology
    # the output property iris will be truncated to the existing ones
    index, iris = self.onto.build_annotation_index(
        annotation_property_iris=annotation_property_iris,
        entity_type="Classes",
        apply_lowercasing=True,
    )
    self.annotation_index = index
    self.annotation_property_iris = iris
    total_number_of_annotations = sum([len(v) for v in self.annotation_index.values()])
    self.average_number_of_annotations_per_class = total_number_of_annotations / len(self.annotation_index)

    # synonym groups
    self.apply_transitivity = apply_transitivity
    self.synonym_groups = list(self.annotation_index.values())
    if self.apply_transitivity:
        self.synonym_groups = self.merge_synonym_groups_by_transitivity(self.synonym_groups)

    # summary
    self.info = {
        type(self).__name__: {
            "ontology": self.onto.info[type(self.onto).__name__],
            "average_number_of_annotations_per_class": round(self.average_number_of_annotations_per_class, 3),
            "number_of_synonym_groups": len(self.synonym_groups),
        }
    }

get_synonym_pairs(synonym_group, remove_duplicates=True) staticmethod

Get synonym pairs from a synonym group through a cartesian product.

Parameters:

Name Type Description Default
synonym_group Set[str]

A set of annotation phrases that are synonymous to each other.

required

Returns:

Type Description
List[Tuple[str, str]]

A list of synonym pairs.

Source code in src/deeponto/align/bertmap/text_semantics.py
@staticmethod
def get_synonym_pairs(synonym_group: Set[str], remove_duplicates: bool = True):
    """Get synonym pairs from a synonym group through a cartesian product.

    Args:
        synonym_group (Set[str]): A set of annotation phrases that are synonymous to each other.

    Returns:
        (List[Tuple[str, str]]): A list of synonym pairs.
    """
    synonym_pairs = list(itertools.product(synonym_group, synonym_group))
    if remove_duplicates:
        return uniqify(synonym_pairs)
    else:
        return synonym_pairs

merge_synonym_groups_by_transitivity(synonym_groups) staticmethod

Merge synonym groups by transitivity.

Synonym groups that share a common annotation phrase will be merged. NOTE that for multiple ontologies, we can merge their synonym groups by first concatenating them and then using this function.

Note

In \(\textsf{BERTMap}\) experiments, we have considered this as a data augmentation approach, but it does not bring a significant performance improvement. However, if the overall number of annotations is not large enough, this could be a good option.

Parameters:

Name Type Description Default
synonym_groups List[Set[str]]

A sequence of synonym groups to be merged.

required

Returns:

Type Description
List[Set[str]]

A list of merged synonym groups.

Source code in src/deeponto/align/bertmap/text_semantics.py
@staticmethod
def merge_synonym_groups_by_transitivity(synonym_groups: List[Set[str]]):
    r"""Merge synonym groups by transitivity.

    Synonym groups that share a common annotation phrase will be merged. NOTE that for
    multiple ontologies, we can merge their synonym groups by first concatenating them
    and then using this function.

    !!! note

        In $\textsf{BERTMap}$ experiments we have considered this as a data augmentation approach
        but it does not bring a significant performance improvement. However, if the
        overall number of annotations is not large enough then this could be a good option.

    Args:
        synonym_groups (List[Set[str]]): A sequence of synonym groups to be merged.

    Returns:
        (List[Set[str]]): A list of merged synonym groups.
    """
    synonym_pairs = []
    for synonym_group in synonym_groups:
        # gather synonym pairs from the self-product of a synonym group
        synonym_pairs += AnnotationThesaurus.get_synonym_pairs(synonym_group, remove_duplicates=False)
    synonym_pairs = uniqify(synonym_pairs)
    merged_grouped_synonyms = AnnotationThesaurus.connected_annotations(synonym_pairs)
    return merged_grouped_synonyms

connected_annotations(synonym_pairs) staticmethod

Build a graph for adjacency among the class annotations (labels) such that the transitivity of synonyms is ensured.

Auxiliary function for merge_synonym_groups_by_transitivity.

Parameters:

Name Type Description Default
synonym_pairs List[Tuple[str, str]]

List of pairs of phrases that are synonymous.

required

Returns:

Type Description
List[Set[str]]

A list of synonym groups.

Source code in src/deeponto/align/bertmap/text_semantics.py
@staticmethod
def connected_annotations(synonym_pairs: List[Tuple[str, str]]):
    """Build a graph for adjacency among the class annotations (labels) such that
    the **transitivity** of synonyms is ensured.

    Auxiliary function for [`merge_synonym_groups_by_transitivity`][deeponto.align.bertmap.text_semantics.AnnotationThesaurus.merge_synonym_groups_by_transitivity].

    Args:
        synonym_pairs (List[Tuple[str, str]]): List of pairs of phrases that are synonymous.

    Returns:
        (List[Set[str]]): A list of synonym groups.
    """
    graph = nx.Graph()
    graph.add_edges_from(synonym_pairs)
    # nx.draw(G, with_labels = True)
    connected = list(nx.connected_components(graph))
    return connected

synonym_sampling(num_samples=None)

Sample synonym pairs from a list of synonym groups extracted from the input ontology.

According to the \(\textsf{BERTMap}\) paper, synonyms are defined as label pairs that belong to the same ontology class.

NOTE that this has been validated to produce the same results as the original \(\textsf{BERTMap}\) repository.

Parameters:

Name Type Description Default
num_samples int

The (maximum) number of unique samples extracted. Defaults to None.

None

Returns:

Type Description
List[Tuple[str, str]]

A list of unique synonym pair samples.

Source code in src/deeponto/align/bertmap/text_semantics.py
def synonym_sampling(self, num_samples: Optional[int] = None):
    r"""Sample synonym pairs from a list of synonym groups extracted from the input ontology.

    According to the $\textsf{BERTMap}$ paper, **synonyms** are defined as label pairs that belong
    to the same ontology class.

    NOTE this has been validated for getting the same results as in the original $\textsf{BERTMap}$ repository.

    Args:
        num_samples (int, optional): The (maximum) number of **unique** samples extracted. Defaults to `None`.

    Returns:
        (List[Tuple[str, str]]): A list of unique synonym pair samples.
    """
    synonym_pool = []
    for synonym_group in self.synonym_groups:
        # do not remove duplicates in the loop to save time
        synonym_pairs = self.get_synonym_pairs(synonym_group, remove_duplicates=False)
        synonym_pool += synonym_pairs
    # remove duplicates after the loop
    synonym_pool = uniqify(synonym_pool)

    if (not num_samples) or (num_samples >= len(synonym_pool)):
        # print("Return all synonym pairs without downsampling.")
        return synonym_pool
    else:
        return random.sample(synonym_pool, num_samples)

soft_nonsynonym_sampling(num_samples, max_iter=5)

Sample soft non-synonyms from a list of synonym groups extracted from the input ontology.

According to the \(\textsf{BERTMap}\) paper, soft non-synonyms are defined as label pairs from two different synonym groups that are randomly selected.

Parameters:

Name Type Description Default
num_samples int

The (maximum) number of unique samples extracted; this is required unlike for synonym sampling because the non-synonym pool is significantly larger (considering random combinations of different synonym groups).

required
max_iter int

The maximum number of iterations for conducting sampling. Defaults to 5.

5

Returns:

Type Description
List[Tuple[str, str]]

A list of unique (soft) non-synonym pair samples.

Source code in src/deeponto/align/bertmap/text_semantics.py
def soft_nonsynonym_sampling(self, num_samples: int, max_iter: int = 5):
    r"""Sample **soft** non-synonyms from a list of synonym groups extracted from the input ontology.

    According to the $\textsf{BERTMap}$ paper, **soft non-synonyms** are defined as label pairs
    from two *different* synonym groups that are **randomly** selected.

    Args:
        num_samples (int): The (maximum) number of **unique** samples extracted; this is
            required **unlike for synonym sampling** because the non-synonym pool is **significantly
            larger** (considering random combinations of different synonym groups).
        max_iter (int): The maximum number of iterations for conducting sampling. Defaults to `5`.

    Returns:
        (List[Tuple[str, str]]): A list of unique (soft) non-synonym pair samples.
    """
    nonsynonym_pool = []
    # randomly select disjoint synonym group pairs from all
    for _ in range(num_samples):
        left_synonym_group, right_synonym_group = tuple(random.sample(self.synonym_groups, 2))
        try:
            # randomly choose one label from a synonym group
            left_label = random.choice(list(left_synonym_group))
            right_label = random.choice(list(right_synonym_group))
            nonsynonym_pool.append((left_label, right_label))
        except:
            # skip if there are no class labels
            continue

    # DataUtils.uniqify is too slow so we should avoid operating it too often
    nonsynonym_pool = uniqify(nonsynonym_pool)

    while len(nonsynonym_pool) < num_samples and max_iter > 0:
        max_iter = max_iter - 1  # reduce the iteration to prevent exhausting loop
        nonsynonym_pool += self.soft_nonsynonym_sampling(num_samples - len(nonsynonym_pool), max_iter)
        nonsynonym_pool = uniqify(nonsynonym_pool)

    return nonsynonym_pool

weighted_random_choices_of_sibling_groups(k=1)

Randomly (weighted) select a number of sibling class groups.

The weights are computed according to the sizes of the sibling class groups.

Source code in src/deeponto/align/bertmap/text_semantics.py
def weighted_random_choices_of_sibling_groups(self, k: int = 1):
    """Randomly (weighted) select a number of sibling class groups.

    The weights are computed according to the sizes of the sibling class groups.
    """
    weights = [len(s) for s in self.onto.sibling_class_groups]
    weights = [w / sum(weights) for w in weights]  # normalised
    return random.choices(self.onto.sibling_class_groups, weights=weights, k=k)

hard_nonsynonym_sampling(num_samples, max_iter=5)

Sample hard non-synonyms from sibling classes of the input ontology.

According to the \(\textsf{BERTMap}\) paper, hard non-synonyms are defined as label pairs that belong to two disjoint ontology classes. For practical reasons, the condition is relaxed to two sibling ontology classes.

Parameters:

Name Type Description Default
num_samples int

The (maximum) number of unique samples extracted; this is required unlike for synonym sampling because the non-synonym pool is significantly larger (considering random combinations of different synonym groups).

required
max_iter int

The maximum number of iterations for conducting sampling. Defaults to 5.

5

Returns:

Type Description
List[Tuple[str, str]]

A list of unique (hard) non-synonym pair samples.

Source code in src/deeponto/align/bertmap/text_semantics.py
def hard_nonsynonym_sampling(self, num_samples: int, max_iter: int = 5):
    r"""Sample **hard** non-synonyms from sibling classes of the input ontology.

    According to the $\textsf{BERTMap}$ paper, **hard non-synonyms** are defined as label pairs
    that belong to two **disjoint** ontology classes. For practical reason, the condition
    is eased to two **sibling** ontology classes.

    Args:
        num_samples (int): The (maximum) number of **unique** samples extracted; this is
            required **unlike for synonym sampling** because the non-synonym pool is **significantly
            larger** (considering random combinations of different synonym groups).
        max_iter (int): The maximum number of iterations for conducting sampling. Defaults to `5`.

    Returns:
        (List[Tuple[str, str]]): A list of unique (hard) non-synonym pair samples.
    """
    # initialise the sibling class groups
    self.onto.sibling_class_groups

    if not self.onto.sibling_class_groups:
        warnings.warn("Skip hard negative sampling as no sibling class groups are defined.")
        return []

    # flatten the disjointness groups into all pairs of hard negatives
    nonsynonym_pool = []
    # randomly (weighted) select a number of sibling class groups with replacement
    sibling_class_groups = self.weighted_random_choices_of_sibling_groups(k=num_samples)

    for sibling_class_group in sibling_class_groups:
        # randomly select two sibling classes; no weights this time
        left_class_iri, right_class_iri = tuple(random.sample(sibling_class_group, 2))
        try:
            # randomly select a label for each of them
            left_label = random.choice(list(self.annotation_index[left_class_iri]))
            right_label = random.choice(list(self.annotation_index[right_class_iri]))
            # add the label pair to the pool
            nonsynonym_pool.append((left_label, right_label))
        except:
            # skip them if there are no class labels
            continue

    # DataUtils.uniqify is too slow so we should avoid operating it too often
    nonsynonym_pool = uniqify(nonsynonym_pool)

    while len(nonsynonym_pool) < num_samples and max_iter > 0:
        max_iter = max_iter - 1  # reduce the iteration to prevent exhausting loop
        nonsynonym_pool += self.hard_nonsynonym_sampling(num_samples - len(nonsynonym_pool), max_iter)
        nonsynonym_pool = uniqify(nonsynonym_pool)

    return nonsynonym_pool

IntraOntologyTextSemanticsCorpus(onto, annotation_property_iris, soft_negative_ratio=2, hard_negative_ratio=2)

Class for creating the intra-ontology text semantics corpus from an ontology.

As defined in the \(\textsf{BERTMap}\) paper, the intra-ontology text semantics corpus consists of synonym and non-synonym pairs extracted from the ontology class annotations.

Attributes:

Name Type Description
onto Ontology

An ontology to construct the intra-ontology text semantics corpus from.

annotation_property_iris List[str]

Specify which annotation properties to be used.

soft_negative_ratio int

The expected negative sample ratio of the soft non-synonyms to the extracted synonyms. Defaults to 2.

hard_negative_ratio int

The expected negative sample ratio of the hard non-synonyms to the extracted synonyms. Defaults to 2. However, hard non-synonyms are sometimes insufficient given an ontology's hierarchy; in that case, soft non-synonyms are used to make up the shortfall.
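A brief usage sketch; the ontology file, the chosen annotation property, the output directory, and the import paths are assumptions for illustration:

from deeponto.onto import Ontology  # assumed import path
from deeponto.align.bertmap.text_semantics import IntraOntologyTextSemanticsCorpus  # assumed import path

onto = Ontology("example.owl")  # hypothetical ontology file
label_iri = "http://www.w3.org/2000/01/rdf-schema#label"
corpus = IntraOntologyTextSemanticsCorpus(onto, [label_iri])  # default ratios: 2 soft + 2 hard negatives per synonym
corpus.save("experiments/bertmap/data")  # writes intra-onto.corpus.json with label pairs and a summary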

Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(
    self,
    onto: Ontology,
    annotation_property_iris: List[str],
    soft_negative_ratio: int = 2,
    hard_negative_ratio: int = 2,
):
    self.onto = onto
    # $\textsf{BERTMap}$ does not apply synonym transitivity
    self.thesaurus = AnnotationThesaurus(onto, annotation_property_iris, apply_transitivity=False)

    self.synonyms = self.thesaurus.synonym_sampling()
    # sample hard negatives first as they might not be enough
    num_hard = hard_negative_ratio * len(self.synonyms)
    self.hard_nonsynonyms = self.thesaurus.hard_nonsynonym_sampling(num_hard)
    # compensate the number of hard negatives as soft negatives are almost always available
    num_soft = (soft_negative_ratio + hard_negative_ratio) * len(self.synonyms) - len(self.hard_nonsynonyms)
    self.soft_nonsynonyms = self.thesaurus.soft_nonsynonym_sampling(num_soft)

    self.info = {
        type(self).__name__: {
            "num_synonyms": len(self.synonyms),
            "num_nonsynonyms": len(self.soft_nonsynonyms) + len(self.hard_nonsynonyms),
            "num_soft_nonsynonyms": len(self.soft_nonsynonyms),
            "num_hard_nonsynonyms": len(self.hard_nonsynonyms),
            "annotation_thesaurus": self.thesaurus.info["AnnotationThesaurus"],
        }
    }

save(save_path)

Save the intra-ontology corpus (a .json file for label pairs and its summary) in the specified directory.

Source code in src/deeponto/align/bertmap/text_semantics.py
def save(self, save_path: str):
    """Save the intra-ontology corpus (a `.json` file for label pairs
    and its summary) in the specified directory.
    """
    create_path(save_path)
    save_json = {
        "summary": self.info,
        "synonyms": [(pos[0], pos[1], 1) for pos in self.synonyms],
        "nonsynonyms": [(neg[0], neg[1], 0) for neg in self.soft_nonsynonyms + self.hard_nonsynonyms],
    }
    save_file(save_json, os.path.join(save_path, "intra-onto.corpus.json"))

CrossOntologyTextSemanticsCorpus(class_mappings, src_onto, tgt_onto, annotation_property_iris, negative_ratio=4)

Class for creating the cross-ontology text semantics corpus from two ontologies and provided mappings between them.

As defined in the \(\textsf{BERTMap}\) paper, the cross-ontology text semantics corpus consists of synonym and non-synonym pairs extracted from the annotations/labels of class pairs involved in the provided cross-ontology mappings.

Attributes:

Name Type Description
class_mappings List[ReferenceMapping]

A list of cross-ontology class mappings.

src_onto Ontology

The source ontology whose class IRIs are heads of the class_mappings.

tgt_onto Ontology

The target ontology whose class IRIs are tails of the class_mappings.

annotation_property_iris List[str]

A list of annotation property IRIs used to extract the annotations.

negative_ratio int

The expected negative sample ratio of the non-synonyms to the extracted synonyms. Defaults to 4. NOTE that we do not have hard non-synonyms at the cross-ontology level.

Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(
    self,
    class_mappings: List[ReferenceMapping],
    src_onto: Ontology,
    tgt_onto: Ontology,
    annotation_property_iris: List[str],
    negative_ratio: int = 4,
):
    self.class_mappings = class_mappings
    self.src_onto = src_onto
    self.tgt_onto = tgt_onto
    # build the annotation thesaurus for each ontology
    self.src_thesaurus = AnnotationThesaurus(src_onto, annotation_property_iris)
    self.tgt_thesaurus = AnnotationThesaurus(tgt_onto, annotation_property_iris)
    self.negative_ratio = negative_ratio

    self.synonyms = self.synonym_sampling_from_mappings()
    num_negative = negative_ratio * len(self.synonyms)
    self.nonsynonyms = self.nonsynonym_sampling_from_mappings(num_negative)

    self.info = {
        type(self).__name__: {
            "num_synonyms": len(self.synonyms),
            "num_nonsynonyms": len(self.nonsynonyms),
            "num_mappings": len(self.class_mappings),
            "src_annotation_thesaurus": self.src_thesaurus.info["AnnotationThesaurus"],
            "tgt_annotation_thesaurus": self.tgt_thesaurus.info["AnnotationThesaurus"],
        }
    }

save(save_path)

Save the cross-ontology corpus (a .json file for label pairs and its summary) in the specified directory.

Source code in src/deeponto/align/bertmap/text_semantics.py
def save(self, save_path: str):
    """Save the cross-ontology corpus (a `.json` file for label pairs
    and its summary) in the specified directory.
    """
    create_path(save_path)
    save_json = {
        "summary": self.info,
        "synonyms": [(pos[0], pos[1], 1) for pos in self.synonyms],
        "nonsynonyms": [(neg[0], neg[1], 0) for neg in self.nonsynonyms],
    }
    save_file(save_json, os.path.join(save_path, "cross-onto.corpus.json"))

synonym_sampling_from_mappings()

Sample synonyms from cross-ontology class mappings.

Arguments of this method are all class attributes. See CrossOntologyTextSemanticsCorpus.

According to the \(\textsf{BERTMap}\) paper, cross-ontology synonyms are defined as label pairs that belong to two matched classes. If the class \(C\) from the source ontology and the class \(D\) from the target ontology are matched according to one of the class_mappings, then the cartesian product of the labels of \(C\) and the labels of \(D\) forms cross-ontology synonyms. Note that identity synonyms in the form of \((a, a)\) are removed because they have been covered in the intra-ontology case.

Returns:

Type Description
List[Tuple[str, str]]

A list of unique synonym pair samples from ontology class mappings.

Source code in src/deeponto/align/bertmap/text_semantics.py
def synonym_sampling_from_mappings(self):
    r"""Sample synonyms from cross-ontology class mappings.

    Arguments of this method are all class attributes.
    See [`CrossOntologyTextSemanticsCorpus`][deeponto.align.bertmap.text_semantics.CrossOntologyTextSemanticsCorpus].

    According to the $\textsf{BERTMap}$ paper, **cross-ontology synonyms** are defined as label pairs
    that belong to two **matched** classes. Suppose the class $C$ from the source ontology
    and the class $D$ from the target ontology are matched according to one of the `class_mappings`,
    then the cartesian product of labels of $C$ and labels of $D$ form cross-ontology synonyms.
    Note that **identity synonyms** in the form of $(a, a)$ are removed because they have been covered
    in the intra-ontology case.

    Returns:
        (List[Tuple[str, str]]): A list of unique synonym pair samples from ontology class mappings.
    """
    synonym_pool = []

    for class_mapping in self.class_mappings:
        src_class_iri, tgt_class_iri = class_mapping.to_tuple()
        src_class_annotations = self.src_thesaurus.annotation_index[src_class_iri]
        tgt_class_annotations = self.tgt_thesaurus.annotation_index[tgt_class_iri]
        synonym_pairs = list(itertools.product(src_class_annotations, tgt_class_annotations))
        # remove the identity synonyms as they have been covered in the intra-ontology case
        synonym_pairs = [(l, r) for l, r in synonym_pairs if l != r]
        backward_synonym_pairs = [(r, l) for l, r in synonym_pairs]
        synonym_pool += synonym_pairs + backward_synonym_pairs

    synonym_pool = uniqify(synonym_pool)
    return synonym_pool

nonsynonym_sampling_from_mappings(num_samples, max_iter=5)

Sample non-synonyms from cross-ontology class mappings.

Arguments of this method are all class attributes. See CrossOntologyTextSemanticsCorpus.

According to the \(\textsf{BERTMap}\) paper, cross-ontology non-synonyms are defined as label pairs that belong to two unmatched classes. Assuming that the provided class mappings are self-contained, in the sense that they are complete for the classes involved in them, we can randomly sample two cross-ontology classes that are not matched according to the mappings and take their labels as non-synonyms. In practice, false negatives are quite unlikely since the number of incorrect mappings is much larger than the number of correct ones.

Returns:

Type Description
List[Tuple[str, str]]

A list of unique nonsynonym pair samples from ontology class mappings.

Source code in src/deeponto/align/bertmap/text_semantics.py
def nonsynonym_sampling_from_mappings(self, num_samples: int, max_iter: int = 5):
    r"""Sample non-synonyms from cross-ontology class mappings.

    Arguments of this method are all class attributes.
    See [`CrossOntologyTextSemanticsCorpus`][deeponto.align.bertmap.text_semantics.CrossOntologyTextSemanticsCorpus].

    According to the $\textsf{BERTMap}$ paper, **cross-ontology non-synonyms** are defined as label pairs
    that belong to two **unmatched** classes. Assume that the provided class mappings are self-contained
    in the sense that they are complete for the classes involved in them, then we can randomly
    sample two cross-ontology classes that are not matched according to the mappings and take
    their labels as nonsynonyms. In practice, it is quite unlikely to obtain false negatives since
    the number of incorrect mappings is much larger than the number of correct ones.

    Returns:
        (List[Tuple[str, str]]): A list of unique nonsynonym pair samples from ontology class mappings.
    """
    nonsynonym_pool = []

    # form cross-ontology synonym groups
    cross_onto_synonym_group_pair = []
    for class_mapping in self.class_mappings:
        src_class_iri, tgt_class_iri = class_mapping.to_tuple()
        src_class_annotations = self.src_thesaurus.annotation_index[src_class_iri]
        tgt_class_annotations = self.tgt_thesaurus.annotation_index[tgt_class_iri]
        # let each matched class pair's annotations form a synonym group_pair
        cross_onto_synonym_group_pair.append((src_class_annotations, tgt_class_annotations))

    # randomly select disjoint synonym group pairs from all
    for _ in range(num_samples):
        left_class_pair, right_class_pair = tuple(random.sample(cross_onto_synonym_group_pair, 2))
        try:
            # randomly choose one label from a synonym group
            left_label = random.choice(list(left_class_pair[0]))  # choosing the src side by [0]
            right_label = random.choice(list(right_class_pair[1]))  # choosing the tgt side by [1]
            nonsynonym_pool.append((left_label, right_label))
        except:
            # skip if there are no class labels
            continue

    # DataUtils.uniqify is too slow so we should avoid operating it too often
    nonsynonym_pool = uniqify(nonsynonym_pool)
    while len(nonsynonym_pool) < num_samples and max_iter > 0:
        max_iter = max_iter - 1  # reduce the iteration to prevent exhausting loop
        nonsynonym_pool += self.nonsynonym_sampling_from_mappings(num_samples - len(nonsynonym_pool), max_iter)
        nonsynonym_pool = uniqify(nonsynonym_pool)
    return nonsynonym_pool

TextSemanticsCorpora(src_onto, tgt_onto, annotation_property_iris, class_mappings=None, auxiliary_ontos=None)

Class for creating the collection of text semantics corpora.

As defined in the \(\textsf{BERTMap}\) paper, the collection of text semantics corpora contains at least two intra-ontology sub-corpora from the source and target ontologies, respectively. If some class mappings are provided, then a cross-ontology sub-corpus will be created. If some additional auxiliary ontologies are provided, the intra-ontology corpora created from them will serve as the auxiliary sub-corpora.

Attributes:

Name Type Description
src_onto Ontology

The source ontology to be matched or aligned.

tgt_onto Ontology

The target ontology to be matched or aligned.

annotation_property_iris List[str]

A list of annotation property IRIs used to extract the annotations.

class_mappings List[ReferenceMapping]

A list of cross-ontology class mappings between the source and the target ontologies. Defaults to None.

auxiliary_ontos List[Ontology]

A list of auxiliary ontologies for augmenting more synonym/non-synonym samples. Defaults to None.

Source code in src/deeponto/align/bertmap/text_semantics.py
def __init__(
    self,
    src_onto: Ontology,
    tgt_onto: Ontology,
    annotation_property_iris: List[str],
    class_mappings: Optional[List[ReferenceMapping]] = None,
    auxiliary_ontos: Optional[List[Ontology]] = None,
):
    self.synonyms = []
    self.nonsynonyms = []

    # build intra-ontology corpora
    # negative sample ratios are by default
    self.intra_src_onto_corpus = IntraOntologyTextSemanticsCorpus(src_onto, annotation_property_iris)
    self.add_samples_from_sub_corpus(self.intra_src_onto_corpus)
    self.intra_tgt_onto_corpus = IntraOntologyTextSemanticsCorpus(tgt_onto, annotation_property_iris)
    self.add_samples_from_sub_corpus(self.intra_tgt_onto_corpus)

    # build cross-ontology corpora
    self.class_mappings = class_mappings
    self.cross_onto_corpus = None
    if self.class_mappings:
        self.cross_onto_corpus = CrossOntologyTextSemanticsCorpus(
            class_mappings, src_onto, tgt_onto, annotation_property_iris
        )
        self.add_samples_from_sub_corpus(self.cross_onto_corpus)

    # build auxiliary ontology corpora (same as intra-ontology)
    self.auxiliary_ontos = auxiliary_ontos
    self.auxiliary_onto_corpora = []
    if self.auxiliary_ontos:
        for auxiliary_onto in self.auxiliary_ontos:
            self.auxiliary_onto_corpora.append(
                IntraOntologyTextSemanticsCorpus(auxiliary_onto, annotation_property_iris)
            )
    for auxiliary_onto_corpus in self.auxiliary_onto_corpora:
        self.add_samples_from_sub_corpus(auxiliary_onto_corpus)

    # DataUtils.uniqify the samples
    self.synonyms = uniqify(self.synonyms)
    self.nonsynonyms = uniqify(self.nonsynonyms)
    # remove invalid nonsynonyms
    self.nonsynonyms = list(set(self.nonsynonyms) - set(self.synonyms))

    # summary
    self.info = {
        type(self).__name__: {
            "num_synonyms": len(self.synonyms),
            "num_nonsynonyms": len(self.nonsynonyms),
            "intra_src_onto_corpus": self.intra_src_onto_corpus.info["IntraOntologyTextSemanticsCorpus"],
            "intra_tgt_onto_corpus": self.intra_tgt_onto_corpus.info["IntraOntologyTextSemanticsCorpus"],
            "cross_onto_corpus": self.cross_onto_corpus.info["CrossOntologyTextSemanticsCorpus"]
            if self.cross_onto_corpus
            else None,
            "auxiliary_onto_corpora": [
                a.info["IntraOntologyTextSemanticsCorpus"] for a in self.auxiliary_onto_corpora
            ],
        }
    }

save(save_path)

Save the overall text semantics corpora (a .json file for label pairs and its summary) in the specified directory.

Source code in src/deeponto/align/bertmap/text_semantics.py
def save(self, save_path: str):
    """Save the overall text semantics corpora (a `.json` file for label pairs
    and its summary) in the specified directory.
    """
    create_path(save_path)
    save_json = {
        "summary": self.info,
        "synonyms": [(pos[0], pos[1], 1) for pos in self.synonyms],
        "nonsynonyms": [(neg[0], neg[1], 0) for neg in self.nonsynonyms],
    }
    save_file(save_json, os.path.join(save_path, "text-semantics.corpora.json"))

add_samples_from_sub_corpus(sub_corpus)

Add synonyms and non-synonyms from each sub-corpus to the overall collection.

Source code in src/deeponto/align/bertmap/text_semantics.py
def add_samples_from_sub_corpus(
    self, sub_corpus: Union[IntraOntologyTextSemanticsCorpus, CrossOntologyTextSemanticsCorpus]
):
    """Add synonyms and non-synonyms from each sub-corpus to the overall collection."""
    self.synonyms += sub_corpus.synonyms
    if isinstance(sub_corpus, IntraOntologyTextSemanticsCorpus):
        self.nonsynonyms += sub_corpus.soft_nonsynonyms + sub_corpus.hard_nonsynonyms
    else:
        self.nonsynonyms += sub_corpus.nonsynonyms

BERTSynonymClassifier(loaded_path, output_path, eval_mode, max_length_for_input, num_epochs_for_training=None, batch_size_for_training=None, batch_size_for_prediction=None, training_data=None, validation_data=None)

Class for BERT synonym classifier.

The main scoring module of \(\textsf{BERTMap}\) consisting of a BERT model and a binary synonym classifier.

Attributes:

Name Type Description
loaded_path str

The path to the checkpoint of a pre-trained BERT model.

output_path str

The path to the output BERT model (usually fine-tuned).

eval_mode bool

Set to False if the model is loaded for training.

max_length_for_input int

The maximum length of an input sequence.

num_epochs_for_training int

The number of epochs for training a BERT model.

batch_size_for_training int

The batch size for training a BERT model.

batch_size_for_prediction int

The batch size for making predictions.

training_data Dataset

Data for training the model when eval_mode is set to False. Defaults to None.

validation_data Dataset

Data for validating the model when eval_mode is set to False. Defaults to None.

training_args TrainingArguments

Training arguments for training the model when eval_mode is set to False. Defaults to None.

trainer Trainer

The model trainer fed with training_args and data samples. Defaults to None.

softmax torch.nn.Softmax

The softmax layer used for normalising synonym scores. Defaults to None.

Source code in src/deeponto/align/bertmap/bert_classifier.py
def __init__(
    self,
    loaded_path: str,
    output_path: str,
    eval_mode: bool,
    max_length_for_input: int,
    num_epochs_for_training: Optional[float] = None,
    batch_size_for_training: Optional[int] = None,
    batch_size_for_prediction: Optional[int] = None,
    training_data: Optional[List[Tuple[str, str, int]]] = None,  # (sentence1, sentence2, label)
    validation_data: Optional[List[Tuple[str, str, int]]] = None,
):
    # Load the pretrained BERT model from the given path
    self.loaded_path = loaded_path
    print(f"Loading a BERT model from: {self.loaded_path}.")
    self.model = AutoModelForSequenceClassification.from_pretrained(
        self.loaded_path, output_hidden_states=eval_mode
    )
    self.tokenizer = Tokenizer.from_pretrained(loaded_path)

    self.output_path = output_path
    self.eval_mode = eval_mode
    self.max_length_for_input = max_length_for_input
    self.num_epochs_for_training = num_epochs_for_training
    self.batch_size_for_training = batch_size_for_training
    self.batch_size_for_prediction = batch_size_for_prediction
    self.training_data = None
    self.validation_data = None
    self.data_stat = {}
    self.training_args = None
    self.trainer = None
    self.softmax = None

    # load the pre-trained BERT model and set it to eval mode (static)
    if self.eval_mode:
        self.eval()
    # load the pre-trained BERT model for fine-tuning
    else:
        if not training_data:
            raise RuntimeError("Training data should be provided when `eval_mode` is `False`.")
        if not validation_data:
            raise RuntimeError("Validation data should be provided when `eval_mode` is `False`.")
        # load data (max_length is used for truncation)
        self.training_data = self.load_dataset(training_data, "training")
        self.validation_data = self.load_dataset(validation_data, "validation")
        self.data_stat = {
            "num_training": len(self.training_data),
            "num_validation": len(self.validation_data),
        }

        # generate training arguments
        epoch_steps = len(self.training_data) // self.batch_size_for_training  # total steps of an epoch
        if torch.cuda.device_count() > 0:
            epoch_steps = epoch_steps // torch.cuda.device_count()  # to deal with the multi-GPU case
        # keep logging steps consistent even for small batch sizes
        # report logging on every 0.02 epoch
        logging_steps = int(epoch_steps * 0.02)
        # eval on every 0.2 epoch
        eval_steps = 10 * logging_steps
        # generate the training arguments
        self.training_args = TrainingArguments(
            output_dir=self.output_path,
            num_train_epochs=self.num_epochs_for_training,
            per_device_train_batch_size=self.batch_size_for_training,
            per_device_eval_batch_size=self.batch_size_for_training,
            warmup_ratio=0.0,
            weight_decay=0.01,
            logging_steps=logging_steps,
            logging_dir=f"{self.output_path}/tensorboard",
            eval_steps=eval_steps,
            evaluation_strategy="steps",
            do_train=True,
            do_eval=True,
            save_steps=eval_steps,
            save_total_limit=2,
            load_best_model_at_end=True,
        )
        # build the trainer
        self.trainer = Trainer(
            model=self.model,
            args=self.training_args,
            train_dataset=self.training_data,
            eval_dataset=self.validation_data,
            compute_metrics=self.compute_metrics,
            tokenizer=self.tokenizer._tokenizer,
        )
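
A minimal sketch of loading the classifier for inference (eval mode); the checkpoint name and output path are placeholders, and in the full pipeline this construction is handled by BERTMapPipeline.

bert = BERTSynonymClassifier(
    loaded_path="bert-base-uncased",      # or a path to a fine-tuned checkpoint
    output_path="./bertmap_output/bert",
    eval_mode=True,                       # no training/validation data needed
    max_length_for_input=128,
    batch_size_for_prediction=32,
)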

train(resume_from_checkpoint=None)

Start training the BERT model.

Source code in src/deeponto/align/bertmap/bert_classifier.py
def train(self, resume_from_checkpoint: Optional[Union[bool, str]] = None):
    """Start training the BERT model."""
    if self.eval_mode:
        raise RuntimeError("Training cannot be started in `eval` mode.")
    self.trainer.train(resume_from_checkpoint=resume_from_checkpoint)

eval()

Switch the model to eval mode (for making predictions).

Source code in src/deeponto/align/bertmap/bert_classifier.py
def eval(self):
    """To eval mode."""
    print("The BERT model is set to eval mode for making predictions.")
    self.model.eval()
    # TODO: to implement multi-gpus for inference
    self.device = self.get_device(device_num=0)
    self.model.to(self.device)
    self.softmax = torch.nn.Softmax(dim=1).to(self.device)

predict(sent_pairs)

Run prediction pipeline for synonym classification.

Return the softmax probabilities of predicting pairs as synonyms (index=1).

Source code in src/deeponto/align/bertmap/bert_classifier.py
def predict(self, sent_pairs: List[Tuple[str, str]]):
    r"""Run prediction pipeline for synonym classification.

    Return the `softmax` probabilities of predicting pairs as synonyms (`index=1`).
    """
    inputs = self.process_inputs(sent_pairs)
    with torch.no_grad():
        return self.softmax(self.model(**inputs).logits)[:, 1]
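
A small usage sketch, assuming bert is a BERTSynonymClassifier loaded in eval mode (the scores in the comment are illustrative only).

pairs = [
    ("heart disease", "cardiac disease"),
    ("heart disease", "kidney stone"),
]
synonym_scores = bert.predict(pairs)  # 1-D tensor of synonym probabilities
print(synonym_scores)                 # e.g. tensor([0.97, 0.02]) -- illustrative values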

load_dataset(data, split)

Load the list of (annotation1, annotation2, label) samples into a datasets.Dataset.

Source code in src/deeponto/align/bertmap/bert_classifier.py
def load_dataset(self, data: List[Tuple[str, str, int]], split: str) -> Dataset:
    r"""Load the list of `(annotation1, annotation2, label)` samples into a `datasets.Dataset`."""

    def iterate():
        for sample in data:
            yield {"annotation1": sample[0], "annotation2": sample[1], "labels": sample[2]}

    dataset = Dataset.from_generator(iterate)
    # NOTE: no padding here because the Trainer class supports dynamic padding
    dataset = dataset.map(
        lambda examples: self.tokenizer._tokenizer(
            examples["annotation1"], examples["annotation2"], max_length=self.max_length_for_input, truncation=True
        ),
        batched=True,
        desc=f"Load {split} data:",
    )
    return dataset

process_inputs(sent_pairs)

Process input sentence pairs for the BERT model.

Transform the sentences into BERT input embeddings and load them into the device. This function is called only when the BERT model is about to make predictions (eval mode).

Source code in src/deeponto/align/bertmap/bert_classifier.py
def process_inputs(self, sent_pairs: List[Tuple[str, str]]):
    r"""Process input sentence pairs for the BERT model.

    Transform the sentences into BERT input embeddings and load them into the device.
    This function is called only when the BERT model is about to make predictions (`eval` mode).
    """
    return self.tokenizer._tokenizer(
        sent_pairs,
        return_tensors="pt",
        max_length=self.max_length_for_input,
        padding=True,
        truncation=True,
    ).to(self.device)

compute_metrics(pred) staticmethod

Add more evaluation metrics into the training log.

Source code in src/deeponto/align/bertmap/bert_classifier.py
@staticmethod
def compute_metrics(pred):
    """Add more evaluation metrics into the training log."""
    # TODO: currently only accuracy is added, will expect more in the future if needed
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}

get_device(device_num=0) staticmethod

Get a device (GPU or CPU) for the torch model.

Source code in src/deeponto/align/bertmap/bert_classifier.py
@staticmethod
def get_device(device_num: int = 0):
    """Get a device (GPU or CPU) for the torch model"""
    # If there's a GPU available...
    if torch.cuda.is_available():
        # Tell PyTorch to use the GPU.
        device = torch.device(f"cuda:{device_num}")
        print("There are %d GPU(s) available." % torch.cuda.device_count())
        print("We will use the GPU:", torch.cuda.get_device_name(device_num))
    # If not...
    else:
        print("No GPU available, using the CPU instead.")
        device = torch.device("cpu")
    return device

set_seed(seed_val=888) staticmethod

Set random seed for reproducible results.

Source code in src/deeponto/align/bertmap/bert_classifier.py
@staticmethod
def set_seed(seed_val: int = 888):
    """Set random seed for reproducible results."""
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

MappingPredictor(output_path, tokenizer_path, src_annotation_index, tgt_annotation_index, bert_synonym_classifier, num_raw_candidates, num_best_predictions, batch_size_for_prediction, logger, enlighten_manager, enlighten_status, ignored_class_index=None)

Class for the mapping prediction module of \(\textsf{BERTMap}\) and \(\textsf{BERTMapLt}\) models.

Attributes:

Name Type Description
tokenizer Tokenizer

The tokenizer used for constructing the inverted annotation index and candidate selection.

src_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from src_onto according to annotation_property_iris.

tgt_annotation_index dict

A dictionary that stores the (class_iri, class_annotations) pairs from tgt_onto according to annotation_property_iris.

tgt_inverted_annotation_index InvertedIndex

The inverted index built from tgt_annotation_index used for target class candidate selection.

bert_synonym_classifier BERTSynonymClassifier

The BERT synonym classifier fine-tuned on text semantics corpora.

num_raw_candidates int

The maximum number of selected target class candidates for a source class.

num_best_predictions int

The maximum number of best-scored mappings preserved for a source class.

batch_size_for_prediction int

The batch size of class annotation pairs for computing synonym scores.

ignored_class_index dict

OAEI argument; a dictionary that stores the (class_iri, used_in_alignment) pairs.

Source code in src/deeponto/align/bertmap/mapping_prediction.py
def __init__(
    self,
    output_path: str,
    tokenizer_path: str,
    src_annotation_index: dict,
    tgt_annotation_index: dict,
    bert_synonym_classifier: Optional[BERTSynonymClassifier],
    num_raw_candidates: Optional[int],
    num_best_predictions: Optional[int],
    batch_size_for_prediction: int,
    logger: Logger,
    enlighten_manager: enlighten.Manager,
    enlighten_status: enlighten.StatusBar,
    ignored_class_index: Optional[dict] = None,
):
    self.logger = logger
    self.enlighten_manager = enlighten_manager
    self.enlighten_status = enlighten_status

    self.tokenizer = Tokenizer.from_pretrained(tokenizer_path)

    self.logger.info("Build inverted annotation index for candidate selection.")
    self.src_annotation_index = src_annotation_index
    self.tgt_annotation_index = tgt_annotation_index
    self.tgt_inverted_annotation_index = Ontology.build_inverted_annotation_index(
        tgt_annotation_index, self.tokenizer
    )
    # the fundamental judgement for whether bertmap or bertmaplt is loaded
    self.bert_synonym_classifier = bert_synonym_classifier
    self.num_raw_candidates = num_raw_candidates
    self.num_best_predictions = num_best_predictions
    self.batch_size_for_prediction = batch_size_for_prediction
    self.output_path = output_path

    # for the OAEI, adding in check for classes that are not used in alignment
    self.ignored_class_index = ignored_class_index

    self.init_class_mapping = lambda head, tail, score: EntityMapping(head, tail, "<EquivalentTo>", score)

bert_mapping_score(src_class_annotations, tgt_class_annotations)

\(\textsf{BERTMap}\)'s main mapping score module which utilises the fine-tuned BERT synonym classifier.

Compute the synonym score for each pair of src-tgt class annotations, and return the average score as the mapping score. Apply string matching before applying the BERT module to filter easy mappings (with scores \(1.0\)).

Source code in src/deeponto/align/bertmap/mapping_prediction.py
def bert_mapping_score(
    self,
    src_class_annotations: Set[str],
    tgt_class_annotations: Set[str],
):
    r"""$\textsf{BERTMap}$'s main mapping score module which utilises the fine-tuned BERT synonym
    classifier.

    Compute the **synonym score** for each pair of src-tgt class annotations, and return
    the **average** score as the mapping score. Apply string matching before applying the
    BERT module to filter easy mappings (with scores $1.0$).
    """

    if not src_class_annotations or not tgt_class_annotations:
        warnings.warn("Return zero score due to empty input class annotations...")
        return 0.0

    # apply string matching before applying the bert module
    prelim_score = self.edit_similarity_mapping_score(
        src_class_annotations,
        tgt_class_annotations,
        string_match_only=True,
    )
    if prelim_score == 1.0:
        return prelim_score
    # apply BERT classifier and define mapping score := Average(SynonymScores)
    class_annotation_pairs = list(itertools.product(src_class_annotations, tgt_class_annotations))
    synonym_scores = self.bert_synonym_classifier.predict(class_annotation_pairs)
    # only one element tensor is able to be extracted as a scalar by .item()
    return float(torch.mean(synonym_scores).item())
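
A usage sketch, assuming predictor is a configured MappingPredictor with a fine-tuned BERT synonym classifier; the annotations are illustrative.

src_annotations = {"myocardial infarction", "heart attack"}
tgt_annotations = {"heart attack", "MI"}
score = predictor.bert_mapping_score(src_annotations, tgt_annotations)
# the shared annotation "heart attack" triggers the string-matching shortcut,
# so the returned mapping score is 1.0; otherwise the score is the average
# BERT synonym probability over all annotation pairs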

edit_similarity_mapping_score(src_class_annotations, tgt_class_annotations, string_match_only=False) staticmethod

\(\textsf{BERTMap}\)'s string match module and \(\textsf{BERTMapLt}\)'s mapping prediction function.

Compute the normalised edit similarity (1 - normalised edit distance) for each pair of src-tgt class annotations, and return the maximum score as the mapping score.

Source code in src/deeponto/align/bertmap/mapping_prediction.py
@staticmethod
def edit_similarity_mapping_score(
    src_class_annotations: Set[str],
    tgt_class_annotations: Set[str],
    string_match_only: bool = False,
):
    r"""$\textsf{BERTMap}$'s string match module and $\textsf{BERTMapLt}$'s mapping prediction function.

    Compute the **normalised edit similarity** `(1 - normalised edit distance)` for each pair
    of src-tgt class annotations, and return the **maximum** score as the mapping score.
    """

    if not src_class_annotations or not tgt_class_annotations:
        warnings.warn("Return zero score due to empty input class annotations...")
        return 0.0

    # edge case when src and tgt classes have an exact match of annotation
    if len(src_class_annotations.intersection(tgt_class_annotations)) > 0:
        return 1.0
    # a shortcut to save time for $\textsf{BERTMap}$
    if string_match_only:
        return 0.0
    annotation_pairs = itertools.product(src_class_annotations, tgt_class_annotations)
    sim_scores = [levenshtein.normalized_similarity(src, tgt) for src, tgt in annotation_pairs]
    return max(sim_scores)
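
Since this is a staticmethod, it can be called without a predictor instance; a small sketch with illustrative class labels.

score = MappingPredictor.edit_similarity_mapping_score(
    {"hepatocellular carcinoma"},
    {"hepatocellular cancer"},
)
# the score is the maximum normalised edit similarity over all annotation pairs;
# identical annotations give 1.0, unrelated strings a value close to 0.0
print(score)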

mapping_prediction_for_src_class(src_class_iri)

Predict \(N\) best scored mappings for a source ontology class, where \(N\) is specified in self.num_best_predictions.

  1. Apply the string matching module to compute "easy" mappings.
  2. Return the mappings if any are found, or if there is no BERT synonym classifier (as in \(\textsf{BERTMapLt}\)).
  3. If using the BERT synonym classifier module:

    • Generate batches of class annotation pairs. Each batch contains the combinations of the source class annotations and \(M\) target candidate classes' annotations. \(M\) is determined by batch_size_for_prediction, i.e., stop adding annotations of a target class candidate to the current batch if doing so would cause the batch size to exceed the limit.
    • Compute the synonym scores for each batch and aggregate them into mapping scores; preserve the \(N\) best-scored candidates and update them with the next batch. Through this dynamic process, we eventually obtain the \(N\) best-scored mappings for a source ontology class.
Source code in src/deeponto/align/bertmap/mapping_prediction.py
def mapping_prediction_for_src_class(self, src_class_iri: str) -> List[EntityMapping]:
    r"""Predict $N$ best scored mappings for a source ontology class, where
    $N$ is specified in `self.num_best_predictions`.

    1. Apply the **string matching** module to compute "easy" mappings.
    2. Return the mappings if found any, or if there is no BERT synonym classifier
    as in $\textsf{BERTMapLt}$.
    3. If using the BERT synonym classifier module:

        - Generate batches for class annotation pairs. Each batch contains the combinations of the
        source class annotations and $M$ target candidate classes' annotations. $M$ is determined
        by `batch_size_for_prediction`, i.e., stop adding annotations of a target class candidate into
        the current batch if this operation will cause the size of current batch to exceed the limit.
        - Compute the synonym scores for each batch and aggregate them into mapping scores; preserve
        $N$ best scored candidates and update them in the next batch. By this dynamic process, we eventually
        get $N$ best scored mappings for a source ontology class.
    """

    src_class_annotations = self.src_annotation_index[src_class_iri]
    # NOTE: use the inverted annotation index for candidate selection (a previous version wrongly passed the tokenizer here)
    tgt_class_candidates = self.tgt_inverted_annotation_index.idf_select(
        list(src_class_annotations), pool_size=len(self.tgt_annotation_index.keys())
    )  # [(tgt_class_iri, idf_score)]
    # if some classes are set to be ignored, remove them from the candidates
    if self.ignored_class_index:
        tgt_class_candidates = [(iri, idf_score) for iri, idf_score in tgt_class_candidates if not self.ignored_class_index[iri]]
    # select a truncated number of candidates
    tgt_class_candidates = tgt_class_candidates[:self.num_raw_candidates]
    best_scored_mappings = []

    # for string matching: save time if already found string-matched candidates
    def string_match():
        """Compute string-matched mappings."""
        string_matched_mappings = []
        for tgt_candidate_iri, _ in tgt_class_candidates:
            tgt_candidate_annotations = self.tgt_annotation_index[tgt_candidate_iri]
            prelim_score = self.edit_similarity_mapping_score(
                src_class_annotations,
                tgt_candidate_annotations,
                string_match_only=True,
            )
            if prelim_score > 0.0:
                # if src_class_annotations.intersection(tgt_candidate_annotations):
                string_matched_mappings.append(
                    self.init_class_mapping(src_class_iri, tgt_candidate_iri, prelim_score)
                )

        return string_matched_mappings

    best_scored_mappings += string_match()
    # return string-matched mappings if found or if there is no bert module (bertmaplt)
    if best_scored_mappings or not self.bert_synonym_classifier:
        self.logger.info(f"The best scored class mappings for {src_class_iri} are\n{best_scored_mappings}")
        return best_scored_mappings

    def generate_batched_annotations(batch_size: int):
        """Generate batches of class annotations for the input source class and its
        target candidates.
        """
        batches = []
        # the `nums` parameter determines how the annotations are grouped
        current_batch = CfgNode({"annotations": [], "nums": []})
        for i, (tgt_candidate_iri, _) in enumerate(tgt_class_candidates):
            tgt_candidate_annotations = self.tgt_annotation_index[tgt_candidate_iri]
            annotation_pairs = list(itertools.product(src_class_annotations, tgt_candidate_annotations))
            current_batch.annotations += annotation_pairs
            num_annotation_pairs = len(annotation_pairs)
            current_batch.nums.append(num_annotation_pairs)
            # collect when the batch is full or for the last target class candidate
            if sum(current_batch.nums) > batch_size or i == len(tgt_class_candidates) - 1:
                batches.append(current_batch)
                current_batch = CfgNode({"annotations": [], "nums": []})
        return batches

    def bert_match():
        """Compute mappings with fine-tuned BERT synonym classifier."""
        bert_matched_mappings = []
        class_annotation_batches = generate_batched_annotations(self.batch_size_for_prediction)
        batch_base_candidate_idx = (
            0  # after each batch, the base index will be increased by # of covered target candidates
        )
        device = self.bert_synonym_classifier.device

        # initialize N prediction scores and N corresponding indices w.r.t `tgt_class_candidates`
        final_best_scores = torch.tensor([-1] * self.num_best_predictions).to(device)
        final_best_idxs = torch.tensor([-1] * self.num_best_predictions).to(device)

        for annotation_batch in class_annotation_batches:

            synonym_scores = self.bert_synonym_classifier.predict(annotation_batch.annotations)
            # aggregate synonym scores into mapping scores
            grouped_synonym_scores = torch.split(
                synonym_scores,
                split_size_or_sections=annotation_batch.nums,
            )
            mapping_scores = torch.stack([torch.mean(chunk) for chunk in grouped_synonym_scores])
            assert len(mapping_scores) == len(annotation_batch.nums)

            # preserve N best scored mappings
            # scale N in case there are fewer than N tgt candidates in this batch
            N = min(len(mapping_scores), self.num_best_predictions)
            batch_best_scores, batch_best_idxs = torch.topk(mapping_scores, k=N)
            batch_best_idxs += batch_base_candidate_idx

            # we do the substitution for every batch to prevent memory overflow
            final_best_scores, _idxs = torch.topk(
                torch.cat([batch_best_scores, final_best_scores]),
                k=self.num_best_predictions,
            )
            final_best_idxs = torch.cat([batch_best_idxs, final_best_idxs])[_idxs]

            # update the index for target candidate classes
            batch_base_candidate_idx += len(annotation_batch.nums)

        for candidate_idx, mapping_score in zip(final_best_idxs, final_best_scores):
            # ignore initial values (-1.0) for dummy mappings
            # the threshold 0.9 is for mapping extension
            if mapping_score.item() >= 0.9:
                tgt_candidate_iri = tgt_class_candidates[candidate_idx.item()][0]
                bert_matched_mappings.append(
                    self.init_class_mapping(
                        src_class_iri,
                        tgt_candidate_iri,
                        mapping_score.item(),
                    )
                )

        assert len(bert_matched_mappings) <= self.num_best_predictions
        self.logger.info(f"The best scored class mappings for {src_class_iri} are\n{bert_matched_mappings}")
        return bert_matched_mappings

    return bert_match()
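
A usage sketch, assuming predictor is a configured MappingPredictor; the class IRI is a placeholder.

mappings = predictor.mapping_prediction_for_src_class("http://example.org/onto/Class_001")
for m in mappings:
    print(m.to_tuple(with_score=True))  # (src_class_iri, tgt_class_iri, score)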

mapping_prediction()

Apply global matching for each class in the source ontology.

See mapping_prediction_for_src_class.

If this process is accidentally stopped, it can be resumed from already saved predictions. The progress bar keeps track of the number of source ontology classes that have been matched.

Source code in src/deeponto/align/bertmap/mapping_prediction.py
def mapping_prediction(self):
    r"""Apply global matching for each class in the source ontology.

    See [`mapping_prediction_for_src_class`][deeponto.align.bertmap.mapping_prediction.MappingPredictor.mapping_prediction_for_src_class].

    If this process is accidentally stopped, it can be resumed from already saved predictions. The progress
    bar keeps track of the number of source ontology classes that have been matched.
    """
    self.logger.info("Start global matching for each class in the source ontology.")

    match_dir = os.path.join(self.output_path, "match")
    try:
        mapping_index = load_file(os.path.join(match_dir, "raw_mappings.json"))
        self.logger.info("Load the existing mapping prediction file.")
    except:
        mapping_index = dict()
        create_path(match_dir)

    progress_bar = self.enlighten_manager.counter(
        total=len(self.src_annotation_index), desc="Mapping Prediction", unit="per src class"
    )
    self.enlighten_status.update(demo="Mapping Prediction")

    for i, src_class_iri in enumerate(self.src_annotation_index.keys()):
        # skip computed classes
        if src_class_iri in mapping_index.keys():
            self.logger.info(f"[Class {i}] Skip matching {src_class_iri} as already computed.")
            progress_bar.update()
            continue
        # for OAEI
        if self.ignored_class_index and self.ignored_class_index[src_class_iri]:
            self.logger.info(f"[Class {i}] Skip matching {src_class_iri} as marked as not used in alignment.")
            progress_bar.update()
            continue
        mappings = self.mapping_prediction_for_src_class(src_class_iri)
        mapping_index[src_class_iri] = [m.to_tuple(with_score=True) for m in mappings]

        if i % 100 == 0 or i == len(self.src_annotation_index) - 1:
            save_file(mapping_index, os.path.join(match_dir, "raw_mappings.json"))
            # also save a .tsv version
            mapping_in_tuples = list(itertools.chain.from_iterable(mapping_index.values()))
            mapping_df = pd.DataFrame(mapping_in_tuples, columns=["SrcEntity", "TgtEntity", "Score"])
            mapping_df.to_csv(os.path.join(match_dir, "raw_mappings.tsv"), sep="\t", index=False)
            self.logger.info("Save currently computed mappings to prevent undesirable loss.")

        progress_bar.update()

    self.logger.info("Finished mapping prediction for each class in the source ontology.")
    progress_bar.close()

MappingRefiner(output_path, src_onto, tgt_onto, mapping_predictor, mapping_extension_threshold, mapping_filtered_threshold, logger, enlighten_manager, enlighten_status)

Class for the mapping refinement module of \(\textsf{BERTMap}\).

\(\textsf{BERTMapLt}\) does not go through mapping refinement because it is designed to be "light". All the attributes of this class are supposed to be passed from BERTMapPipeline.

Attributes:

Name Type Description
src_onto Ontology

The source ontology to be matched.

tgt_onto Ontology

The target ontology to be matched.

mapping_predictor MappingPredictor

The mapping prediction module of BERTMap.

mapping_extension_threshold float

Mappings with scores \(\geq\) this value will be considered in the iterative mapping extension process.

raw_mappings List[EntityMapping]

List of raw class mappings predicted in the global matching phase.

mapping_score_dict dict

A dynamic dictionary that keeps track of mappings (with scores) that have already been computed.

mapping_filter_threshold float

Mappings with scores \(\geq\) this value will be preserved for the final mapping repair.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def __init__(
    self,
    output_path: str,
    src_onto: Ontology,
    tgt_onto: Ontology,
    mapping_predictor: MappingPredictor,
    mapping_extension_threshold: float,
    mapping_filtered_threshold: float,
    logger: Logger,
    enlighten_manager: enlighten.Manager,
    enlighten_status: enlighten.StatusBar
):
    self.output_path = output_path
    self.logger = logger
    self.enlighten_manager = enlighten_manager
    self.enlighten_status = enlighten_status

    self.src_onto = src_onto
    self.tgt_onto = tgt_onto

    # iterative mapping extension
    self.mapping_predictor = mapping_predictor
    self.mapping_extension_threshold = mapping_extension_threshold  # \kappa
    self.raw_mappings = EntityMapping.read_table_mappings(
        os.path.join(self.output_path, "match", "raw_mappings.tsv"),
        threshold=self.mapping_extension_threshold,
        relation="<EquivalentTo>",
    )
    # keep track of already scored mappings to prevent duplicated predictions
    self.mapping_score_dict = dict()
    for m in self.raw_mappings:
        src_class_iri, tgt_class_iri, score = m.to_tuple(with_score=True)
        self.mapping_score_dict[(src_class_iri, tgt_class_iri)] = score

    # the threshold for final filtering the extended mappings
    self.mapping_filtered_threshold = mapping_filtered_threshold  # \lambda

    # logmap mapping repair folder
    self.logmap_repair_path = os.path.join(self.output_path, "match", "logmap-repair")

    # paths for mapping extension and repair
    self.extended_mapping_path = os.path.join(self.output_path, "match", "extended_mappings.tsv")
    self.filtered_mapping_path = os.path.join(self.output_path, "match", "filtered_mappings.tsv")
    self.repaired_mapping_path = os.path.join(self.output_path, "match", "repaired_mappings.tsv")

mapping_extension(max_iter=10)

Iterative mapping extension based on the locality principle.

For each class pair \((c, c')\) (scored in the global matching phase) with score \(\geq \kappa\), search for plausible mappings between the parents of \(c\) and \(c'\), and between the children of \(c\) and \(c'\). This is an iterative process, as the set of newly discovered mappings can renew the frontier for searching. The process terminates if no new mappings with score \(\geq \kappa\) can be found or the limit max_iter has been reached. Note that \(\kappa\) is set to \(0.9\) by default (can be altered in the configuration file). The mapping extension progress bar keeps track of the total number of extended mappings (including the previously predicted ones).

A further filtering step preserves only mappings with score \(\geq \lambda\). In the original BERTMap paper, \(\lambda\) is determined by the validation mappings, but in practice \(\lambda\) is not a sensitive hyperparameter and validation mappings are often not available. Therefore, we manually set \(\lambda\) to \(0.9995\) by default (can be altered in the configuration file). The mapping filtering progress bar keeps track of the total number of filtered mappings (this bar is purely for logging purposes).

Parameters:

Name Type Description Default
max_iter int

The maximum number of mapping extension iterations. Defaults to 10.

10
Source code in src/deeponto/align/bertmap/mapping_refinement.py
def mapping_extension(self, max_iter: int = 10):
    r"""Iterative mapping extension based on the locality principle.

    For each class pair $(c, c')$ (scored in the global matching phase) with score 
    $\geq \kappa$, search for plausible mappings between the parents of $c$ and $c'$,
    and between the children of $c$ and $c'$. This is an iterative process, as the set of
    newly discovered mappings can renew the frontier for searching. Terminate if
    no new mappings with score $\geq \kappa$ can be found or the limit `max_iter` has 
    been reached. Note that $\kappa$ is set to $0.9$ by default (can be altered
    in the configuration file). The mapping extension progress bar keeps track of the 
    total number of extended mappings (including the previously predicted ones).

    A further filtering will be performed by only preserving mappings with score $\geq \lambda$.
    In the original BERTMap paper, $\lambda$ is determined by the validation mappings, but
    in practice $\lambda$ is not a sensitive hyperparameter and validation mappings are often
    not available. Therefore, we manually set $\lambda$ to $0.9995$ by default (can be altered
    in the configuration file). The mapping filtering progress bar keeps track of the 
    total number of filtered mappings (this bar is purely for logging purposes).

    Args:
        max_iter (int, optional): The maximum number of mapping extension iterations. Defaults to `10`.
    """

    num_iter = 0
    self.enlighten_status.update(demo="Mapping Extension")
    extension_progress_bar = self.enlighten_manager.counter(
        desc=f"Mapping Extension [Iteration #{num_iter}]", unit="mapping"
    )
    filtering_progress_bar = self.enlighten_manager.counter(
        desc=f"Mapping Filtering", unit="mapping"
    )

    if os.path.exists(self.extended_mapping_path) and os.path.exists(self.filtered_mapping_path):
        self.logger.info(
            f"Found extended and filtered mapping files at {self.extended_mapping_path}"
            + f" and {self.filtered_mapping_path}.\nPlease check file integrity; if incomplete, "
            + "delete them and re-run the program."
        )

        # for animation purposes
        extension_progress_bar.desc = f"Mapping Extension"
        for _ in EntityMapping.read_table_mappings(self.extended_mapping_path):
            extension_progress_bar.update()

        self.enlighten_status.update(demo="Mapping Filtering")
        for _ in EntityMapping.read_table_mappings(self.filtered_mapping_path):
            filtering_progress_bar.update()

        extension_progress_bar.close()
        filtering_progress_bar.close()

        return
    # initialise the frontier and expansion sets with the raw mappings
    # NOTE be careful of address pointers
    frontier = [m.to_tuple() for m in self.raw_mappings]
    expansion = [m.to_tuple(with_score=True) for m in self.raw_mappings]
    # for animation purposes
    for _ in range(len(expansion)):
        extension_progress_bar.update()

    self.logger.info(
        f"Start mapping extension for each class pair with score >= {self.mapping_extension_threshold}."
    )
    while frontier and num_iter < max_iter:
        new_mappings = []
        for src_class_iri, tgt_class_iri in frontier:
            # one hop extension makes sure new mappings are really "new"
            cur_new_mappings = self.one_hop_extend(src_class_iri, tgt_class_iri)
            extension_progress_bar.update(len(cur_new_mappings))
            new_mappings += cur_new_mappings
        # add new mappings to the expansion set
        expansion += new_mappings
        # renew frontier with the newly discovered mappings
        frontier = [(x, y) for x, y, _ in new_mappings]

        self.logger.info(f"Add {len(new_mappings)} mappings at iteration #{num_iter}.")
        num_iter += 1
        extension_progress_bar.desc = f"Mapping Extension [Iteration #{num_iter}]"

    num_extended = len(expansion) - len(self.raw_mappings)
    self.logger.info(
        f"Finished iterative mapping extension with {num_extended} new mappings and in total {len(expansion)} extended mappings."
    )

    extended_mapping_df = pd.DataFrame(expansion, columns=["SrcEntity", "TgtEntity", "Score"])
    extended_mapping_df.to_csv(self.extended_mapping_path, sep="\t", index=False)

    self.enlighten_status.update(demo="Mapping Filtering")

    filtered_expansion = [
        (src, tgt, score) for src, tgt, score in expansion if score >= self.mapping_filtered_threshold
    ]
    self.logger.info(
        f"Filtered the extended mappings by a threshold of {self.mapping_filtered_threshold}."
        + f"There are {len(filtered_expansion)} mappings left for mapping repair."
    )

    for _ in range(len(filtered_expansion)):
        filtering_progress_bar.update()

    filtered_mapping_df = pd.DataFrame(filtered_expansion, columns=["SrcEntity", "TgtEntity", "Score"])
    filtered_mapping_df.to_csv(self.filtered_mapping_path, sep="\t", index=False)

    extension_progress_bar.close()
    filtering_progress_bar.close()
    return filtered_expansion
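
A usage sketch, assuming refiner is a configured MappingRefiner; the extended and filtered mappings are also written to extended_mappings.tsv and filtered_mappings.tsv under the match sub-folder of the output path.

filtered = refiner.mapping_extension(max_iter=10)
# `filtered` is a list of (src_iri, tgt_iri, score) tuples with score >= the filtering
# threshold, or None if previously saved result files were found and reused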

one_hop_extend(src_class_iri, tgt_class_iri, pool_size=200)

Extend mappings from a scored class pair \((c, c')\) by searching from one-hop neighbors.

Search for plausible mappings between the parents of \(c\) and \(c'\), and between the children of \(c\) and \(c'\). Mappings that are not already computed (recorded in self.mapping_score_dict) and have a score \(\geq\) self.mapping_extension_threshold will be returned as new mappings.

Parameters:

Name Type Description Default
src_class_iri str

The IRI of the source ontology class \(c\).

required
tgt_class_iri str

The IRI of the target ontology class \(c'\).

required
pool_size int

The maximum number of plausible mappings to be extended. Defaults to 200.

200

Returns:

Type Description
List[EntityMapping]

A list of one-hop extended mappings.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def one_hop_extend(self, src_class_iri: str, tgt_class_iri: str, pool_size: int = 200):
    r"""Extend mappings from a scored class pair $(c, c')$ by
    searching from one-hop neighbors.

    Search for plausible mappings between the parents of $c$ and $c'$,
    and between the children of $c$ and $c'$. Mappings that are not
    already computed (recorded in `self.mapping_score_dict`) and have
    a score $\geq$ `self.mapping_extension_threshold` will be returned as
    **new** mappings.

    Args:
        src_class_iri (str): The IRI of the source ontology class $c$.
        tgt_class_iri (str): The IRI of the target ontology class $c'$.
        pool_size (int, optional): The maximum number of plausible mappings to be extended. Defaults to 200.

    Returns:
        (List[EntityMapping]): A list of one-hop extended mappings.
    """

    def get_iris(owl_objects):
        return [str(x.getIRI()) for x in owl_objects]

    src_class = self.src_onto.get_owl_object(src_class_iri)
    src_class_parent_iris = get_iris(self.src_onto.get_asserted_parents(src_class, named_only=True))
    src_class_children_iris = get_iris(self.src_onto.get_asserted_children(src_class, named_only=True))

    tgt_class = self.tgt_onto.get_owl_object(tgt_class_iri)
    tgt_class_parent_iris = get_iris(self.tgt_onto.get_asserted_parents(tgt_class, named_only=True))
    tgt_class_children_iris = get_iris(self.tgt_onto.get_asserted_children(tgt_class, named_only=True))

    # pair up parents and children, respectively; NOTE set() might not be necessary
    parent_pairs = list(set(itertools.product(src_class_parent_iris, tgt_class_parent_iris)))
    children_pairs = list(set(itertools.product(src_class_children_iris, tgt_class_children_iris)))

    candidate_pairs = parent_pairs + children_pairs
    # downsample if the number of candidates is too large
    if len(candidate_pairs) > pool_size:
        candidate_pairs = random.sample(candidate_pairs, pool_size)

    extended_mappings = []
    for src_candidate_iri, tgt_candidate_iri in candidate_pairs:

        # if already computed meaning that it is not a new mapping
        if (src_candidate_iri, tgt_candidate_iri) in self.mapping_score_dict:
            continue

        src_candidate_annotations = self.mapping_predictor.src_annotation_index[src_candidate_iri]
        tgt_candidate_annotations = self.mapping_predictor.tgt_annotation_index[tgt_candidate_iri]
        score = self.mapping_predictor.bert_mapping_score(src_candidate_annotations, tgt_candidate_annotations)
        # add to already scored collection
        self.mapping_score_dict[(src_candidate_iri, tgt_candidate_iri)] = score

        # skip mappings with low scores
        if score < self.mapping_extension_threshold:
            continue

        extended_mappings.append((src_candidate_iri, tgt_candidate_iri, score))

    self.logger.info(
        f"New mappings (in tuples) extended from {(src_class_iri, tgt_class_iri)} are:\n" + f"{extended_mappings}"
    )

    return extended_mappings

mapping_repair()

Repair the filtered mappings with LogMap's debugger.

Note

A sub-folder under match named logmap-repair contains LogMap-related intermediate files.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def mapping_repair(self):
    """Repair the filtered mappings with LogMap's debugger.

    !!! note

        A sub-folder under `match` named `logmap-repair` contains LogMap-related intermediate files.
    """

    # progress bar for animation purposes
    self.enlighten_status.update(demo="Mapping Repairing")
    repair_progress_bar = self.enlighten_manager.counter(
        desc=f"Mapping Repairing", unit="mapping"
    )

    # skip repairing if already found the file
    if os.path.exists(self.repaired_mapping_path):
        self.logger.info(
            f"Found the repaired mapping file at {self.repaired_mapping_path}."
            + "\nPlease check file integrity; if incomplete, "
            + "delete it and re-run the program."
        )
        # update progress bar for animation purposes
        for _ in EntityMapping.read_table_mappings(self.repaired_mapping_path):
            repair_progress_bar.update()
        repair_progress_bar.close()
        return 

    # start mapping repair
    self.logger.info("Repair the filtered mappings with LogMap debugger.")
    # formatting the filtered mappings
    self.logmap_repair_formatting()

    # run the LogMap repair module on the extended mappings
    run_logmap_repair(
        self.src_onto.owl_path,
        self.tgt_onto.owl_path,
        os.path.join(self.logmap_repair_path, f"filtered_mappings_for_LogMap_repair.txt"),
        self.logmap_repair_path,
        Ontology.get_max_jvm_memory()
    )

    # create table mappings from LogMap repair outputs
    with open(os.path.join(self.logmap_repair_path, "mappings_repaired_with_LogMap.tsv"), "r") as f:
        lines = f.readlines()
    with open(os.path.join(self.output_path, "match", "repaired_mappings.tsv"), "w+") as f:
        f.write("SrcEntity\tTgtEntity\tScore\n")
        for line in lines:
            src_ent_iri, tgt_ent_iri, score = line.split("\t")
            f.write(f"{src_ent_iri}\t{tgt_ent_iri}\t{score}")
            repair_progress_bar.update()

    self.logger.info("Mapping repair finished.")
    repair_progress_bar.close()

logmap_repair_formatting()

Transform the filtered mapping file into the LogMap format.

An auxiliary function of the mapping repair module, which requires the mappings to be in LogMap's input format.

Source code in src/deeponto/align/bertmap/mapping_refinement.py
def logmap_repair_formatting(self):
    """Transform the filtered mapping file into the LogMap format.

    An auxiliary function of the mapping repair module which requires mappings
    to be formatted as LogMap's input format.
    """
    # read the filtered mapping file and convert to tuples
    filtered_mappings = EntityMapping.read_table_mappings(self.filtered_mapping_path)
    filtered_mappings_in_tuples = [m.to_tuple(with_score=True) for m in filtered_mappings]

    # write the mappings into logmap format
    lines = []
    for src_class_iri, tgt_class_iri, score in filtered_mappings_in_tuples:
        lines.append(f"{src_class_iri}|{tgt_class_iri}|=|{score}|CLS\n")

    # create a path to prevent error
    create_path(self.logmap_repair_path)
    formatted_file = os.path.join(self.logmap_repair_path, f"filtered_mappings_for_LogMap_repair.txt")
    with open(formatted_file, "w") as f:
        f.writelines(lines)

    return lines
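
For reference, each line written by this function follows LogMap's pipe-delimited input format; a sketch with placeholder IRIs and score.

line = "http://example.org/src#Heart|http://example.org/tgt#Heart|=|0.9997|CLS\n"
src_iri, tgt_iri, relation, score, entity_type = line.strip().split("|")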
