Text Utilities
Tokenizer(tokenizer_type)
A Tokenizer class for both sub-word (pre-trained) and word (rule-based) level tokenization.
Source code in src/deeponto/utils/text_utils.py
from_pretrained(pretrained_path='bert-base-uncased')
classmethod
(Based on transformers) Load a sub-word level tokenizer from a pre-trained model.
Source code in src/deeponto/utils/text_utils.py
from_rule_based()
classmethod
(Based on spacy) Load a word-level (rule-based) tokenizer.
Source code in src/deeponto/utils/text_utils.py
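To illustrate the difference between the two tokenization levels, here is a minimal, standard-library-only sketch. It does not call transformers or spacy: `ToyTokenizer`, `from_vocab`, and the tiny vocabulary are hypothetical stand-ins, and the sub-word branch uses a greedy longest-match-first split in the WordPiece style rather than a real pre-trained model.

```python
import re
from typing import Callable, List, Set


class ToyTokenizer:
    """Hypothetical stand-in: one callable wrapper, two factory classmethods."""

    def __init__(self, tokenize_fn: Callable[[str], List[str]]):
        self._tokenize_fn = tokenize_fn

    def __call__(self, text: str) -> List[str]:
        return self._tokenize_fn(text)

    @classmethod
    def from_rule_based(cls) -> "ToyTokenizer":
        # Word level: lowercase, then keep runs of letters and digits.
        return cls(lambda text: re.findall(r"[a-z0-9]+", text.lower()))

    @classmethod
    def from_vocab(cls, vocab: Set[str]) -> "ToyTokenizer":
        # Sub-word level: greedy longest-match-first against a fixed vocabulary;
        # word-internal pieces carry a "##" prefix.
        def tokenize(text: str) -> List[str]:
            pieces: List[str] = []
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                word_pieces: List[str] = []
                start = 0
                while start < len(word):
                    end = len(word)
                    while end > start:
                        piece = word[start:end] if start == 0 else "##" + word[start:end]
                        if piece in vocab:
                            break
                        end -= 1
                    if end == start:
                        # No vocabulary entry fits: mark the whole word as unknown.
                        word_pieces = ["[UNK]"]
                        break
                    word_pieces.append(piece)
                    start = end
                pieces.extend(word_pieces)
            return pieces

        return cls(tokenize)


word_level = ToyTokenizer.from_rule_based()
sub_word = ToyTokenizer.from_vocab({"heart", "##burn", "##s"})
word_tokens = word_level("Heartburn symptoms")
sub_tokens = sub_word("heartburns")
unknown_tokens = sub_word("kidney")
```

Both constructors return an object with the same call interface, which is the property that lets downstream code (such as an inverted index) stay agnostic about the tokenization level.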
InvertedIndex(index, tokenizer)
Inverted index built from a text index.
Attributes:
Name | Type | Description |
---|---|---|
tokenizer | Tokenizer | A tokenizer instance to be used. |
original_index | defaultdict | A dictionary where the values are text strings to be tokenized. |
constructed_index | defaultdict | A dictionary that acts as the inverted index of `original_index`. |
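The relationship between the original index and the constructed index can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: it assumes each value of the original index is a list of text strings, and it uses a regex-based stand-in for the tokenizer. The names `simple_tokenize` and `build_inverted_index` and the `ex:` keys are hypothetical.

```python
import re
from collections import defaultdict


def simple_tokenize(text):
    # Assumed stand-in for a tokenizer instance: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(original_index, tokenize):
    """Map each token to the keys whose text strings contain it."""
    constructed_index = defaultdict(list)
    for key, texts in original_index.items():
        # Collect tokens as a set so each key is listed at most once per token.
        tokens = {tok for text in texts for tok in tokenize(text)}
        for tok in sorted(tokens):
            constructed_index[tok].append(key)
    return constructed_index


original_index = {
    "ex:Heart": ["heart", "cardiac organ"],
    "ex:HeartAttack": ["heart attack", "myocardial infarction"],
}
constructed_index = build_inverted_index(original_index, simple_tokenize)
```

Looking up `constructed_index["heart"]` then yields every key that mentions the token, which is the lookup that candidate selection relies on.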
Source code in src/deeponto/utils/text_utils.py
idf_select(texts, pool_size=200)
Given a list of tokens, select a set of candidates based on their inverse document frequency (idf) scores.
We use `idf` instead of `tf` because labels have different lengths, so `tf` is not a fair measure.
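The idf-based ranking described above can be sketched as follows. This is a hedged illustration rather than the package's code: the function name `idf_select_sketch`, the explicit `num_documents` argument, the base-10 logarithm, and the sample index are all assumptions made for the example.

```python
import math
from collections import defaultdict


def idf_select_sketch(tokens, constructed_index, num_documents, pool_size=200):
    """Rank candidates by the summed idf of the query tokens they share."""
    scores = defaultdict(float)
    for token in tokens:
        candidates = constructed_index.get(token, [])
        if not candidates:
            continue  # token never seen in the index
        # Rare tokens (few candidates) contribute more than common ones.
        idf = math.log10(num_documents / len(candidates))
        for candidate in candidates:
            scores[candidate] += idf
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:pool_size]


# Hypothetical inverted index over 4 documents (keys of the original index).
constructed_index = {
    "heart": ["ex:Heart", "ex:HeartAttack", "ex:HeartRate"],
    "attack": ["ex:HeartAttack"],
    "rate": ["ex:HeartRate", "ex:TaxRate"],
}
ranked = idf_select_sketch(["heart", "attack"], constructed_index, num_documents=4)
top_one = idf_select_sketch(["heart", "attack"], constructed_index, num_documents=4, pool_size=1)
```

Because each matching token adds a fixed idf weight regardless of how long the candidate's labels are, a candidate with many long labels gains no advantage over one with a single short label.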
Source code in src/deeponto/utils/text_utils.py
process_annotation_literal(annotation_literal, apply_lowercasing=False, normalise_identifiers=False)
Pre-process an annotation literal string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
annotation_literal | str | A literal string of an entity's annotation. | required |
apply_lowercasing | bool | Whether to apply lowercasing. Defaults to `False`. | False |
normalise_identifiers | bool | Whether to normalise annotation text that is in the Java identifier format. Defaults to `False`. | False |
Returns:
Type | Description |
---|---|
str | The processed annotation literal string. |
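One plausible way the two flags could interact is sketched below. The function `preprocess_literal_sketch` is hypothetical, and both the `str.isidentifier()` check and the splitting regex are assumptions chosen for illustration; the actual function may apply different or additional steps.

```python
import re


def preprocess_literal_sketch(literal, apply_lowercasing=False, normalise_identifiers=False):
    # Assumption: only a literal that is a single identifier-like token
    # is treated as a Java-style name and split into words.
    if normalise_identifiers and literal.isidentifier():
        words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", literal)
        literal = " ".join(words)
    # Lowercasing is applied last so the split can still see capital letters.
    if apply_lowercasing:
        literal = literal.lower()
    return literal


unchanged = preprocess_literal_sketch("HeartAttack")
split_only = preprocess_literal_sketch("HeartAttack", normalise_identifiers=True)
split_and_lower = preprocess_literal_sketch(
    "HeartAttack", apply_lowercasing=True, normalise_identifiers=True
)
```

With both flags left at their defaults the literal is returned unchanged, matching the `False` defaults in the parameter table.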
Source code in src/deeponto/utils/text_utils.py
split_java_identifier(java_style_identifier)
Split a Java-style identifier into a natural-language phrase.
Examples:
"SuperNaturalPower" → "Super Natural Power"
"APIReference" → "API Reference"
"Covid19" → "Covid 19"
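The three examples above can be reproduced with a single regular expression. This is a minimal sketch, not necessarily how the function is implemented; `split_identifier_sketch` is a hypothetical name, and the pattern assumes ASCII letters and digits only.

```python
import re

# Three alternatives, tried in order at each position:
#   1. a run of capitals not followed by a lowercase letter (acronyms such as "API")
#   2. an optional capital followed by lowercase letters (ordinary words)
#   3. a run of digits
_WORD_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier_sketch(identifier):
    """Join the word-like chunks of a camel-case identifier with spaces."""
    return " ".join(_WORD_PATTERN.findall(identifier))


results = [
    split_identifier_sketch(name)
    for name in ("SuperNaturalPower", "APIReference", "Covid19")
]
```

The negative lookahead is what keeps the final capital of an acronym attached to the following word: in "APIReference" the capital run backs off from "APIR" to "API" because "R" is followed by a lowercase letter.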
Source code in src/deeponto/utils/text_utils.py
Created: January 14, 2023