Problem
The objective is to identify specific financial terms (categories) in a given text line. Let’s assume we have a fixed set of predefined categories that represent all possible terms of interest, such as:
["revenues", "operating expenses", "operating profit", "depreciation", "interest", "net profit", "tax", "profit after tax", "metric 1"]
Given an input line like:
"operating profit, net profit and profit after tax"
We aim to detect which of these categories (called identifiers in the rest of this article) appear in this line.
Semantic Matching with LASER
Instead of relying on exact or fuzzy text matches, we use semantic similarity. This approach uses LASER (Language-Agnostic SEntence Representations) embeddings to capture the meaning of each piece of text and compares the resulting embeddings using cosine similarity.
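To make the comparison step concrete, here is a minimal sketch of cosine similarity on toy vectors. The vectors and the helper names `cosine` and `unit` are illustrative assumptions, not real LASER output. The sketch also shows why normalized embeddings are convenient: for unit-length vectors, cosine similarity reduces to a plain dot product.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def unit(v):
    """Scale a vector to unit length (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy 3-dimensional vectors standing in for real embeddings.
u = [1.0, 2.0, 2.0]
v = [2.0, 1.0, 2.0]

sim = cosine(u, v)  # dot product 8 divided by norms 3 * 3
# After normalization, the dot product alone gives the same value.
sim_unit = sum(x * y for x, y in zip(unit(u), unit(v)))
```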
Implementation
Preprocessing the Text
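Before embedding, the text is preprocessed by converting it to lowercase, trimming leading and trailing whitespace, and collapsing repeated spaces. This keeps differences in casing and spacing from affecting the embeddings.
def preprocess(text):
    # Lowercase, then split on any whitespace and rejoin with single spaces.
    return " ".join(text.lower().split())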
Embedding Identifiers and Input Line
The LASER encoder generates normalized embeddings for both the list of identifiers and the input/OCR line.
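# Apply the same preprocessing to both sides before embedding.
identifiers = [preprocess(identifier) for identifier in identifiers]
ocr_line = preprocess(ocr_line)
identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]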
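The snippet above assumes that an `encoder` object already exists. One possible way to create it is sketched here, assuming the `laser_encoders` package and its `LaserEncoderPipeline` class with the language code `eng_Latn`; the article does not say which LASER wrapper it uses, so treat this as an assumption. `build_encoder` and `embed_texts` are hypothetical helper names introduced for illustration.

```python
def build_encoder(lang="eng_Latn"):
    """Create a LASER encoder.

    Assumes the `laser_encoders` package is installed. The import is
    deferred so this module loads without it; the first call may
    download model files.
    """
    from laser_encoders import LaserEncoderPipeline
    return LaserEncoderPipeline(lang=lang)


def embed_texts(encoder, identifiers, ocr_line):
    """Return (identifier_embeddings, ocr_line_embedding).

    Works with any object that exposes
    encode_sentences(list_of_str, normalize_embeddings=...).
    """
    identifier_embeddings = encoder.encode_sentences(
        identifiers, normalize_embeddings=True
    )
    # encode_sentences takes a list, so wrap the single line and unwrap the result.
    ocr_line_embedding = encoder.encode_sentences(
        [ocr_line], normalize_embeddings=True
    )[0]
    return identifier_embeddings, ocr_line_embedding
```

With the package installed, usage would look like `encoder = build_encoder()` followed by `embed_texts(encoder, identifiers, ocr_line)`.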
Ranking Identifiers by Specificity
Longer identifiers are prioritized by sorting them by word count, in descending order. This helps handle nested matches, where a longer identifier contains a shorter one (e.g., “profit after tax” contains “tax”).
ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)
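As a quick check of what this ordering produces, the sketch below applies the same sort to the category list from the Problem section. The helper name `word_count` is introduced here for illustration only. Python's sort is stable even with `reverse=True`, so identifiers with the same word count keep their original relative order.

```python
identifiers = [
    "revenues", "operating expenses", "operating profit", "depreciation",
    "interest", "net profit", "tax", "profit after tax", "metric 1",
]

def word_count(text):
    """Number of whitespace-separated words in an identifier."""
    return len(text.split())

# Most words first; ties stay in their original order (stable sort).
ranked = sorted(identifiers, key=word_count, reverse=True)
counts = [word_count(identifier) for identifier in ranked]
```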
Calculating Similarity
Using cosine similarity, we measure how semantically similar each identifier is to the input line. Identifiers with similarity above a specified threshold are considered matches.
from sklearn.metrics.pairwise import cosine_similarity

matches = []
threshold = 0.6
for idx, identifier_embedding in enumerate(ranked_embeddings):
    # cosine_similarity expects 2-D inputs and returns a 1x1 matrix here.
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((ranked_identifiers[idx], similarity))
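Because the embeddings are normalized, the same scores can be computed for all identifiers at once with a single matrix-vector product. The following is a sketch under that assumption, using NumPy and toy 2-D unit vectors in place of real LASER embeddings; `match_identifiers` is a hypothetical helper name.

```python
import numpy as np

def match_identifiers(ranked_identifiers, ranked_embeddings, line_embedding, threshold=0.6):
    """Return (identifier, similarity) pairs at or above the threshold.

    Assumes every embedding is already L2-normalized, so the dot
    product equals the cosine similarity.
    """
    sims = np.asarray(ranked_embeddings) @ np.asarray(line_embedding)
    return [
        (name, float(score))
        for name, score in zip(ranked_identifiers, sims)
        if score >= threshold
    ]

# Toy unit vectors, chosen only to illustrate the thresholding.
names = ["net profit", "tax", "revenues"]
embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
line = np.array([1.0, 0.0])

toy_matches = match_identifiers(names, embeddings, line)
```

The input order is preserved, so passing identifiers ranked by word count yields matches in that same order.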
Resolving Nested Matches
To handle overlapping identifiers, candidates are processed from longest to shortest, and a shorter identifier is dropped when it is contained in a longer identifier that has already been accepted.
resolved_matches = []
# Longest identifiers first, so a containing match is accepted before its substrings.
for identifier, score in sorted(matches, key=lambda x: len(x[0]), reverse=True):
    if not any(identifier in longer_id and len(identifier) < len(longer_id) for longer_id, _ in resolved_matches):
        resolved_matches.append((identifier, score))
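The effect of this step is easiest to see on a small example. The sketch below wraps the resolution logic in a hypothetical helper, `resolve_nested`, and processes candidates from longest to shortest so that a containing identifier is always accepted before the identifiers nested inside it. The scores are made up for illustration.

```python
def resolve_nested(matches):
    """Drop identifiers contained in a longer identifier that was already kept."""
    resolved = []
    # Longest identifiers first, so containers are accepted before their substrings.
    for identifier, score in sorted(matches, key=lambda m: len(m[0]), reverse=True):
        contained = any(
            identifier in kept and len(identifier) < len(kept)
            for kept, _ in resolved
        )
        if not contained:
            resolved.append((identifier, score))
    return resolved

# Hypothetical scores: the short identifier scores highest but is nested.
example = [("tax", 0.82), ("profit after tax", 0.74), ("net profit", 0.71)]
result = resolve_nested(example)
```

Here “tax” is dropped because it is contained in “profit after tax”, even though its score is higher. The containment test is a plain substring check, so it would also match inside a longer word; comparing word sequences instead is one possible refinement.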

