Python LASER Embeddings for Text Identifier Matching

Problem

The objective is to detect which of a fixed set of predefined financial terms (identifiers) appear in a given text line. Assume the identifiers of interest are:

["revenues", "operating expenses", "operating profit", "depreciation", "interest", "net profit", "tax", "profit after tax", "metric 1"]

Given an input line like:

"operating profit, net profit and profit after tax"

We aim to detect which identifiers appear in this line.

Semantic Matching with LASER

Instead of relying on exact or fuzzy text matches, we use semantic similarity. This approach leverages LASER embeddings to capture the semantic meaning of text and compares it using cosine similarity.
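Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. A quick sketch with hand-made toy vectors (not real LASER output) illustrates the equivalence:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity; for unit-length vectors this equals the dot product
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for two normalized embedding vectors
a = np.array([0.6, 0.8])  # unit length
b = np.array([0.8, 0.6])  # unit length

print(cosine(a, b))         # 0.96
print(float(np.dot(a, b)))  # same value, since both vectors are normalized
```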

Implementation

Preprocessing the Text

Before embedding, each string is normalized: lowercased, trimmed, and internal runs of whitespace collapsed to single spaces. This ensures identifiers and input lines are compared in a uniform form.

def preprocess(text):
    # Lowercase, trim, and collapse repeated whitespace
    return " ".join(text.lower().split())

Embedding Identifiers and Input Line

The LASER encoder generates normalized embeddings for both the list of identifiers and the input/OCR line. One way to construct the encoder (an assumption here; any LASER wrapper exposing sentence encoding works) is the LaserEncoderPipeline from the laser_encoders package:

from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="eng_Latn")  # downloads the model on first use

identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]

Ranking Identifiers by Specificity

Longer identifiers are prioritized by sorting them in descending order of word count. This helps handle nested matches, where a longer identifier subsumes a shorter one (e.g., “profit after tax” subsumes “tax”).

ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)
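With the sample identifier list from above, the three-word “profit after tax” moves to the front, and Python's stable sort preserves the original order within each word count:

```python
identifiers = ["revenues", "operating expenses", "operating profit",
               "depreciation", "interest", "net profit", "tax",
               "profit after tax", "metric 1"]

# Sort by word count, longest first; ties keep their original order
ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)

print(ranked_identifiers[0])   # profit after tax
print(ranked_identifiers[-1])  # tax
```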

Calculating Similarity

Using cosine similarity, we measure how semantically similar each identifier is to the input line. Identifiers with similarity above a specified threshold are considered matches.

from sklearn.metrics.pairwise import cosine_similarity

matches = []
threshold = 0.6  # empirically chosen similarity cutoff

for idx, identifier_embedding in enumerate(ranked_embeddings):
    # cosine_similarity expects 2D arrays; extract the single scalar result
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((ranked_identifiers[idx], similarity))

Resolving Nested Matches

To handle overlapping identifiers, matches are processed in descending score order, and a match is discarded if it is a substring of a longer identifier that has already been accepted.

resolved_matches = []
# Walk matches from highest to lowest score; drop any identifier that is a
# proper substring of an already-accepted longer identifier
for identifier, score in sorted(matches, key=lambda x: x[1], reverse=True):
    if not any(identifier in longer_id and len(identifier) < len(longer_id)
               for longer_id, _ in resolved_matches):
        resolved_matches.append((identifier, score))
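Run on hypothetical scores for the sample line (the numbers below are made up for illustration, not real LASER similarities), the resolution step keeps the three distinct terms and drops “tax”, which is contained in the already-accepted “profit after tax”:

```python
# Hypothetical (identifier, score) pairs for the line
# "operating profit, net profit and profit after tax"
matches = [("profit after tax", 0.82), ("net profit", 0.78),
           ("operating profit", 0.75), ("tax", 0.70)]

resolved_matches = []
for identifier, score in sorted(matches, key=lambda x: x[1], reverse=True):
    # Skip identifiers already covered by a longer accepted match
    if not any(identifier in longer_id and len(identifier) < len(longer_id)
               for longer_id, _ in resolved_matches):
        resolved_matches.append((identifier, score))

print([m[0] for m in resolved_matches])
# ['profit after tax', 'net profit', 'operating profit']
```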