Python LASER Embeddings for Text Identifier Matching

Problem

The objective is to detect which financial terms appear in a given line of text. Assume a fixed set of predefined terms, called identifiers below, that covers everything of interest, such as:

["revenues", "operating expenses", "operating profit", "depreciation", "interest", "net profit", "tax", "profit after tax", "metric 1"]

Given an input line like:

"operating profit, net profit and profit after tax"

We aim to detect which identifiers appear in this line.

Semantic Matching with LASER

Instead of relying on exact or fuzzy text matches, we use semantic similarity. This approach leverages LASER embeddings to capture the semantic meaning of text and compares it using cosine similarity.
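As a minimal sketch of the comparison step, the snippet below computes cosine similarity by hand on toy 3-dimensional vectors. The vectors and the helper names (cosine, normalize) are made up for illustration and are not LASER output. It also shows why normalized embeddings are convenient: for unit-length vectors, cosine similarity reduces to a plain dot product.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(u):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

# Toy vectors standing in for embeddings (illustrative only).
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]
c = [0.0, 0.0, 5.0]

sim_ab = cosine(a, b)  # vectors pointing in similar directions
sim_ac = cosine(a, c)  # orthogonal vectors

# For unit-length vectors the dot product alone gives the same value.
dot_unit = sum(x * y for x, y in zip(normalize(a), normalize(b)))

print(round(sim_ab, 2), sim_ac, round(dot_unit, 2))
```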

Implementation

Preprocessing the Text

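Before embedding, each string is lowercased and its whitespace is normalized: leading and trailing spaces are stripped, and internal runs of whitespace collapse to a single space. This keeps the identifiers and the input line in a uniform shape.

def preprocess(text):
    # Lowercase, then split/join to trim the ends and collapse repeated whitespace.
    return " ".join(text.lower().split())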

Embedding Identifiers and Input Line

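The LASER encoder generates normalized embeddings for both the list of identifiers and the input (OCR) line. The snippet assumes encoder is a LaserEncoderPipeline from the laser_encoders package, and that identifiers and ocr_line have already been passed through preprocess.

from laser_encoders import LaserEncoderPipeline  # assumed package; the post does not name it
encoder = LaserEncoderPipeline(lang="eng_Latn")
identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]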

Ranking Identifiers by Specificity

Identifiers are sorted by word count, longest first. This helps with nested matches, where a longer identifier contains a shorter one (e.g., “profit after tax” contains “tax”).

ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)
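To make the ordering concrete, here is a small sketch that applies the same word-count sort to the example identifier list from the Problem section. The helper name word_count is introduced here for readability only. Python's sort is stable even with reverse=True, so identifiers with the same word count keep their original relative order.

```python
identifiers = [
    "revenues", "operating expenses", "operating profit", "depreciation",
    "interest", "net profit", "tax", "profit after tax", "metric 1",
]

def word_count(identifier):
    """Number of whitespace-separated words in an identifier."""
    return len(identifier.split())

# Longest identifiers first; ties keep their original order.
ranked = sorted(identifiers, key=word_count, reverse=True)

for identifier in ranked:
    print(word_count(identifier), identifier)
```

The three-word "profit after tax" comes first, followed by the two-word identifiers, then the single words.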

Calculating Similarity

Using cosine similarity, we measure how semantically similar each identifier is to the input line. Identifiers with similarity above a specified threshold are considered matches.

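from sklearn.metrics.pairwise import cosine_similarity

matches = []
threshold = 0.6  # minimum similarity for an identifier to count as a match

for idx, identifier_embedding in enumerate(ranked_embeddings):
    # cosine_similarity expects 2-D inputs and returns a 1x1 matrix here.
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((ranked_identifiers[idx], similarity))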
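The threshold controls how many identifiers survive, so it is worth checking a few values against labeled lines before settling on one. The sketch below uses hypothetical similarity scores (invented for illustration, not produced by LASER) and a hypothetical helper, filter_by_threshold, to show how the match list shrinks as the threshold rises.

```python
def filter_by_threshold(scores, threshold):
    """Keep (identifier, score) pairs whose score meets the threshold."""
    return [(name, score) for name, score in scores if score >= threshold]

# Hypothetical scores for one input line (illustrative values only).
scores = [
    ("profit after tax", 0.72),
    ("operating profit", 0.68),
    ("net profit", 0.66),
    ("tax", 0.61),
    ("revenues", 0.41),
]

loose = filter_by_threshold(scores, 0.6)
strict = filter_by_threshold(scores, 0.7)

print(len(loose), "matches at 0.6")
print(len(strict), "matches at 0.7")
```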

Resolving Nested Matches

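To handle overlapping identifiers, matches are visited longest first, and a shorter match is dropped when it is contained in a longer match that has already been accepted. Note that this also drops a short identifier that occurs on its own elsewhere in the line.

resolved_matches = []
# Visit longer identifiers first so each shorter one can be checked against them.
for identifier, score in sorted(matches, key=lambda x: len(x[0].split()), reverse=True):
    if not any(identifier in longer_id and len(identifier) < len(longer_id) for longer_id, _ in resolved_matches):
        resolved_matches.append((identifier, score))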
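The following worked example sketches the nested-match rule on hypothetical scores (the values are invented for illustration). It is a variation rather than the exact code from this post: the function name resolve_nested is made up here, candidates are ordered by word count, and containment is checked on whole words by padding both strings with spaces, so that "tax" would not be treated as nested inside a word like "taxation".

```python
def resolve_nested(matches):
    """Drop a match when it appears as whole words inside a longer kept match."""
    # Longest identifiers first, so shorter ones are compared against them.
    by_length = sorted(matches, key=lambda m: len(m[0].split()), reverse=True)
    resolved = []
    for identifier, score in by_length:
        nested = any(
            identifier != kept and f" {identifier} " in f" {kept} "
            for kept, _ in resolved
        )
        if not nested:
            resolved.append((identifier, score))
    return resolved

# Hypothetical (identifier, similarity) pairs that passed the threshold.
matches = [
    ("tax", 0.80),
    ("operating profit", 0.78),
    ("net profit", 0.74),
    ("profit after tax", 0.71),
]

resolved = resolve_nested(matches)
for identifier, score in resolved:
    print(identifier, score)
```

Here "tax" has the highest score but is still dropped, because it is contained in the longer "profit after tax".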


Parker Lennox
