MinHash LSH deduplication and negation-based contradiction detection APIs.

Deduplicator

MinHash LSH-based near-duplicate detection for text content.
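For intuition about why LSH surfaces near-duplicates while skipping most unrelated documents, the textbook banding analysis can be sketched as below. This is the general formula for MinHash LSH, not code taken from this library; the split into 16 bands of 8 rows is a hypothetical choice whose product equals 128.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents collide in at least one LSH band.

    Standard banding result: 1 - (1 - s**rows) ** bands, where s is the
    Jaccard similarity of the two documents.
    """
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Hypothetical split of 128 permutations into 16 bands of 8 rows each.
for s in (0.2, 0.5, 0.8, 0.95):
    p = candidate_probability(s, bands=16, rows=8)
    print(f"similarity={s:.2f} -> candidate probability={p:.3f}")
```

The curve is steep around a cutoff determined by the band and row counts, which is what lets a similarity threshold act as a practical filter.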

Constructor

python
from arriadne import Deduplicator

dedup = Deduplicator(threshold=0.8, num_perm=128)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.8 | Jaccard similarity threshold (0.0–1.0) |
| num_perm | int | 128 | Number of MinHash permutations |
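To make the threshold concrete, here is a minimal sketch of exact Jaccard similarity. It assumes lowercase word tokens; how the library actually tokenizes text (words, character shingles, or something else) is not stated here, so real scores may differ.

```python
def jaccard(text_a: str, text_b: str) -> float:
    """Exact Jaccard similarity of two texts' lowercase word sets."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty texts match
    return len(a & b) / len(a | b)


score = jaccard(
    "Deploy to production using kubectl",
    "Deploy to production via kubectl",
)
print(f"{score:.3f}")  # 4 shared words out of 6 distinct words
```

Under this word-level assumption the pair scores about 0.667, which would fall below a 0.8 threshold; a different tokenization can produce a different score for the same texts.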

TIP

Higher num_perm improves accuracy but uses more memory. 128 is a good default; use 64 for speed or 256 for precision.
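The trade-off can be explored with a small pure-Python MinHash sketch. It is an illustration only and makes several assumptions: word tokens, salted BLAKE2b hashes standing in for permutations, and helper names (minhash_signature, estimate_jaccard) that are hypothetical rather than part of this library.

```python
import hashlib


def minhash_signature(tokens: set[str], num_perm: int) -> list[int]:
    """Return num_perm minimum hash values for a non-empty token set."""
    signature = []
    for seed in range(num_perm):
        # A distinct salt per seed stands in for one random permutation.
        salt = seed.to_bytes(8, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        tok.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for tok in tokens
            )
        )
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions where both minimums agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = set("deploy to production using kubectl apply".split())
b = set("deploy to prod via kubectl apply".split())
exact = len(a & b) / len(a | b)

for num_perm in (64, 128, 256):
    estimate = estimate_jaccard(
        minhash_signature(a, num_perm), minhash_signature(b, num_perm)
    )
    print(f"num_perm={num_perm}: estimate={estimate:.3f} (exact={exact:.3f})")
```

Each signature stores one integer per permutation, so memory and hashing work grow linearly with num_perm, while the estimate's variance shrinks as more positions are compared.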

add()

Add content to the deduplication index.

python
doc_id = dedup.add(content, doc_id="memory_42")

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| content | str | required | Text to index |
| doc_id | str \| None | None | Document ID (auto-generated if None) |

Returns: str — The document ID.

remove()

Remove a document from the index.

python
removed = dedup.remove("memory_42")

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| doc_id | str | required | Document ID to remove |

Returns: bool (True if removed, False if not found).

is_duplicate()

Check if content is a near-duplicate of any indexed content.

python
is_dup = dedup.is_duplicate("Deploy to production using kubectl")

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| content | str | required | Content to check |

Returns: bool (True if an indexed document's similarity meets the threshold).

find_duplicates()

Find all near-duplicates with similarity scores.

python
duplicates = dedup.find_duplicates("Deploy to production via kubectl")

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| content | str | required | Content to find duplicates for |

Returns: list[dict]

python
[
    {
        "id": str,              # Document ID
        "content": str,         # Original content
        "similarity": float,    # Jaccard similarity (0.0–1.0)
    }
]

Results are sorted by similarity (descending).
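Because the list is sorted in descending order, the strongest match is always the first element. A minimal sketch of consuming the result follows; the sample data is made up to match the documented shape, and best_match is a hypothetical helper, not part of the library.

```python
from typing import Optional

# Made-up sample in the documented return shape (sorted by similarity, descending).
duplicates = [
    {"id": "doc_0", "content": "Deploy to production using kubectl apply", "similarity": 0.91},
    {"id": "doc_1", "content": "Deploy to prod via kubectl apply -f", "similarity": 0.82},
]


def best_match(results: list[dict], min_similarity: float = 0.0) -> Optional[dict]:
    """Return the top result if it meets min_similarity, else None.

    Relies on the documented descending sort, so only the first item is checked.
    """
    if results and results[0]["similarity"] >= min_similarity:
        return results[0]
    return None


top = best_match(duplicates, min_similarity=0.9)
if top is not None:
    print(f"Closest existing document: {top['id']} ({top['similarity']:.2f})")
```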

find_related()

Find loosely related content using a lower effective threshold.

python
related = dedup.find_related("deployment configuration", limit=10)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| content | str | required | Content to find related items for |
| limit | int | 10 | Maximum results |

Returns: list[dict] — Same format as find_duplicates().

size

Property returning the number of indexed documents.

python
print(f"Indexed: {dedup.size}")

ContradictionDetector

Detects contradictions between text statements using negation pattern matching and fact extraction.

Constructor

python
from arriadne import ContradictionDetector

detector = ContradictionDetector()

detect_contradictions()

Detect contradictions between two text statements.

python
contradictions = detector.detect_contradictions(
    "Python is a compiled language",
    "Python is not a compiled language",
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text_a | str | required | First text statement |
| text_b | str | required | Second text statement |

Returns: list[dict]

python
[
    {
        "subject": str,           # e.g., "python"
        "predicate": str,         # e.g., "a compiled language"
        "statement_a": str,       # Original clause from text_a
        "statement_b": str,       # Original clause from text_b
        "negated_in_a": bool,     # Whether fact_a is negated
        "negated_in_b": bool,     # Whether fact_b is negated
    }
]

is_contradictory()

Quick boolean check for contradictions.

python
is_contra = detector.is_contradictory(
    "Redis is a relational database",
    "Redis is not a relational database",
)
print(is_contra)  # True

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text_a | str | required | First text |
| text_b | str | required | Second text |

Returns: bool (True if contradictions are found).

extract_facts()

Extract factual claims from text.

python
facts = detector.extract_facts(
    "Python is dynamic. Java is compiled. Go has garbage collection."
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | str | required | Input text |

Returns: list[dict]

python
[
    {
        "subject": str,       # e.g., "python"
        "predicate": str,     # e.g., "dynamic"
        "negated": bool,      # Whether the fact is negated
        "original": str,      # Original clause text
    }
]
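Facts in this shape are enough to illustrate the idea behind negation-based contradiction detection: two facts conflict when they share a subject and predicate but differ in negation. The sketch below is an assumption about the general approach, not the library's implementation; pair_contradictions is a hypothetical helper and the sample facts are written by hand.

```python
from itertools import product


def pair_contradictions(facts_a: list[dict], facts_b: list[dict]) -> list[dict]:
    """Pair facts that make the same claim with opposite negation."""
    found = []
    for fa, fb in product(facts_a, facts_b):
        same_claim = (
            fa["subject"] == fb["subject"] and fa["predicate"] == fb["predicate"]
        )
        if same_claim and fa["negated"] != fb["negated"]:
            # Output mirrors the shape documented for detect_contradictions().
            found.append({
                "subject": fa["subject"],
                "predicate": fa["predicate"],
                "statement_a": fa["original"],
                "statement_b": fb["original"],
                "negated_in_a": fa["negated"],
                "negated_in_b": fb["negated"],
            })
    return found


# Hand-written facts in the documented extract_facts() shape.
facts_a = [{"subject": "python", "predicate": "a compiled language",
            "negated": False, "original": "Python is a compiled language"}]
facts_b = [{"subject": "python", "predicate": "a compiled language",
            "negated": True, "original": "Python is not a compiled language"},
           {"subject": "go", "predicate": "garbage collection",
            "negated": False, "original": "Go has garbage collection"}]

conflicts = pair_contradictions(facts_a, facts_b)
print(f"Contradictions: {len(conflicts)}")
```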

Negation Patterns

The detector recognizes these negation patterns:

| Pattern | Example |
| --- | --- |
| not, no, never, neither, nor | "Python is not slow" |
| without, cannot, can't, won't | "Can't use Java" |
| wouldn't, shouldn't, don't | "Don't support threads" |
| doesn't, didn't, isn't, aren't | "Isn't compiled" |
| wasn't, weren't, hasn't, haven't | "Hasn't been deprecated" |
| hadn't | "Hadn't been tested" |
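One way to match the words listed above is a single case-insensitive regular expression with word boundaries. This is a sketch under that assumption; the detector's actual matching rules are not shown here, and has_negation is a hypothetical helper.

```python
import re

NEGATION_WORDS = [
    "not", "no", "never", "neither", "nor", "without", "cannot",
    "can't", "won't", "wouldn't", "shouldn't", "don't", "doesn't",
    "didn't", "isn't", "aren't", "wasn't", "weren't", "hasn't",
    "haven't", "hadn't",
]

# Lookarounds keep "not" from matching inside words such as "notable".
NEGATION_RE = re.compile(
    r"(?<![\w'])(?:" + "|".join(re.escape(w) for w in NEGATION_WORDS) + r")(?![\w'])",
    re.IGNORECASE,
)


def has_negation(text: str) -> bool:
    """Return True if the text contains any listed negation word."""
    return NEGATION_RE.search(text) is not None


for sample in ("Python is not slow", "Can't use Java", "Python is fast"):
    print(f"{sample!r}: {has_negation(sample)}")
```

This sketch only recognizes the straight apostrophe; text with typographic apostrophes would need normalizing first.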

Fact Patterns

Facts are extracted using these patterns:

| Pattern | Example |
| --- | --- |
| X is/are/was/were Y | "Python is dynamic" |
| X has/have/had Y | "Go has garbage collection" |
| X can/could/may/might Y | "SQLite can run in-memory" |
| X does/did Y | "Node.js does event-driven I/O" |
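The patterns above can be approximated with one regular expression per clause. The sketch below is a simplified guess at the approach, not the library's code: it handles only "not" and "never" directly after the verb, lowercases the subject and predicate (an assumption based on the lowercase examples in the return format), and uses a hypothetical extract_fact helper.

```python
import re
from typing import Optional

FACT_RE = re.compile(
    r"^\s*(?P<subject>.+?)\s+"
    r"(?P<verb>is|are|was|were|has|have|had|can|could|may|might|does|did)\s+"
    r"(?P<negation>not\s+|never\s+)?"
    r"(?P<predicate>.+?)\s*$",
    re.IGNORECASE,
)


def extract_fact(clause: str) -> Optional[dict]:
    """Parse one 'X <verb> Y' clause into a fact dict, or return None."""
    match = FACT_RE.match(clause.strip().rstrip("."))
    if match is None:
        return None
    return {
        "subject": match.group("subject").lower(),
        "predicate": match.group("predicate").lower(),
        "negated": match.group("negation") is not None,
        "original": clause,
    }


for clause in ("Python is dynamic.", "Python is not a compiled language",
               "Go has garbage collection"):
    print(extract_fact(clause))
```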

Integration with AriadneMemory

Both Deduplicator and ContradictionDetector are used automatically by AriadneMemory:

python
from arriadne import AriadneMemory

mem = AriadneMemory(db_path="memory.db")

# Automatic dedup + contradiction detection
result = mem.remember("Python is a compiled language", importance=0.8)
# Status: created

result = mem.remember("Python is a compiled language", importance=0.8)
# Status: duplicate

result = mem.remember("Python is not a compiled language", importance=0.5)
# Status: created, with contradictions detected

Advanced: Standalone Usage

python
from arriadne import Deduplicator, ContradictionDetector

# Build a standalone dedup index
dedup = Deduplicator(threshold=0.8, num_perm=128)

# Index a corpus
documents = [
    "Deploy to production using kubectl apply",
    "Deploy to prod via kubectl apply -f",
    "User prefers dark mode",
    "User likes dark theme",
]

for i, doc in enumerate(documents):
    dedup.add(doc, doc_id=f"doc_{i}")

# Find duplicates
dups = dedup.find_duplicates("Deploy to production using kubectl")
for d in dups:
    print(f"  {d['id']}: {d['similarity']:.4f}")

# Detect contradictions between any two texts
contra = ContradictionDetector()
result = contra.detect_contradictions(
    "PostgreSQL is a NoSQL database",
    "PostgreSQL is not a NoSQL database",
)
print(f"Contradictions: {len(result)}")

Released under the MIT License.