NLQ Engine

The NLQ Engine is the core orchestrator that transforms natural language questions into validated, executed SQL. It routes questions through multiple trust tiers, generates SQL with semantic context, validates it before execution, and self-corrects on failure.

How It Works

Query Router (`query_router.py`)

The query router classifies incoming questions by intent using keyword scoring and pattern matching. It returns a RouteDecision with the classified intent and matched ontology classes.

Intent classes:

Intent	Description	LLM Required
`knowledge`	Direct definition lookup (acronyms, concepts)	No
`vocabulary`	Ontology context + LLM synthesis for plain English	Yes
`discovery`	Schema exploration ("what tables exist?")	No
`analytical`	Full SQL generation pipeline	Yes
`impact`	Backward/forward propagation analysis	No
`ontop`	SPARQL template match against Virtual Knowledge Graph	No
`metric`	Composable metric computation	Yes

The classify_intent() function uses keyword-to-class mappings that can be customized per workspace. Keywords are scored and matched against registered ontology classes to determine the best routing path.
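As a rough illustration of keyword scoring, the sketch below picks the intent whose keywords carry the highest total weight. The keyword table, the weights, the `RouteSketch` type, and the `score_intent` name are all hypothetical stand-ins, not the actual contents of `query_router.py`, and the real router also matches ontology classes, which this sketch omits.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical keyword-to-intent weights; a real workspace would supply its own.
INTENT_KEYWORDS: Dict[str, Dict[str, float]] = {
    "knowledge": {"what is": 2.0, "define": 2.0, "stand for": 2.0},
    "discovery": {"what tables": 3.0, "which columns": 3.0, "schema": 1.5},
    "analytical": {"how many": 2.0, "average": 2.0, "total": 1.5, "by month": 1.5},
    "impact": {"depends on": 2.5, "downstream": 2.5, "upstream": 2.5},
}


@dataclass
class RouteSketch:
    """Simplified stand-in for a routing decision."""
    intent: str
    score: float
    matched_keywords: List[str] = field(default_factory=list)


def score_intent(question: str, keywords=INTENT_KEYWORDS, default="analytical") -> RouteSketch:
    """Return the intent whose matched keywords have the highest total weight."""
    text = question.lower()
    best = RouteSketch(intent=default, score=0.0)
    for intent, weights in keywords.items():
        hits = [kw for kw in weights if kw in text]
        score = sum(weights[kw] for kw in hits)
        if score > best.score:  # ties keep the earlier intent
            best = RouteSketch(intent, score, hits)
    return best
```

Questions that match no keyword fall back to the default intent, which in this sketch is assumed to be `analytical`.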

SemanticEngine (`engine.py`)

The SemanticEngine class is the main orchestrator. Its answer() method:

Sanitizes the input question (PII scrubbing, injection guards)
Calls the query router to classify intent
Routes to the appropriate trust tier handler
Returns an NLQResult with SQL, data, confidence score, and metadata

Key features:

Circuit breaker — Prevents cascading failures when LLM or database is down
Soft timeout — Queries that exceed EPISTOM_QUERY_TIMEOUT_SECONDS are cancelled
PII masking — Results are scanned for PII patterns and redacted
Multi-source routing — Questions spanning multiple data sources can route through Trino federation
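The circuit breaker idea can be sketched in a few lines. This is a generic version of the pattern, not the engine's own code: the class name, the threshold, and the cooldown value are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a dependency after repeated failures, then retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock          # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, calls are allowed again; one more failure re-opens the circuit.
        return self._clock() - self._opened_at >= self.reset_after_seconds

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def call(self, fn, *args, **kwargs):
        if not self.allow():
            raise RuntimeError("circuit open: dependency marked unavailable")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

Wrapping LLM and database calls in `breaker.call(...)` lets callers fail fast while a dependency is down instead of queuing requests behind it.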

NLQ Engine (`nlq_engine.py`)

The NLQEngine class handles the LLM interaction for SQL generation:

Builds a prompt with semantic context (schema, definitions, verified queries)
Sends it to the configured LLM via the LLMAdapter
Extracts SQL from the response
Passes it through the SQL validator
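The extraction step is often the most fragile one, because LLM responses tend to wrap SQL in prose or a fenced block. A minimal sketch follows, assuming the response either contains a fenced block or is bare SQL; `extract_sql` is a hypothetical helper name, not a function confirmed to exist in `nlq_engine.py`.

```python
import re
from typing import Optional

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
_FENCED_BLOCK = re.compile(FENCE + r"(?:sql)?\s*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_sql(response: str) -> Optional[str]:
    """Return the first SELECT/WITH statement found in an LLM response, or None."""
    match = _FENCED_BLOCK.search(response)
    candidate = match.group(1) if match else response
    candidate = candidate.strip().rstrip(";").strip()
    # Only accept text that starts like a read query; anything else is treated as prose.
    if re.match(r"(?i)(select|with)\b", candidate):
        return candidate
    return None
```

Returning `None` rather than raising keeps the decision about what to do with an unusable response in the caller, which can then trigger a retry.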

Prompt Assembler (`prompt_assembler.py`)

Assembles the LLM prompt by combining:

Database schema (relevant tables and columns)
Ontology definitions (what concepts mean)
Verified query examples (few-shot patterns)
Business rules (constraints and aggregation rules)
The sanitized question (PII patterns and injection attempts removed)
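To make the assembly concrete, here is a sketch that joins the components listed above into one prompt string. The function name, the argument shapes, and the section titles are assumptions for illustration; the actual layout produced by `prompt_assembler.py` may differ.

```python
from typing import Dict, List, Tuple


def assemble_prompt(
    question: str,
    schema: Dict[str, List[str]],
    definitions: Dict[str, str],
    examples: List[Tuple[str, str]],
    rules: List[str],
) -> str:
    """Combine the context sections into a single prompt, skipping empty ones."""
    sections = [
        ("Schema", "\n".join(f"{table}({', '.join(cols)})" for table, cols in schema.items())),
        ("Definitions", "\n".join(f"{term}: {meaning}" for term, meaning in definitions.items())),
        ("Verified examples", "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in examples)),
        ("Business rules", "\n".join(f"- {rule}" for rule in rules)),
        ("Question", question),  # assumed to be sanitized already
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections if body)
```

Placing the question last is a design choice in this sketch: the model reads all of the context before it reaches the task.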

Self-Correction (`self_correction.py`)

When the SQL validator rejects generated SQL, the self-correction module:

Analyzes the validation errors
Appends error context to the prompt
Asks the LLM to regenerate with corrections
Re-validates the corrected SQL

This loop runs up to 2 times before returning a failure.
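The loop can be sketched as follows. The `generate` and `validate` callables are placeholders for the LLM call and the SQL validator, and the sketch assumes the limit of 2 counts correction attempts made after the initial generation.

```python
from typing import Callable, List, Optional, Tuple

MAX_CORRECTIONS = 2  # retry limit described above


def generate_with_correction(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], List[str]],
    max_corrections: int = MAX_CORRECTIONS,
) -> Tuple[Optional[str], List[str], int]:
    """Generate SQL and retry with validator feedback until valid or out of attempts.

    Returns (sql or None, remaining errors, number of corrections used).
    """
    sql = generate(prompt)
    errors = validate(sql)
    attempts = 0
    while errors and attempts < max_corrections:
        attempts += 1
        error_lines = "\n".join(f"- {e}" for e in errors)
        # Append the failed SQL and its errors so the model can see what to fix.
        retry_prompt = (
            f"{prompt}\n\nPrevious SQL:\n{sql}\n\n"
            f"Validation errors:\n{error_lines}\n\n"
            "Return corrected SQL only."
        )
        sql = generate(retry_prompt)
        errors = validate(sql)
    return (None if errors else sql), errors, attempts
```

Capping the number of corrections bounds both latency and LLM cost when the model keeps repeating the same mistake.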

SQL Validator (`sql_validator.py`)

The pre-execution validation gate checks SQL before it touches the database:

Column existence — Every referenced column must exist in the schema
Table existence — Every referenced table must be in a registered source
Join path validation — JOIN conditions must use valid foreign key relationships
PII column check — Queries selecting PII-annotated columns are flagged or blocked
Anti-pattern detection — Common LLM SQL mistakes (HAVING without GROUP BY, correlated subqueries, etc.)
Multi-statement rejection — Only single SELECT statements are allowed
SQL injection guard — DDL, DML, and system commands are rejected

The validator combines AST parsing (via sqlglot) with a regex fallback for broader coverage.

result = validate_sql(sql, schema, source="demo_postgres")
if result.valid:
    pass  # Safe to execute
else:
    print(result.errors)  # Structured error descriptions
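For a sense of what a regex fallback might check, the sketch below covers only multi-statement rejection and the read-only guard, using the standard library. It is deliberately crude and is not the logic of `sql_validator.py`: the function name and keyword list are assumptions, and a column that happens to be named like a forbidden keyword would be rejected.

```python
import re
from typing import List

# String literals and comments are scrubbed first so their contents cannot trigger checks.
_LITERAL_OR_COMMENT = re.compile(r"'(?:[^']|'')*'|--[^\n]*|/\*.*?\*/", re.DOTALL)
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke|copy|call|execute)\b",
    re.IGNORECASE,
)


def _scrub(sql: str) -> str:
    """Replace string literals with '' and comments with a space, in one left-to-right pass."""
    return _LITERAL_OR_COMMENT.sub(
        lambda m: "''" if m.group(0).startswith("'") else " ", sql
    )


def regex_guard(sql: str) -> List[str]:
    """Return a list of error messages; an empty list means no problem was found."""
    errors = []
    body = _scrub(sql).strip().rstrip(";").strip()
    if ";" in body:
        errors.append("multiple statements are not allowed")
    if not re.match(r"(?i)(select|with)\b", body):
        errors.append("only SELECT statements are allowed")
    forbidden = _FORBIDDEN.search(body)
    if forbidden:
        errors.append(f"forbidden keyword: {forbidden.group(1).lower()}")
    return errors
```

A check like this is best treated as a backstop. An AST-based pass can tell a `DROP` statement from a column alias, which a regex cannot do reliably.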

Configuration

EPISTOM_LLM_PROVIDER=anthropic
EPISTOM_LLM_MODEL_ID=claude-sonnet-4-5
EPISTOM_MAX_QUERY_ROWS=1000
EPISTOM_QUERY_TIMEOUT_SECONDS=300
EPISTOM_SQL_READONLY=true
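A service reading these variables might parse them into typed settings, as in the sketch below. The `EngineSettings` class and `load_settings` function are hypothetical, and the fallback values simply mirror the sample above; the engine's real defaults are not documented here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineSettings:
    llm_provider: str
    llm_model_id: str
    max_query_rows: int
    query_timeout_seconds: int
    sql_readonly: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> EngineSettings:
    """Read EPISTOM_* variables, falling back to the sample values (an assumption)."""
    env = os.environ if env is None else env
    readonly = env.get("EPISTOM_SQL_READONLY", "true").strip().lower()
    return EngineSettings(
        llm_provider=env.get("EPISTOM_LLM_PROVIDER", "anthropic"),
        llm_model_id=env.get("EPISTOM_LLM_MODEL_ID", "claude-sonnet-4-5"),
        max_query_rows=int(env.get("EPISTOM_MAX_QUERY_ROWS", "1000")),
        query_timeout_seconds=int(env.get("EPISTOM_QUERY_TIMEOUT_SECONDS", "300")),
        sql_readonly=readonly in {"1", "true", "yes", "on"},
    )
```

Parsing once at startup means a malformed value such as `EPISTOM_MAX_QUERY_ROWS=abc` fails immediately with a `ValueError` rather than at query time.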
