Research Quality: Known Issues

Status: P0–P2 fixes implemented (2026-05-22)
Recorded: 2026-05-22
Triggering run: interactive query "transformer attention mechanisms"
Branch context: feat/multi-stage-research-pipeline

Summary

Reports could look factually wrong or off-topic even when the pipeline completed successfully. For the reference run, this was not primarily an Ollama/LLM accuracy failure: no LLM calls were made (llm_tokens_in: 0, synthesis.llm_enabled=false). The pipeline retrieved a mix of relevant and irrelevant papers, ranked some off-topic work highly, and assembled a misleading executive summary using heuristic synthesis and abstract snippet extraction.

The fixes below are query-agnostic — they use embedding similarity, adaptive corpus-relative thresholds, and generic keyword-collision detection rather than hardcoded NLP/ML domain lists. Multi-domain regression tests live in tests/test_research_quality.py.

Reference Run: Observed Symptoms (Pre-Fix)

Symptom	Example from output
Executive summary off-topic	Opens with “Air pollution poses a critical global public health challenge…” for an NLP/transformer query
Irrelevant papers in top results	Cervical cancer (CerviFormer), tea evapotranspiration, Arabic sign language, boring machining (PhyDT)
Suspicious canonical paper metadata	Attention Is All You Need listed as 2025 with DOI `10.65215/2q58a426`
Fragmented clustering	Many themes prefixed `Unclustered: …` (8 thin clusters)
Generic analysis placeholders	“Details inferred from abstract only”, “Full disagreement analysis requires LLM synthesis”
No relevance pruning	25 papers kept, 0 filtered by relevance stage
High retrieval volume	96 papers retrieved → 76 after dedup → 25 ranked

Pipeline duration: ~7.4s (11 stages, no failures)
Debug artifact: logs/debug/pipeline_*_20260522_*.json (query: transformer attention mechanisms)
Ranking top score: 0.8922 (misleadingly high for topical fit)

Fix Status

ID	Component	Status	What changed
RC-1	Query expansion	Fixed	Phrase-aware synonyms, Jaccard gate, broad-term guard; ML-specific templates removed
RC-2	Ranking	Fixed	Embedding weight 30%; embedding outlier + keyword-collision penalties (no ML branch)
RC-3	Relevance gate	Fixed	Adaptive embedding floor (percentile + gap-from-top); configurable `min_embedding_similarity`
RC-4	Synthesis & summary	Fixed	Top-quartile agreements; template executive summary wired to config thresholds
RC-5	Clustering	Fixed	Macro-cluster merge when HDBSCAN noise > 50%; `Theme:` labels instead of singleton spam
RC-6	Metadata / dedup	Fixed	Generic year/DOI sanity; `canonical_boost: 0.0` default (registry opt-in)
RC-7	LLM mode	Fixed	`llm_mode: auto | on | off` via `resolve_llm_features.py`

Architecture Context

Query → expansion (heuristic) → retrieval (OpenAlex + Semantic Scholar)
     → dedup → ranking → relevance_scoring → clustering → synthesis (heuristic)
     → gap_analysis (heuristic) → report_generation

Stage	LLM used?	Quality notes
Query expansion	Auto (off on 3B Ollama)	Heuristic variants with quality gates
Retrieval	No	Keyword/API search
Ranking	No	Weighted signals + embedding cache
Relevance	No	Adaptive embedding floor + concept match
Synthesis	Auto	Heuristic path uses embedding-aligned agreements
Report	No	Template summary + validated snippet

Root Cause Analysis (Historical)

These sections document why the reference run failed. Each maps to a fix above.

RC-1: Query expansion produces noisy search strings

Location: src/research/query_expansion.py

For the query "transformer attention mechanisms", heuristic expansion previously yielded redundant variants such as "attention mechanism attention mechanisms". Fixes: phrase-aware synonym replacement, a Jaccard overlap gate, a broad-term guard, and concept bigrams. There are no transformer- or attention-specific templates.
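A Jaccard overlap gate of this kind can be sketched as follows. This is a minimal illustration, not the code in query_expansion.py: the function names, the crude plural stripping, and the 0.6 threshold are all assumptions, and it assumes the gate rejects variants whose token set overlaps too heavily with the query or with a variant already kept.

```python
def _tokens(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip a trailing 's' (crude singularizer, assumed)."""
    return {t[:-1] if t.endswith("s") and len(t) > 3 else t for t in text.lower().split()}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def gate_variants(query: str, variants: list[str], max_overlap: float = 0.6) -> list[str]:
    """Keep only variants that are not near-duplicates of the query or of earlier variants.

    max_overlap is a hypothetical default, not a value taken from the project config.
    """
    kept: list[str] = []
    for variant in variants:
        if any(jaccard(variant, ref) > max_overlap for ref in [query, *kept]):
            continue  # too similar to something we already search for
        kept.append(variant)
    return kept
```

Under these assumptions, the redundant variant from the reference run shares two of three tokens with the query (overlap of about 0.67) and is dropped, while a variant with no shared tokens passes.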

RC-2: Ranking mis-weighted for topical precision

Location: src/research/ranking.py, config/default.yaml

Previously the embedding_similarity weight was 5% and semantic_relevance was keyword-only. The embedding weight is now 30%, with generic penalties for:

Partial core-concept coverage
Embedding outliers (far below top-5 mean similarity)
Keyword collision (high overlap, low embedding similarity)

Config: outlier_embedding_gap: 0.12, keyword_collision_max_sim: 0.40.
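The outlier and keyword-collision penalties can be illustrated with a short sketch. The two thresholds are the config values quoted above; the "high overlap" cutoff of 0.5, the multiplicative penalty of 0.5, and the function name are assumptions made for illustration and may differ from ranking.py.

```python
from statistics import mean

OUTLIER_EMBEDDING_GAP = 0.12      # ranking.outlier_embedding_gap (from the config above)
KEYWORD_COLLISION_MAX_SIM = 0.40  # ranking.keyword_collision_max_sim (from the config above)
HIGH_KEYWORD_OVERLAP = 0.5        # assumed cutoff for "high overlap"
PENALTY = 0.5                     # assumed multiplicative penalty per violated rule


def penalty_factors(embedding_sims, keyword_overlaps):
    """Return one multiplicative factor per paper (1.0 means no penalty)."""
    if not embedding_sims:
        return []
    # Reference point: mean similarity of the five most similar papers.
    top5_mean = mean(sorted(embedding_sims, reverse=True)[:5])
    factors = []
    for sim, overlap in zip(embedding_sims, keyword_overlaps):
        factor = 1.0
        if sim < top5_mean - OUTLIER_EMBEDDING_GAP:
            factor *= PENALTY  # embedding outlier: far below the top-5 mean
        if overlap >= HIGH_KEYWORD_OVERLAP and sim < KEYWORD_COLLISION_MAX_SIM:
            factor *= PENALTY  # keyword collision: shared words, different meaning
        factors.append(factor)
    return factors
```

In this sketch a paper that matches many query keywords but sits at 0.30 similarity, against a top-5 mean of 0.76, triggers both rules and has its score multiplied by 0.25.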

RC-3: Relevance filter was effectively disabled

Location: src/research/relevance_scoring.py

Previously papers were filtered only when rank_score < 0.05. The stage now uses a composite gate with an adaptive embedding floor:

floor = max(min_embedding_similarity, percentile(sims, keep_percentile), top_sim - gap_from_top)

The percentile term is skipped when the corpus has fewer than 8 papers. Default min_embedding_similarity: 0.35.
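The floor formula can be written out as a small function. This is a sketch under stated assumptions: the 0.35 base floor and the corpus-size cutoff of 8 come from the text above, while the defaults for keep_percentile (25) and gap_from_top (0.25), the linear-interpolation percentile, and the function names are illustrative choices.

```python
def percentile(values, q):
    """Percentile with linear interpolation between neighbours; q is in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def adaptive_floor(sims, min_embedding_similarity=0.35, keep_percentile=25.0,
                   gap_from_top=0.25, min_corpus_for_percentile=8):
    """floor = max(base floor, percentile of sims, top similarity minus a gap).

    keep_percentile and gap_from_top defaults are hypothetical values.
    """
    candidates = [min_embedding_similarity]
    if sims:
        candidates.append(max(sims) - gap_from_top)
        # The percentile is unstable on tiny corpora, so it is skipped below the cutoff.
        if len(sims) >= min_corpus_for_percentile:
            candidates.append(percentile(sims, keep_percentile))
    return max(candidates)
```

With eight similarities spread evenly from 0.2 to 0.9, the three candidates are 0.35, 0.375, and 0.65, so the gap-from-top term sets the floor. With only two papers at 0.5 and 0.4, the percentile is skipped and the base floor of 0.35 applies.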

RC-4: Heuristic synthesis built misleading narratives

Location: src/analysis/synthesis.py, src/reporting/report_generation.py

Previously agreements[0] from rank order drove the executive summary. Now:

Agreements drawn from top embedding-similarity quartile
Executive summary uses query + cluster themes template
Validated snippet only when embedding similarity ≥ configured floor
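The three rules above might look roughly like this in code. The dictionary keys (embedding_similarity, snippet), the function names, and the wording of the summary template are hypothetical; only the top-quartile selection and the floor check on the snippet are taken from the description.

```python
import math


def top_quartile(papers):
    """Return the top 25% of papers by embedding similarity (at least one if any exist)."""
    ranked = sorted(papers, key=lambda p: p["embedding_similarity"], reverse=True)
    if not ranked:
        return []
    return ranked[: max(1, math.ceil(len(ranked) / 4))]


def executive_summary(query, themes, lead_paper, floor):
    """Template summary built from the query and cluster themes.

    The abstract snippet is appended only when the lead paper clears the floor.
    """
    summary = f"Research on '{query}' clusters around: {', '.join(themes)}."
    if lead_paper and lead_paper["embedding_similarity"] >= floor:
        summary += f" Representative finding: {lead_paper['snippet']}"
    return summary
```

The point of the design is that an off-topic paper can no longer open the summary just because it ranked first: a lead paper below the floor contributes no snippet, and the summary falls back to the query and themes alone.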

RC-5: Clustering fragmented into “Unclustered” singletons

Location: src/research/clustering.py

When HDBSCAN noise ratio > noise_merge_threshold (0.5), noise papers merge into ≤4 keyword macro-clusters labeled Theme: ….
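A minimal sketch of the merge step follows, assuming HDBSCAN's usual convention of labelling noise points -1 and assuming each paper carries a list of extracted keywords. The grouping heuristic (assign each noise paper to the most frequent keyword it mentions) and the names are illustrative and may not match clustering.py.

```python
from collections import Counter

NOISE = -1  # label HDBSCAN assigns to points it cannot place in a cluster


def merge_noise(labels, keywords, noise_merge_threshold=0.5, max_macro=4):
    """Group noise papers into at most max_macro keyword themes when noise dominates.

    labels[i] is the cluster label of paper i; keywords[i] is its keyword list.
    Returns {"Theme: <keyword>": [paper indices]}, or {} if the noise ratio is low.
    """
    noise_idx = [i for i, label in enumerate(labels) if label == NOISE]
    if not labels or len(noise_idx) / len(labels) <= noise_merge_threshold:
        return {}
    # Count each keyword once per paper; dict.fromkeys dedupes while keeping order.
    counts = Counter(kw for i in noise_idx for kw in dict.fromkeys(keywords[i]))
    themes = [kw for kw, _ in counts.most_common(max_macro)]
    groups = {}
    for i in noise_idx:
        # First matching theme wins; papers matching none fall back to the top theme.
        fallback = themes[0] if themes else "misc"
        theme = next((t for t in themes if t in keywords[i]), fallback)
        groups.setdefault(f"Theme: {theme}", []).append(i)
    return groups
```

Because every noise paper is assigned to one of at most max_macro themes, the number of macro-clusters is bounded regardless of how many singletons HDBSCAN produced.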

RC-6: Scholarly metadata quality

Location: src/research/metadata_sanity.py, src/retrieval/deduplication.py

Generic rules: future-year correction, DOI format checks, richer duplicate preference. Canonical work registry is opt-in (canonical_boost: 0.0 default).
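As an illustration of what generic year and DOI checks can look like, here is a small sketch. The record shape, the choice to null out implausible values, and the DOI pattern (prefix 10., a 4 to 9 digit registrant code, a slash, then a non-empty suffix) are assumptions, not the rules in metadata_sanity.py.

```python
import re
from datetime import date

# Assumed format check: "10." + 4-9 digit registrant code + "/" + non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def sanitize(record, today=None):
    """Return a copy of the record with implausible year/DOI values set to None."""
    today = today or date.today()
    cleaned = dict(record)
    year = cleaned.get("year")
    if isinstance(year, int) and year > today.year:
        cleaned["year"] = None  # assumed correction: drop years in the future
    doi = cleaned.get("doi")
    if doi is not None and not DOI_PATTERN.match(doi.strip()):
        cleaned["doi"] = None  # malformed DOI
    return cleaned
```

Note that a format check alone is a weak signal: the DOI `10.65215/2q58a426` from the reference run is well-formed under this pattern, so catching a mislabelled record of that kind would need the richer duplicate preference or the opt-in registry.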

RC-7: LLM synthesis disabled by default

Location: src/config/resolve_llm_features.py, config/default.yaml

llm_mode is tri-state (auto | on | off). In auto mode, LLM features are enabled on cloud providers and on capable local models (8B+). Small Ollama models (e.g. llama3.2:3b) stay heuristic-only unless explicitly overridden.
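The resolution logic can be sketched as below. This is not the contents of resolve_llm_features.py: the function names, the provider set, the parsing of the parameter count from the model tag, and the choice to treat local models of unknown size as not capable are all assumptions.

```python
import re

CLOUD_PROVIDERS = {"openai", "anthropic"}  # assumed set, taken from the Quality Modes table
MIN_LOCAL_PARAMS_B = 8.0                   # "capable local models (8B+)"


def parse_param_count_b(model):
    """Extract the size in billions from a tag such as 'llama3.2:3b'; None if absent."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", model.lower())
    return float(match.group(1)) if match else None


def resolve_llm_enabled(llm_mode, provider, model=""):
    """Map the tri-state mode to a boolean for the given provider and model."""
    if llm_mode == "on":
        return True
    if llm_mode == "off":
        return False
    if llm_mode != "auto":
        raise ValueError(f"unknown llm_mode: {llm_mode!r}")
    if provider in CLOUD_PROVIDERS:
        return True
    size = parse_param_count_b(model)
    # Assumption: a local model whose size cannot be determined stays heuristic-only.
    return size is not None and size >= MIN_LOCAL_PARAMS_B
```

Under these assumptions, `resolve_llm_enabled("auto", "ollama", "llama3.2:3b")` is False and the same call with `llama3.1:8b` is True, matching the Heuristic-only and Balanced local rows of the Quality Modes table.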

Quality Modes

Mode	Typical setup	Synthesis	Expansion
Heuristic-only	`llama3.2:3b`, `llm_mode: auto`	Off	Off
Balanced local	`llama3.1:8b`, `llm_mode: auto`	On	On
Cloud	`openai` / `anthropic`, `llm_mode: auto`	On	On

Override with RA_SYNTHESIS__LLM_MODE=on|off|auto or legacy RA_SYNTHESIS__LLM_ENABLED=true|false.

Reproduction Checklist

Run multi-domain regression (no network):

pipenv run pytest tests/test_research_quality.py -q

Full pipeline smoke test:

pipenv run python -m src "transformer attention mechanisms"

Expect (post-fix):

[ ] Executive summary uses query + cluster themes; no off-topic decoy domain in lead
[ ] Keyword-collision decoys demoted in ranking and filtered by relevance
[ ] High HDBSCAN noise yields ≤4 Theme: macro-clusters, not N singletons
[ ] llm_tokens_in: 0 when running heuristic-only mode on 3B model

Inspect expansion:

pipenv run python -c "
from src.research.query_expansion import expand_query_heuristic, extract_core_concepts
q = 'transformer attention mechanisms'
print(expand_query_heuristic(q, extract_core_concepts(q)))
"

Setting	Default	Quality impact
`synthesis.llm_mode`	`auto`	Model-aware LLM enable (RC-7)
`query_expansion.llm_mode`	`auto`	Model-aware expansion (RC-1)
`ranking.weights.embedding_similarity`	`0.30`	Primary semantic steering (RC-2)
`ranking.outlier_embedding_gap`	`0.12`	Demote embedding outliers (RC-2)
`ranking.keyword_collision_max_sim`	`0.40`	Keyword-without-semantics penalty (RC-2)
`ranking.canonical_boost`	`0.0`	Opt-in landmark boost (RC-6)
`relevance_scoring.min_embedding_similarity`	`0.35`	Base embedding floor (RC-3)
`relevance_scoring.adaptive_embedding`	`true`	Corpus-relative floor (RC-3)
`clustering.noise_merge_threshold`	`0.5`	Macro-cluster merge trigger (RC-5)

Environment overrides:

RA_RELEVANCE_SCORING__MIN_EMBEDDING_SIMILARITY
RA_SYNTHESIS__LLM_MODE=auto
RA_RANKING__CANONICAL_BOOST=0.05 (opt-in)
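The variable names suggest a convention in which the RA_ prefix is stripped and a double underscore separates nesting levels. The sketch below shows how such names could map onto the settings in the table; it is an assumption about the naming scheme, not the project's actual loader, and the coercion rules are deliberately simple.

```python
def _coerce(raw):
    """Turn 'true'/'false' into bool and numeric strings into float; keep other strings."""
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return float(raw)
    except ValueError:
        return raw


def apply_env_overrides(config, environ, prefix="RA_"):
    """Apply PREFIX + SECTION__KEY=value overrides to a nested dict, in place."""
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variable
        path = name[len(prefix):].lower().split("__")
        node = config
        for key in path[:-1]:
            node = node.setdefault(key, {})  # create missing sections on the way down
        node[path[-1]] = _coerce(raw)
    return config
```

For example, passing `{"RA_RANKING__CANONICAL_BOOST": "0.05"}` sets `config["ranking"]["canonical_boost"]` to 0.05, which corresponds to the opt-in landmark boost described above.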

Change Log

Date	Action
2026-05-22	Initial documentation from interactive run analysis
2026-05-22	P0–P2 fixes implemented; multi-domain tests in `test_research_quality.py`