Testing¶

The pytest suite covers configuration, pipeline stages, retrieval providers, the LLM layer, the CLI, and extensibility. All tests run offline against mocks, so CI makes no live LLM or scholarly API calls.

Source: tests/test_*.py (28 files). Internal index: docs/_analysis/test-behavior-index.md (repo-only; not published on the docs site).

Run tests¶

pipenv install --dev
pipenv run pytest                    # full suite
pipenv run pytest -v                 # verbose
pipenv run pytest -m "not slow"      # skip subprocess integration tests
pipenv run pytest tests/test_synthesis.py -v   # single file

Async tests use pytest-asyncio (configured for auto mode on async test functions).

Test map by domain¶

Domain	Files	What they verify
Config / LLM resolution	`test_config_settings.py`, `test_resolve_llm_features.py`, `test_model_selection.py`	YAML merge, env overrides, auto LLM flags, Ollama catalog selection
Pipeline core	`test_pipeline_core.py`, `test_paper_adapters.py`	Stage ordering, partial failure, disabled stages, metrics
Research stages	`test_research_stages.py`, `test_research_quality.py`	Expansion, ranking, relevance, clustering, dedup; multi-domain quality
Retrieval	`test_retrieval_stage.py`, `test_providers.py`	Provider failure tolerance, normalization, health checks
Synthesis / gaps	`test_synthesis.py`	Heuristic + LLM paths, stage recovery, timeout handling
Reporting	`test_reporting.py`, `test_export.py`	Markdown/JSON/HTML, citations, executive summary
LLM providers	`test_llm_providers.py`, `test_graceful_response_handling.py`	Provider registry, base URL normalization, JSON retry/fallback utils
CLI / interactive	`test_main_mode_detection.py`, `test_interactive_mode.py`, `test_complete_workflow.py`, others	Mode detection, session UX, subprocess flows
Memory / filters	`test_memory.py`, `test_interactive_filters.py`	SQLite sessions, follow-up filters
Extensibility	`test_phase3_extensibility.py`	Registry bootstrap, stub providers, API scaffold, events
Progress	`test_progress_reporter.py`	TTY detection, stage labels

Mocking strategy¶

LLM calls¶

All LLM integration tests mock pydantic-ai — no Ollama or cloud API required:

Pattern	Example location
`patch create_llm_agent`	`test_synthesis.py`
`patch` OpenAI/Pydantic AI constructors	`test_llm_providers.py`
`MagicMock(EnhancedResponseHandler)`	synthesis workflow tests

This keeps CI fast and deterministic. Manual LLM verification uses the CLI with real providers.

Retrieval providers¶

Pattern	Purpose
`SuccessProvider` / `FailingProvider` / `EmptyProvider` stubs	Stage-level retrieval tests
`patch get_enabled_providers`	Control which providers run
`AsyncMock` aiohttp sessions	Provider health check tests
Normalization unit tests	Raw API payload → `RetrievedPaper` mapping
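
The stub-provider approach might look like the sketch below. The class names match the table, but their bodies, the `search` method, and the `retrieve_all` helper are assumptions made for illustration. The real stage's interface may differ.

```python
import asyncio
from typing import Dict, List, Tuple


class SuccessProvider:
    name = "success"

    async def search(self, query: str) -> List[dict]:
        return [{"title": f"Paper about {query}"}]


class FailingProvider:
    name = "failing"

    async def search(self, query: str) -> List[dict]:
        raise RuntimeError("simulated outage")


class EmptyProvider:
    name = "empty"

    async def search(self, query: str) -> List[dict]:
        return []


async def retrieve_all(providers, query: str) -> Tuple[List[dict], Dict[str, str]]:
    """Hypothetical stage logic: query every provider, tolerate failures."""
    outcomes = await asyncio.gather(
        *(p.search(query) for p in providers), return_exceptions=True
    )
    papers: List[dict] = []
    errors: Dict[str, str] = {}
    for provider, outcome in zip(providers, outcomes):
        if isinstance(outcome, Exception):
            errors[provider.name] = str(outcome)  # record the failure, keep going
        else:
            papers.extend(outcome)
    return papers, errors
```

A stage-level test can then assert that one failing provider does not discard results from the healthy ones.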

Embeddings¶

Fixture	Purpose
`FixedEmbeddingProvider`	Deterministic vectors for ranking/relevance/quality tests
`MockEmbeddingProvider`	Lightweight stub for stage tests
`patch.object(provider, "_load_model")`	Skip sentence-transformers model load
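
As a rough illustration of why a fixed provider makes ranking tests reproducible, the sketch below derives vectors from a hash of the input text. The class name is borrowed from the table, but this implementation, the `embed` method, and the `dim` parameter are assumptions. The project's fixture may assign vectors differently.

```python
import hashlib
import math
from typing import List


class FixedEmbeddingProvider:
    """Illustrative stand-in: maps text to a reproducible unit vector."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Centre each byte around zero; a byte is never 127.5, so the norm is nonzero.
        raw = [byte / 255.0 - 0.5 for byte in digest[: self.dim]]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]


def cosine(a: List[float], b: List[float]) -> float:
    # Inputs are unit-length, so the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))
```

Because no model is loaded, the same text yields the same vector on every machine and every run.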

Pipeline stubs¶

Stub	Purpose
`EchoStage`, `FailingStage`, `PartialStage`	Pipeline core behavior
`RetrievalStub`	End-to-end stage chain without HTTP
`mock_pipeline_result` (`tests/helpers/pipeline_mocks.py`)	Orchestrator output tests
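
A simplified picture of how stage stubs exercise partial-failure handling is given below. `EchoStage` and `FailingStage` are names from the table. `StageResult`, `run_pipeline`, and the single-argument `run` signature are hypothetical and much smaller than a real pipeline core.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class StageResult:
    name: str
    ok: bool
    output: Any = None
    error: Optional[str] = None


class EchoStage:
    name = "echo"

    async def run(self, data: Any) -> Any:
        return data  # passes its input through unchanged


class FailingStage:
    name = "failing"

    async def run(self, data: Any) -> Any:
        raise RuntimeError("simulated stage failure")


async def run_pipeline(stages, data: Any) -> List[StageResult]:
    """Run stages in order; a failed stage is recorded and its input carried forward."""
    results: List[StageResult] = []
    for stage in stages:
        try:
            data = await stage.run(data)
            results.append(StageResult(stage.name, True, output=data))
        except Exception as exc:
            results.append(StageResult(stage.name, False, error=str(exc)))
    return results
```

With stubs like these, ordering and failure-recovery assertions need no HTTP or model access.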

CLI / subprocess¶

Pattern	Notes
`patch sys.argv` + `patch asyncio.run`	Unit-test `__main__` without subprocess
`@pytest.mark.slow` subprocess tests	`test_complete_workflow.py` — real `python -m src`
`capsys`	Assert stdout/stderr formatting

Skip slow tests in quick edit-test loops: pipenv run pytest -m "not slow".
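
The `sys.argv` and `asyncio.run` patching pattern can be sketched without a subprocess as shown below. The `main` function, its argument parser, and `run_pipeline` are hypothetical and do not reflect the project's actual entry point. Output is captured with `contextlib.redirect_stdout` so the sketch runs outside pytest. Inside a test, `capsys` serves the same purpose.

```python
import argparse
import asyncio
import io
import sys
from contextlib import redirect_stdout
from unittest.mock import patch


async def run_pipeline(query: str) -> int:
    raise RuntimeError("the real pipeline should not run in a unit test")


def main() -> int:
    """Hypothetical entry point: no query means interactive mode."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("query", nargs="?")
    args = parser.parse_args()  # reads sys.argv[1:]
    if args.query is None:
        print("mode=interactive")
        return 0
    print("mode=one-shot")
    return asyncio.run(run_pipeline(args.query))


def _fake_asyncio_run(coro) -> int:
    coro.close()  # avoid a "coroutine was never awaited" warning
    return 0


def run_cli(argv):
    """Invoke main() with a patched argv and event loop; return (code, stdout, calls)."""
    buffer = io.StringIO()
    with patch.object(sys, "argv", ["demo", *argv]):
        with patch.object(asyncio, "run", side_effect=_fake_asyncio_run) as mocked_run:
            with redirect_stdout(buffer):
                code = main()
    return code, buffer.getvalue(), mocked_run.call_count
```

Closing the coroutine inside the fake keeps the test output free of warnings while still proving that `main` reached the `asyncio.run` call.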

Key test behaviors¶

LLM feature resolution (`test_resolve_llm_features.py`)¶

Test	Confirms
`test_llm_mode_auto_8b`	Ollama 8B + auto → LLM on, `max_llm_papers=5`
`test_llm_mode_auto_3b`	Ollama 3B + auto → LLM off
`test_cloud_provider_auto_enables_llm`	OpenAI + auto → LLM on
`test_env_llm_enabled_overrides_mode`	Env bool beats `llm_mode: off`
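
The four rows above imply a precedence order that can be written down as a small function. The sketch below is only a reading of the table. The function signature, the 8B cutoff for local models, and the reuse of `max_llm_papers=5` outside the Ollama 8B case are assumptions, not the project's resolver.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff: the table only shows 8B enabling and 3B disabling.
MIN_LOCAL_MODEL_SIZE_B = 8.0
# Value taken from the 8B row; applying it to other cases is an assumption.
DEFAULT_MAX_LLM_PAPERS = 5


@dataclass(frozen=True)
class LLMFeatures:
    enabled: bool
    max_llm_papers: int


def resolve_llm_features(
    provider: str,
    model_size_b: Optional[float] = None,
    llm_mode: str = "auto",
    env_llm_enabled: Optional[bool] = None,
) -> LLMFeatures:
    if env_llm_enabled is not None:
        enabled = env_llm_enabled  # explicit env boolean wins over llm_mode
    elif llm_mode in ("on", "off"):
        enabled = llm_mode == "on"
    elif provider == "ollama":
        # auto mode: small local models fall back to heuristics
        enabled = model_size_b is not None and model_size_b >= MIN_LOCAL_MODEL_SIZE_B
    else:
        enabled = True  # auto mode with a cloud provider
    return LLMFeatures(enabled, DEFAULT_MAX_LLM_PAPERS if enabled else 0)
```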

Multi-domain quality (`test_research_quality.py`)¶

Parametrized cases across NLP, biomedical, climate, and economics domains:

No degenerate query variants
Embedding outlier demotes homonym decoys
Adaptive relevance filter drops off-topic papers
Executive summary excludes decoy terms
No hardcoded ML-specific branch constants in source

Extensibility (`test_phase3_extensibility.py`)¶

Stub providers (PubMed, CORE, DBLP) are registered but raise NotImplementedError on search
bootstrap_default_plugins() registers 7 providers + 11 stages
StageEventCollector fires start/complete events
FastAPI create_app requires optional dependency
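
The "registered but not implemented" behavior for stub providers can be illustrated with a toy registry. `ProviderRegistry`, `PubMedStubProvider`, and their methods are hypothetical names chosen for this sketch. The project's registry API is not shown on this page.

```python
import asyncio
from typing import Callable, Dict, List


class ProviderRegistry:
    """Toy registry: maps a provider name to a zero-argument factory."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}

    def register(self, name: str, factory: Callable[[], object]) -> None:
        if name in self._factories:
            raise ValueError(f"provider already registered: {name}")
        self._factories[name] = factory

    def create(self, name: str) -> object:
        return self._factories[name]()

    def names(self) -> List[str]:
        return sorted(self._factories)


class PubMedStubProvider:
    """Scaffold only: discoverable through the registry, not yet searchable."""

    async def search(self, query: str):
        raise NotImplementedError("PubMed search is not implemented yet")


def search_is_unimplemented(registry: ProviderRegistry, name: str) -> bool:
    provider = registry.create(name)
    try:
        asyncio.run(provider.search("crispr"))
    except NotImplementedError:
        return True
    return False
```

A test in this style confirms two separate things: the stub appears in the registry, and calling it fails loudly instead of silently returning no papers.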

Fixtures and helpers¶

Path	Role
`tests/helpers/pipeline_mocks.py`	`mock_pipeline_result()` for orchestrator tests
`catalog_dir` fixture	Temp `ollama_models.yaml` for selection tests
`temp_config_dir` fixture	YAML overlay merge tests
`memory_store` fixture	Tmp SQLite for session tests

Coverage gaps¶

Known gaps in the suite. Update this table when adding tests:

Gap	Detail
Query understanding	No dedicated unit test file
API routes	Scaffold tests only — no HTTP integration tests
`EnhancedResponseHandler`	Subcomponents tested; not end-to-end
Live LLM / API	All mocked in unit tests
Subprocess tests	Marked `@pytest.mark.slow`; may be skipped in time-constrained CI runs

Writing new tests¶

Import style: use absolute imports (from src.module import ...) in tests — see Import conventions.
Async stages: mark with @pytest.mark.asyncio; under pytest-asyncio auto mode the marker is optional, but it keeps the intent explicit.
Avoid live network: mock aiohttp or patch provider classes.
Deterministic embeddings: prefer FixedEmbeddingProvider over real sentence-transformers loads.
Config isolation: use AppSettings(...) kwargs or monkeypatch for env — do not rely on developer .env.
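
The config isolation rule above can be sketched with the standard library alone. The `LLM_ENABLED` variable name and the `load_llm_enabled` helper are hypothetical. In a pytest test, `monkeypatch.setenv` and `monkeypatch.delenv` achieve the same effect.

```python
import os
from typing import Dict
from unittest.mock import patch


def load_llm_enabled(default: bool = False) -> bool:
    """Hypothetical settings reader: parse a boolean from the environment."""
    raw = os.environ.get("LLM_ENABLED")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def llm_enabled_with_env(env: Dict[str, str]) -> bool:
    # clear=True hides every variable from the developer's shell or .env,
    # so the result depends only on the values passed in here.
    with patch.dict(os.environ, env, clear=True):
        return load_llm_enabled()
```

The original environment is restored when the `with` block exits, so one test cannot leak settings into the next.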

Example minimal stage test:

import pytest
from src.config.settings import AppSettings
from src.core.context import PipelineContext
from src.research.query_expansion import QueryExpansionStage

@pytest.mark.asyncio
async def test_expansion_produces_variants() -> None:
    stage = QueryExpansionStage()
    ctx = PipelineContext(settings=AppSettings(), query="machine learning")
    result = await stage.run(ctx, "machine learning")
    assert len(result.output.variants) >= 1

Related pages:

Local development setup: install and run commands
Extensibility: registry patterns tested in the phase3 tests
Heuristic vs LLM: behavior under test in the synthesis and resolve tests