Configuration Cookbook

Copy-paste recipes for common setups. Each recipe shows environment variables and equivalent YAML where applicable.

Precedence reminder: env > .env > YAML > defaults. See Configuration precedence.
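The precedence chain can be pictured as a layered lookup in which the first layer that defines a key wins. The following is a minimal sketch of that idea, not the project's loader: the `resolve` function and the flat dotted keys are hypothetical and used only for illustration.

```python
from collections import ChainMap


def resolve(defaults, yaml_cfg, dotenv, env):
    """Merge four config layers; earlier ChainMap arguments take priority."""
    # ChainMap searches its maps left to right, so env beats .env,
    # .env beats YAML, and YAML beats the built-in defaults.
    return dict(ChainMap(env, dotenv, yaml_cfg, defaults))


settings = resolve(
    defaults={"llm.provider": "ollama", "llm.model": "auto"},
    yaml_cfg={"llm.model": "llama3.2:3b"},
    dotenv={"llm.model": "llama3.1:8b"},
    env={"llm.provider": "openai"},
)
print(settings["llm.provider"])  # openai (set only in env)
print(settings["llm.model"])     # llama3.1:8b (.env overrides YAML)
```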

Pick your entry point first

Batch CLI (python -m src "query") uses only OpenAlex and Semantic Scholar, regardless of the provider recipes below. For recipes that enable arXiv/CrossRef or extra providers, use interactive mode, the API, or programmatic run_research(). See CLI vs API.

1. Fast local (default heuristic)

Minimal config: Ollama with automatic model selection, heuristic synthesis, and only OpenAlex + Semantic Scholar on the CLI batch path.

.env:

RA_LLM__PROVIDER=ollama
RA_LLM__MODEL=auto
RA_SYNTHESIS__LLM_ENABLED=false
RA_QUERY_EXPANSION__LLM_ENABLED=false
RA_PIPELINE__STREAM_PROGRESS=true

Run: README Usage or CLI reference.

Best for: Quick scans, low RAM (llama3.2:3b), offline-after-setup use.
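The variable names in this recipe appear to follow a pattern: an `RA_` prefix, then a section and a key separated by a double underscore. As a rough sketch of how such names could map onto nested settings, the hypothetical `nest_env` helper below splits on that delimiter. It assumes the convention holds for every `RA_` variable and is not the project's own parser.

```python
def nest_env(environ, prefix="RA_", delimiter="__"):
    """Turn RA_SECTION__KEY=value pairs into a nested dict (illustrative only)."""
    config = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated variables such as PATH
        path = name[len(prefix):].lower().split(delimiter)
        node = config
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value  # values stay strings; type coercion is omitted
    return config


recipe = {
    "RA_LLM__PROVIDER": "ollama",
    "RA_LLM__MODEL": "auto",
    "RA_SYNTHESIS__LLM_ENABLED": "false",
    "PATH": "/usr/bin",
}
print(nest_env(recipe))
# {'llm': {'provider': 'ollama', 'model': 'auto'}, 'synthesis': {'llm_enabled': 'false'}}
```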

2. High-quality local (8B + LLM synthesis)

Enable LLM stages when llama3.1:8b is selected (catalog sets synthesis.llm_enabled: true for 8B).

.env:

RA_LLM__PROVIDER=ollama
RA_LLM__MODEL=llama3.1:8b
RA_SYNTHESIS__LLM_ENABLED=true
RA_QUERY_EXPANSION__LLM_ENABLED=true
RA_SYNTHESIS__MAX_LLM_PAPERS=5

Verify model: Setup system (health_check --model llama3.1:8b).

Best for: Richer synthesis on capable hardware (8–10 GB RAM).
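Independently of the project's health_check command, you can confirm that the model has been pulled by asking the Ollama server which models it has. The sketch below assumes Ollama is listening on its default local port and uses its `/api/tags` listing endpoint. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def model_available(tags_payload, model):
    """Return True if `model` appears in an Ollama /api/tags response."""
    names = {entry.get("name") for entry in tags_payload.get("models", [])}
    return model in names


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of locally available models (requires a running Ollama)."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


# Offline demonstration with a payload shaped like the endpoint's response.
sample = {"models": [{"name": "llama3.1:8b"}, {"name": "llama3.2:3b"}]}
print(model_available(sample, "llama3.1:8b"))  # True
```

With the server running, `model_available(fetch_tags(), "llama3.1:8b")` performs the same check against the live model list.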

3. Cloud OpenAI quality

Skip Ollama; use GPT for expansion and synthesis.

.env:

RA_LLM__PROVIDER=openai
RA_LLM__MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
RA_SYNTHESIS__LLM_ENABLED=true
RA_QUERY_EXPANSION__LLM_ENABLED=true
RA_RANKING__TOP_K=30

Run: CLI reference (same query invocations as README Usage).

Best for: Maximum quality without local GPU/RAM. API costs apply.
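Because this recipe depends on a secret, a small preflight check before a long run can save a failed job. The sketch below is hypothetical and not part of the project. It assumes that only the openai provider needs OPENAI_API_KEY and that the provider falls back to ollama when RA_LLM__PROVIDER is unset.

```python
import os


def missing_for_provider(environ):
    """List required variables that are absent for the selected LLM provider."""
    provider = environ.get("RA_LLM__PROVIDER", "ollama")  # assumed default
    required = {"openai": ["OPENAI_API_KEY"]}.get(provider, [])
    return [name for name in required if not environ.get(name)]


missing = missing_for_provider(os.environ)
if missing:
    print("Missing required variables:", ", ".join(missing))
```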

4. Enable arXiv + CrossRef (full pipeline)

Not available on CLI batch shortcut

Use interactive mode, API, or programmatic run_research() — not python -m src "query" alone.

YAML (config/providers.yaml):

concurrency_limit: 4
per_provider_limit: 10
providers:
  openalex:
    enabled: true
  semantic_scholar:
    enabled: true
  arxiv:
    enabled: true
  crossref:
    enabled: true

Equivalent .env:

RA_RETRIEVAL__PROVIDERS__ARXIV__ENABLED=true
RA_RETRIEVAL__PROVIDERS__CROSSREF__ENABLED=true
RA_RETRIEVAL__PER_PROVIDER_LIMIT=10
RA_CROSSREF_MAILTO=you@example.com
S2_API_KEY=optional_for_rate_limits
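The YAML and .env forms above line up key by key once the YAML keys are upper-cased and joined with double underscores. The sketch below illustrates that mapping. The `to_env` helper is hypothetical, and the `RA_RETRIEVAL` prefix is inferred from the variable names in this recipe rather than taken from the project's code.

```python
def to_env(config, prefix):
    """Flatten a nested dict into NAME=value lines using a double-underscore delimiter."""
    lines = []
    for key, value in config.items():
        name = f"{prefix}__{key.upper()}"
        if isinstance(value, dict):
            lines.extend(to_env(value, name))
        else:
            # Render booleans in lower case, matching the .env examples.
            text = str(value).lower() if isinstance(value, bool) else str(value)
            lines.append(f"{name}={text}")
    return lines


providers_yaml = {
    "per_provider_limit": 10,
    "providers": {"arxiv": {"enabled": True}, "crossref": {"enabled": True}},
}
for line in to_env(providers_yaml, "RA_RETRIEVAL"):
    print(line)
# RA_RETRIEVAL__PER_PROVIDER_LIMIT=10
# RA_RETRIEVAL__PROVIDERS__ARXIV__ENABLED=true
# RA_RETRIEVAL__PROVIDERS__CROSSREF__ENABLED=true
```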

Run (interactive — full settings): Interactive sessions (pipenv run python -m src with no query arg).

Or use the API / programmatic path. See CLI vs API.

5. Debug mode (inspect pipeline artifacts)

.env:

RA_PIPELINE__DEBUG=true
RA_PIPELINE__STREAM_PROGRESS=true
# Avoid duplicate: do not also set RA_DEBUG=1 unless intentional

Run and inspect: Logging and debug walkthrough, then:

ls -lt logs/debug/
jq '.stage_results | keys' logs/debug/pipeline_*.json
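If jq is not installed, the same inspection can be done from Python. This sketch assumes only what the commands above imply: dumps are JSON files named pipeline_*.json under logs/debug/, with a top-level stage_results object. The helper names are hypothetical.

```python
import json
from pathlib import Path


def latest_dump(debug_dir="logs/debug"):
    """Return the most recently modified pipeline dump, or None if there is none."""
    directory = Path(debug_dir)
    if not directory.is_dir():
        return None
    dumps = sorted(directory.glob("pipeline_*.json"), key=lambda p: p.stat().st_mtime)
    return dumps[-1] if dumps else None


def stage_names(dump_path):
    """List the keys of stage_results, sorted as jq's `keys` would return them."""
    data = json.loads(Path(dump_path).read_text(encoding="utf-8"))
    return sorted(data.get("stage_results", {}))


path = latest_dump()
if path is not None:
    print(path, stage_names(path))
```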

6. Retrieval-only smoke test

Disable analysis/report stages to validate API connectivity quickly.

YAML snippet (merge into default.yaml or use env):

pipeline:
  enabled_stages:
    query_understanding: true
    query_expansion: true
    retrieval: true
    deduplication: true
    ranking: true
    relevance_scoring: false
    clustering: false
    synthesis: false
    gap_analysis: false
    citation_export: false
    report_generation: false

Env alternative (disable later stages):

RA_PIPELINE__ENABLED_STAGES__RELEVANCE_SCORING=false
RA_PIPELINE__ENABLED_STAGES__CLUSTERING=false
RA_PIPELINE__ENABLED_STAGES__SYNTHESIS=false
RA_PIPELINE__ENABLED_STAGES__GAP_ANALYSIS=false
RA_PIPELINE__ENABLED_STAGES__CITATION_EXPORT=false
RA_PIPELINE__ENABLED_STAGES__REPORT_GENERATION=false

Use debug dumps to inspect artifacts.retrieved_papers.
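A smoke test usually reduces to one question: did retrieval return anything? The sketch below counts the entries under artifacts.retrieved_papers in a loaded debug dump. It assumes that path holds a JSON list at the top level of the dump. The `retrieved_count` helper and the sample record fields are hypothetical.

```python
import json


def retrieved_count(dump):
    """Count papers under artifacts.retrieved_papers; missing keys count as zero."""
    papers = dump.get("artifacts", {}).get("retrieved_papers") or []
    return len(papers)


# Stand-in for json.load() on a real logs/debug/pipeline_*.json file.
sample = json.loads('{"artifacts": {"retrieved_papers": [{"title": "A"}, {"title": "B"}]}}')
count = retrieved_count(sample)
print(count)  # 2
if count == 0:
    raise SystemExit("Smoke test failed: no papers retrieved")
```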

7. Session memory + retrieval cache

.env:

RA_MEMORY__CACHE_ENABLED=true
RA_MEMORY__DB_PATH=data/research.db

Interactive: Interactive sessions.

Repeating the same query in a new session may skip network retrieval when the config hash matches.
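To see why a config change invalidates cached retrieval, consider a cache key built from the query plus a hash of the configuration. The sketch below shows one common way to build such a key. The project's actual hashing scheme is not documented here, so the function names, the SHA-256 choice, and the key layout are all assumptions.

```python
import hashlib
import json


def config_hash(config):
    """Hash a config dict so that key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key(query, config):
    """Combine a normalized query with the config hash (hypothetical layout)."""
    return f"{query.strip().lower()}:{config_hash(config)}"


a = {"per_provider_limit": 10, "providers": {"arxiv": True}}
b = {"providers": {"arxiv": True}, "per_provider_limit": 10}  # same settings, reordered
c = {"per_provider_limit": 20, "providers": {"arxiv": True}}  # changed limit

print(cache_key("Graph Neural Networks ", a) == cache_key("graph neural networks", b))  # True
print(cache_key("graph neural networks", a) == cache_key("graph neural networks", c))   # False
```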

8. Quiet CI / scripting

Disable progress via env and CLI flag — see Progress streaming and CLI reference. Example redirect:

# Export so the settings reach the Python process when run from a shell
export RA_PIPELINE__STREAM_PROGRESS=false
export RA_PIPELINE__DEBUG=false
pipenv run python -m src --no-progress --format json "query" > report.json
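When the caller is itself a Python script, the same invocation can be wrapped with subprocess. This is a sketch under two assumptions: pipenv is on PATH, and with --format json the command writes only the JSON report to stdout. The helper names are hypothetical.

```python
import json
import os
import subprocess


def build_invocation(query, base_env=None):
    """Return the quiet-mode command and environment for one batch query."""
    env = dict(os.environ if base_env is None else base_env)
    env["RA_PIPELINE__STREAM_PROGRESS"] = "false"
    env["RA_PIPELINE__DEBUG"] = "false"
    command = ["pipenv", "run", "python", "-m", "src",
               "--no-progress", "--format", "json", query]
    return command, env


def run_query(query):
    """Run the batch CLI and parse its stdout as JSON (assumes clean JSON output)."""
    command, env = build_invocation(query)
    result = subprocess.run(command, env=env, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


command, _ = build_invocation("query")
print(" ".join(command))
```

Passing check=True makes a non-zero exit status raise CalledProcessError, which lets a CI job fail fast instead of parsing an empty report.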

Recipe picker

Goal	Recipe
Fastest offline run	#1 Fast local
Best local quality	#2 High-quality local
Best overall quality	#3 Cloud OpenAI
More paper sources	#4 arXiv + CrossRef
Diagnose failures	#5 Debug mode
Test APIs only	#6 Retrieval smoke test
Repeat queries faster	#7 Session cache
Run quietly in CI or scripts	#8 Quiet CI / scripting