Configuration Cookbook

Copy-paste recipes for common setups. Each recipe shows environment variables and equivalent YAML where applicable.

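Precedence reminder: env > .env > YAML > defaults. Process environment variables override .env values, which override YAML files, which override built-in defaults. See Configuration precedence.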
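The precedence chain can be pictured as a layered lookup. The sketch below is illustrative only: `resolve_settings`, the flat dotted keys, and the sample values are hypothetical and are not the project's actual loader.

```python
from collections import ChainMap


def resolve_settings(env, dotenv, yaml_values, defaults):
    """Merge four sources so that earlier ones win: env > .env > YAML > defaults."""
    # ChainMap searches its maps left to right, so the first source that
    # defines a key supplies the value.
    return dict(ChainMap(env, dotenv, yaml_values, defaults))


# Hypothetical sample values, chosen only to show which layer wins.
merged = resolve_settings(
    env={"llm.model": "llama3.1:8b"},
    dotenv={"llm.model": "auto", "llm.provider": "openai"},
    yaml_values={"ranking.top_k": 30},
    defaults={"llm.provider": "ollama", "llm.model": "auto", "ranking.top_k": 20},
)
print(merged["llm.model"])     # llama3.1:8b (process env beats .env)
print(merged["llm.provider"])  # openai (.env beats defaults)
print(merged["ranking.top_k"])  # 30 (YAML beats defaults)
```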

Pick your entry point first

Batch CLI (python -m src "query") uses only OpenAlex and Semantic Scholar, regardless of the provider recipes below. For recipes that enable arXiv/CrossRef or extra providers, use interactive mode, the API, or programmatic run_research(). See CLI vs API.


1. Fast local (default heuristic)

Minimal config: Ollama with the auto model setting, heuristic synthesis, and only OpenAlex + Semantic Scholar (S2) on the CLI batch path.

.env:

RA_LLM__PROVIDER=ollama
RA_LLM__MODEL=auto
RA_SYNTHESIS__LLM_ENABLED=false
RA_QUERY_EXPANSION__LLM_ENABLED=false
RA_PIPELINE__STREAM_PROGRESS=true

Run: README Usage or CLI reference.

Best for: Quick scans, low RAM (llama3.2:3b), offline-after-setup use.
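The variable names in these recipes follow a visible pattern: an RA_ prefix, then section and key separated by a double underscore (RA_LLM__PROVIDER corresponds to llm.provider). The sketch below shows that mapping under the assumption that the pattern holds for every setting; `env_to_nested` is a hypothetical helper, not part of the project, and it leaves values as plain strings.

```python
def env_to_nested(environ, prefix="RA_", delimiter="__"):
    """Turn RA_SECTION__KEY=value pairs into a nested dict with lowercase keys."""
    nested = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated variables such as PATH
        parts = name[len(prefix):].lower().split(delimiter)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value  # values are kept as strings here
    return nested


recipe_one = {
    "RA_LLM__PROVIDER": "ollama",
    "RA_LLM__MODEL": "auto",
    "RA_SYNTHESIS__LLM_ENABLED": "false",
}
print(env_to_nested(recipe_one))
```

A real loader would also coerce "true"/"false" and numeric strings to typed values; that step is omitted here.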


2. High-quality local (8B + LLM synthesis)

Enable LLM stages when llama3.1:8b is selected (catalog sets synthesis.llm_enabled: true for 8B).

.env:

RA_LLM__PROVIDER=ollama
RA_LLM__MODEL=llama3.1:8b
RA_SYNTHESIS__LLM_ENABLED=true
RA_QUERY_EXPANSION__LLM_ENABLED=true
RA_SYNTHESIS__MAX_LLM_PAPERS=5

Verify model: Setup system (health_check --model llama3.1:8b).

Best for: Richer synthesis on capable hardware (8–10 GB RAM).


3. Cloud OpenAI quality

Skip Ollama; use GPT for expansion and synthesis.

.env:

RA_LLM__PROVIDER=openai
RA_LLM__MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
RA_SYNTHESIS__LLM_ENABLED=true
RA_QUERY_EXPANSION__LLM_ENABLED=true
RA_RANKING__TOP_K=30

Run: CLI reference (same query invocations as README Usage).

Best for: Maximum quality without local GPU/RAM. API costs apply.


4. Enable arXiv + CrossRef (full pipeline)

Not available on the CLI batch shortcut

Use interactive mode, the API, or programmatic run_research(), not python -m src "query" alone.

YAML (config/providers.yaml):

concurrency_limit: 4
per_provider_limit: 10
providers:
  openalex:
    enabled: true
  semantic_scholar:
    enabled: true
  arxiv:
    enabled: true
  crossref:
    enabled: true

Equivalent .env (also sets a CrossRef contact email and an optional Semantic Scholar API key):

RA_RETRIEVAL__PROVIDERS__ARXIV__ENABLED=true
RA_RETRIEVAL__PROVIDERS__CROSSREF__ENABLED=true
RA_RETRIEVAL__PER_PROVIDER_LIMIT=10
RA_CROSSREF_MAILTO=you@example.com
S2_API_KEY=optional_for_rate_limits

Run (interactive, full settings): see Interactive sessions (pipenv run python -m src with no query argument).

Or use the API / programmatic path. See CLI vs API.
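To see how nested YAML keys line up with the .env names in this recipe, the sketch below flattens a nested dict into RA_-prefixed lines. It assumes the provider settings live under a top-level retrieval section, as the RA_RETRIEVAL__... names suggest; `nested_to_env` is a hypothetical helper, not a project function.

```python
def nested_to_env(settings, prefix="RA_", delimiter="__"):
    """Flatten a nested settings dict into KEY=value lines in .env style."""
    lines = []

    def walk(node, path):
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, path + [key])
                continue
            # Booleans are written lowercase to match the recipes on this page.
            text = str(value).lower() if isinstance(value, bool) else str(value)
            name = prefix + delimiter.join(path + [key]).upper()
            lines.append(f"{name}={text}")

    walk(settings, [])
    return lines


settings = {
    "retrieval": {
        "per_provider_limit": 10,
        "providers": {"arxiv": {"enabled": True}, "crossref": {"enabled": True}},
    }
}
for line in nested_to_env(settings):
    print(line)
```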


5. Debug mode (inspect pipeline artifacts)

.env:

RA_PIPELINE__DEBUG=true
RA_PIPELINE__STREAM_PROGRESS=true
# Avoid duplicate: do not also set RA_DEBUG=1 unless intentional

Run and inspect: follow the Logging and debug walkthrough, then:

ls -lt logs/debug/
jq '.stage_results | keys' logs/debug/pipeline_*.json
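Where jq is not installed, the same inspection can be done in Python. This is a minimal sketch that assumes the dump layout implied by the jq command above (files named pipeline_*.json under logs/debug/ with a top-level stage_results object); the function names are hypothetical.

```python
import json
from pathlib import Path


def latest_debug_dump(debug_dir="logs/debug"):
    """Return the most recently modified pipeline_*.json, or None if there is none."""
    dumps = sorted(Path(debug_dir).glob("pipeline_*.json"),
                   key=lambda path: path.stat().st_mtime)
    return dumps[-1] if dumps else None


def stage_names(dump_path):
    """List the keys of stage_results in sorted order, as jq's keys does."""
    with open(dump_path, encoding="utf-8") as handle:
        data = json.load(handle)
    return sorted(data.get("stage_results", {}))


if __name__ == "__main__":
    newest = latest_debug_dump()
    if newest is None:
        print("No debug dumps found; is RA_PIPELINE__DEBUG=true set?")
    else:
        print(newest, stage_names(newest))
```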

6. Retrieval-only smoke test

Disable analysis/report stages to validate API connectivity quickly.

YAML snippet (merge into default.yaml or use env):

pipeline:
  enabled_stages:
    query_understanding: true
    query_expansion: true
    retrieval: true
    deduplication: true
    ranking: true
    relevance_scoring: false
    clustering: false
    synthesis: false
    gap_analysis: false
    citation_export: false
    report_generation: false

Env alternative (disable later stages):

RA_PIPELINE__ENABLED_STAGES__RELEVANCE_SCORING=false
RA_PIPELINE__ENABLED_STAGES__CLUSTERING=false
RA_PIPELINE__ENABLED_STAGES__SYNTHESIS=false
RA_PIPELINE__ENABLED_STAGES__GAP_ANALYSIS=false
RA_PIPELINE__ENABLED_STAGES__CITATION_EXPORT=false
RA_PIPELINE__ENABLED_STAGES__REPORT_GENERATION=false

Enable debug mode (recipe 5) and inspect artifacts.retrieved_papers in the resulting debug dump.
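A smoke test usually only needs to know whether any papers came back. The sketch below assumes the dump has a top-level artifacts object whose retrieved_papers entry is a list; that layout is inferred from the artifacts.retrieved_papers path and is not confirmed here. `retrieved_paper_count` is a hypothetical helper.

```python
import json


def retrieved_paper_count(dump_path):
    """Count entries under artifacts.retrieved_papers in one debug dump file."""
    with open(dump_path, encoding="utf-8") as handle:
        dump = json.load(handle)
    # Treat a missing or null entry as zero papers rather than raising.
    papers = dump.get("artifacts", {}).get("retrieved_papers") or []
    return len(papers)
```

A count of zero with all retrieval providers enabled points at connectivity or API key problems rather than at the later pipeline stages.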


7. Session memory + retrieval cache

.env:

RA_MEMORY__CACHE_ENABLED=true
RA_MEMORY__DB_PATH=data/research.db

Interactive: see Interactive sessions.

Repeating the same query in a new session may skip network retrieval when the config hash matches.
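One way to picture the config-hash check is a cache key built from the query and a stable digest of the settings. The sketch below is an assumption about how such a key could be formed, not the project's actual scheme; `config_hash` and `cache_key` are hypothetical names.

```python
import hashlib
import json


def config_hash(config):
    """Digest a settings dict so that key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key(query, config):
    """Combine the config digest with a whitespace- and case-normalized query."""
    normalized = " ".join(query.lower().split())
    return f"{config_hash(config)}:{normalized}"
```

Under this scheme, changing any retrieval setting (for example per_provider_limit) produces a different key, so the cached results are not reused and retrieval goes back to the network.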


8. Quiet CI / scripting

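Disable progress output with both the environment variables and the CLI flag; see Progress streaming and CLI reference. Example that redirects the JSON report to a file:

# export so the settings reach the child process
export RA_PIPELINE__STREAM_PROGRESS=false
export RA_PIPELINE__DEBUG=false
pipenv run python -m src --no-progress --format json "query" > report.json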
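For CI jobs driven from Python rather than a shell, the same invocation can be wrapped with subprocess. This sketch reuses only the flags and variables shown above and assumes that stdout then contains nothing but the JSON report; `build_quiet_run` and `run_quiet` are hypothetical helpers.

```python
import json
import os
import subprocess


def build_quiet_run(query):
    """Return the command and environment for a quiet, JSON-only batch run."""
    env = dict(os.environ,
               RA_PIPELINE__STREAM_PROGRESS="false",
               RA_PIPELINE__DEBUG="false")
    cmd = ["pipenv", "run", "python", "-m", "src",
           "--no-progress", "--format", "json", query]
    return cmd, env


def run_quiet(query, output_path="report.json"):
    """Run the batch CLI, parse its stdout as JSON, and save it to output_path."""
    cmd, env = build_quiet_run(query)
    # check=True raises CalledProcessError on a non-zero exit, failing the CI step.
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    report = json.loads(result.stdout)
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return report
```

Parsing the output before writing it means a run that prints stray text fails loudly instead of leaving an invalid report.json behind.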

Recipe picker

| Goal | Recipe |
| --- | --- |
| Fastest offline run | #1 Fast local |
| Best local quality | #2 High-quality local |
| Best overall quality | #3 Cloud OpenAI |
| More paper sources | #4 arXiv + CrossRef |
| Diagnose failures | #5 Debug mode |
| Test APIs only | #6 Retrieval smoke test |
| Repeat queries faster | #7 Session cache |

See also: CLI vs API, Environment variables, Heuristic vs LLM, Provider matrix.