arXiv¶

arXiv covers preprints and e-prints, especially strong for CS, ML, and physics. Disabled by default — enable for the full pipeline or API (not the CLI batch shortcut).

Implementation: src/retrieval/providers/arxiv.py

HTTP API¶

Attribute	Value
Search URL	`GET https://export.arxiv.org/api/query`
Query params	`search_query`, `start=0`, `max_results`
Response format	Atom XML
Authentication	None
Timeout	60s (search), 15s (health)
Retries	3 with exponential backoff

Query syntax¶

The provider preserves arXiv field syntax when present:

User query	`search_query` sent
`ti:attention abs:transformer`	unchanged
`transformer attention`	`all:transformer attention`

Field prefixes: ti: (title), abs: (abstract), au: (author), etc.

Example request¶

GET https://export.arxiv.org/api/query?search_query=all:transformer+attention&start=0&max_results=8

Normalization¶

Atom entries parse to RetrievedPaper:

Atom element	`RetrievedPaper` field
`title`	`title`
`summary`	`abstract`
`published` year	`year`
`arxiv:primary_category`	`venue` (category label)
PDF link or `id`	`url`
`arxiv:doi` if present	`doi`
Author list	`authors`

Configuration¶

retrieval:
  providers:
    arxiv:
      enabled: true

RA_RETRIEVAL__PROVIDERS__ARXIV__ENABLED=true

CLI limitation

python -m src "query" does not enable arXiv. Use interactive mode, the API, or run_research() with custom settings.

Health check¶

Search with max_results=1, 15s timeout.

Operational notes¶

arXiv requests a 3-second delay between calls in their terms of use — the pipeline's concurrency limit helps avoid hammering the API
Preprint-heavy results may rank differently than peer-reviewed venues from OpenAlex/CrossRef
Good complement when researching very recent ML work not yet indexed elsewhere