Insight-Pilot Skill
A workflow automation skill for literature research: it searches papers, GitHub repos/code/issues, PubMed, Dev.to, and blogs; deduplicates results; downloads PDFs; analyzes content; and generates incremental research reports.
Install this skill into your project:
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/insight-pilot
Setup
Run the bootstrap script (automatically checks environment, creates and installs if missing):
bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh
The script automatically detects if ~/.insight-pilot-venv exists and if packages are installed, only installing when necessary. See --help for advanced options.
Usage
Before running commands, activate the environment:
source ~/.insight-pilot-venv/bin/activate
Then use the CLI:
insight-pilot <command> [options]
CLI Commands
| Command | Purpose | Required Args | Key Optional Args |
|---|---|---|---|
| `init` | Create research project | `--topic`, `--output` | `--keywords` |
| `search` | Search, merge, and dedup | `--project`, `--source`, `--query` | `--limit`, `--since`, `--until` |
| `download` | Download PDFs + convert to Markdown | `--project` | - |
| `analyze` | Analyze papers with LLM | `--project` | `--config`, `--force` |
| `index` | Generate index.md | `--project` | `--template` |
| `status` | Check project state | `--project` | - |
| `sources` | Manage blog/RSS sources | `--project` | `--add`, `--remove`, `--config` |
JSON Output Mode
Add --json flag for structured output (recommended for agents):
insight-pilot status --json --project ./research/myproject
Blog/RSS Sources Configuration
Create sources.yaml in your project root:
blogs:
  - name: "Cursor Blog"
    type: "ghost"
    url: "https://cursor.sh/blog"
    api_key: "auto"
  - name: "Example WP Blog"
    type: "wordpress"
    url: "https://blog.example.com"
  - name: "OpenAI Blog"
    type: "rss"
    url: "https://openai.com/blog/rss.xml"
    category: "ai"
Manage sources via:
insight-pilot sources --project ./research/webagent
Environment variables:
- GITHUB_TOKEN (higher GitHub API rate limit)
- PUBMED_EMAIL (required by NCBI)
- OPENALEX_MAILTO (OpenAlex polite usage)
- INSIGHT_PILOT_SOURCES (override sources.yaml path)
New Sources Examples
# GitHub repositories + code + issues
insight-pilot search --project $PROJECT --source github --query "agent framework" --limit 30
# PubMed (requires PUBMED_EMAIL)
insight-pilot search --project $PROJECT --source pubmed --query "clinical agents" --limit 20
# Dev.to articles
insight-pilot search --project $PROJECT --source devto --query "ai agents" --limit 20
# Blogs (Ghost/WordPress/RSS from sources.yaml)
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20
Workflow (Agent + CLI Collaboration)
This is the complete workflow for Agent + CLI collaboration.
Execution Principles:
- Run CLI commands in sequence as prescribed, no line-by-line confirmation needed.
- Agent intervention is ONLY required in Phase 2 for manual review (checking items.json and setting status/exclude_reason).
Phase 1: Search and Initial Filtering
Execute the following commands directly, no confirmation needed:
PROJECT=./research/webagent
# Step 1: Initialize project
insight-pilot init --topic "WebAgent Research" --keywords "web agent,browser agent" --output $PROJECT
# Step 2: Search multiple sources (auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex github pubmed devto blog --query "web agent" --limit 50
Phase 2: Agent Review (Manual Check)
After deduplication, the Agent needs to review the paper list and remove content unrelated to the research topic.
# Check current status
insight-pilot status --json --project $PROJECT
Agent Actions:
- Read $PROJECT/.insight/items.json
- Check title and abstract for each paper
- Mark unrelated papers: set status to "excluded" and add exclude_reason
- Save the updated items.json
{
"id": "i0023",
"title": "Unrelated Paper Title",
"status": "excluded",
"exclude_reason": "Not related to web agents, focuses on chemical agents"
}
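The review step above can be scripted rather than edited by hand. A minimal sketch, assuming items.json is a flat list of item objects as shown in the Data Schemas section (the helper name and demo file are illustrative, not part of the CLI):

```python
import json
from pathlib import Path

def exclude(items_path: Path, item_id: str, reason: str) -> None:
    """Mark one item as excluded and persist the change."""
    items = json.loads(items_path.read_text())
    for item in items:
        if item["id"] == item_id:
            item["status"] = "excluded"
            item["exclude_reason"] = reason
    items_path.write_text(json.dumps(items, indent=2, ensure_ascii=False))

# Demo against a throwaway copy; in practice, point at $PROJECT/.insight/items.json
demo = Path("items_demo.json")
demo.write_text(json.dumps([
    {"id": "i0023", "title": "Unrelated Paper Title",
     "status": "active", "exclude_reason": None},
]))
exclude(demo, "i0023", "Not related to web agents, focuses on chemical agents")
print(json.loads(demo.read_text())[0]["status"])  # excluded
```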
Phase 3: Download PDFs
Execute directly, no confirmation needed:
# Step 3: Download PDFs (converts to Markdown automatically)
insight-pilot download --project $PROJECT
Download Results:
- Success: download_status: "success", PDF saved to papers/
- Failed: download_status: "failed", recorded in $PROJECT/.insight/download_failed.json
Failure list format:
[
{
"id": "i0015",
"title": "Paper Title",
"url": "https://...",
"error": "Connection timeout",
"failed_at": "2026-01-17T10:30:00Z"
}
]
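An agent can triage the failure list programmatically before deciding what to retry. A sketch assuming the failure-list format above (the demo file and URL are placeholders):

```python
import json
from collections import Counter
from pathlib import Path

def summarize_failures(path: Path) -> Counter:
    """Count failed downloads by error message."""
    failures = json.loads(path.read_text())
    return Counter(f["error"] for f in failures)

# Demo with the example record from above (URL is a placeholder)
demo = Path("download_failed_demo.json")
demo.write_text(json.dumps([
    {"id": "i0015", "title": "Paper Title", "url": "https://example.org",
     "error": "Connection timeout", "failed_at": "2026-01-17T10:30:00Z"},
]))
print(summarize_failures(demo))  # Counter({'Connection timeout': 1})
```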
Note: Advanced download (proxy/browser automation for failed items) is not yet implemented.
Phase 4: Analyze Papers
Precondition: Must complete Phase 3 Download PDFs first (download command automatically converts PDFs to Markdown).
MUST try LLM analysis first. If LLM is configured, run directly:
# Step 4: LLM Analysis (prefers converted Markdown, falls back to PDF text extraction)
insight-pilot analyze --project $PROJECT
Content Source Priority:
- Markdown (from download auto-conversion via pymupdf4llm)
- PDF text extraction (PyMuPDF)
LLM Configuration: Create .codex/skills/insight-pilot/llm.yaml:
provider: openai # openai / anthropic / ollama
model: gpt-4o-mini
api_key: sk-xxx # or set env var OPENAI_API_KEY
When LLM is not configured: Manual Analysis Required
If no LLM is configured, the Agent needs to analyze manually:
- Read PDF files in the papers/ directory
- Extract key information from each paper
- Write analysis results to $PROJECT/.insight/analysis/{id}.json
Analysis File Format ($PROJECT/.insight/analysis/{id}.json):
{
"id": "i0001",
"title": "Paper Title",
"summary": "One sentence summary",
"brief_analysis": "2-3 sentences brief analysis",
"detailed_analysis": "300-500 words detailed analysis",
"contributions": ["Contribution 1", "Contribution 2"],
"methodology": "Methodology description",
"key_findings": ["Finding 1", "Finding 2"],
"limitations": ["Limitations"],
"future_work": ["Future work 1"],
"relevance_score": 8,
"tags": ["webagent", "benchmark", "multimodal"],
"analyzed_at": "2026-01-17T12:00:00Z"
}
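When analyzing manually, the Agent can persist each record with a small helper. A sketch assuming the analysis file format above (the helper and demo directory are illustrative; only the output path convention comes from this document):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_analysis(analysis_dir: Path, record: dict) -> Path:
    """Persist one analysis record as {id}.json, stamping analyzed_at if absent."""
    record.setdefault(
        "analyzed_at",
        datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    )
    analysis_dir.mkdir(parents=True, exist_ok=True)
    out = analysis_dir / f"{record['id']}.json"
    out.write_text(json.dumps(record, indent=2, ensure_ascii=False))
    return out

path = write_analysis(Path("analysis_demo"), {
    "id": "i0001",
    "title": "Paper Title",
    "summary": "One sentence summary",
    "relevance_score": 8,
    "tags": ["webagent"],
})
print(path.name)  # i0001.json
```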
Phase 5: Generate Incremental Report
# Step 5: Generate/Update Index
insight-pilot index --project $PROJECT
Reports are stored in $PROJECT/index.md, showing only analyzed papers and linking to reports/{id}.md detailed reports.
Report Structure:
# WebAgent Research
> **Generated**: 2026-01-18 10:30
> **Keywords**: web agent, browser agent
> **Analyzed**: 5 papers
---
## Analyzed Papers
### [Paper Title](reports/i0001.md)
**Authors**: Author A, Author B et al. | **Date**: 2026-01-15 | **Links**: arXiv/DOI | **Relevance**: 8/10
**Summary**: One sentence summary...
> 2-3 sentences brief analysis...
**Tags**: `webagent` `benchmark` `multimodal`
---
## Papers Not Available
_The following papers could not be downloaded. Only abstracts are shown._
### Paper Title
**Authors**: ... | **Date**: ... | **Links**: ...
> Abstract...
---
## Statistics
| Metric | Value |
|--------|-------|
| Papers Analyzed | 5 |
| Download Failed | 1 |
| Total Processed | 6 |
Incremental Update Workflow
For daily/weekly updates:
# 1. Search new papers (use --since for date limit, auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex --query "web agent" --since 2026-01-17 --limit 20
# 2. [Agent] Review newly added papers
# 3. Download PDFs for new papers
insight-pilot download --project $PROJECT
# 4. [Agent] Analyze new papers, update reports
# 5. Regenerate index
insight-pilot index --project $PROJECT
Project Structure
research/myproject/
├── .insight/
│   ├── config.yaml            # Project configuration
│   ├── state.json             # Workflow state
│   ├── items.json             # Paper metadata (incl. status, exclude_reason)
│   ├── raw_arxiv.json         # Raw search results
│   ├── raw_openalex.json
│   ├── download_failed.json   # Download failure list (for advanced retry)
│   ├── analysis/              # Paper analysis results
│   │   ├── i0001.json
│   │   ├── i0002.json
│   │   └── ...
│   └── markdown/              # PDF conversion results (pymupdf4llm)
│       ├── i0001/
│       │   ├── i0001.md       # Converted Markdown
│       │   └── metadata.json
│       └── ...
├── papers/                    # Downloaded PDFs
├── reports/                   # Per-paper detailed reports
└── index.md                   # Current research report (incrementally updated)
Data Schemas
Item (Paper)
{
"id": "i0001",
"type": "paper",
"title": "Paper Title",
"authors": ["Author One", "Author Two"],
"date": "2026-01-15",
"abstract": "...",
"status": "active|excluded|pending",
"exclude_reason": null,
"identifiers": {
"doi": "10.1234/example",
"arxiv_id": "2601.12345",
"openalex_id": "W1234567890"
},
"urls": {
"abstract": "https://arxiv.org/abs/2601.12345",
"pdf": "https://arxiv.org/pdf/2601.12345"
},
"download_status": "success|pending|failed|unavailable",
"local_path": "./papers/i0001.pdf",
"citation_count": 42,
"source": ["arxiv", "openalex"],
"collected_at": "2026-01-17T10:00:00Z"
}
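When consuming items.json, an agent typically wants only the items that survived review and whose PDF downloaded successfully. A sketch assuming the Item schema above (the function name is illustrative):

```python
def ready_for_analysis(items: list[dict]) -> list[dict]:
    """Items that are still active and whose PDF downloaded successfully."""
    return [
        i for i in items
        if i.get("status") == "active" and i.get("download_status") == "success"
    ]

# Demo with minimal records covering the three relevant combinations
items = [
    {"id": "i0001", "status": "active",   "download_status": "success"},
    {"id": "i0002", "status": "excluded", "download_status": "success"},
    {"id": "i0003", "status": "active",   "download_status": "failed"},
]
print([i["id"] for i in ready_for_analysis(items)])  # ['i0001']
```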
Error Codes
| Code | Meaning | Retryable |
|---|---|---|
| `PROJECT_NOT_FOUND` | Project directory doesn't exist | No |
| `NO_INPUT_FILES` | Required input files missing | No |
| `NO_ITEMS_FILE` | items.json not found | No |
| `INVALID_SOURCE` | Unknown data source | No |
| `NETWORK_ERROR` | API request failed | Yes |
| `RATE_LIMITED` | API rate limit hit | Yes |
| `DOWNLOAD_FAILED` | PDF download failed | Yes |
| `CONVERSION_FAILED` | PDF to Markdown conversion failed | Yes |
| `MISSING_DEPENDENCY` | Required package not installed | No |
Agent Guidelines
Execution Principles:
- First run: Run bootstrap script to auto-setup environment
- CLI Commands (init, search, download, analyze, index): Run in sequence, no confirmation needed
- Agent intervention ONLY needed during Phase 2 (Review) and Manual Analysis (if no LLM)
Specific Guidelines:
- Environment Setup: run bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh first
- Use the --json flag to get structured output for parsing
- Execute CLI directly: do not ask for confirmation; follow the workflow sequence
- Review: modify status and exclude_reason in items.json
- LLM Analysis First: use the analyze command if configured, otherwise manually create analysis/{id}.json
- Incremental Updates: only process new papers; keep existing analysis results