Agent skill

article-extractor

Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.


Install this agent skill to your Project

npx add-skill https://github.com/drshailesh88/integrated_content_OS/tree/main/skills/cardiology/article-extractor

SKILL.md

Article Extractor

This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

When to Use This Skill

Activate when the user:

  • Provides an article/blog URL and wants the text content
  • Asks to "download this article"
  • Wants to "extract the content from [URL]"
  • Asks to "save this blog post as text"
  • Needs clean article text without distractions

How It Works

Steps, in order:

  1. Check if tools are installed (reader or trafilatura)
  2. Download and extract article using best available tool
  3. Clean up the content (remove extra whitespace, format properly)
  4. Save to file with article title as filename
  5. Confirm location and show preview

Installation Check

Check for article extraction tools in this order:

Option 1: reader (Recommended - Mozilla's Readability)

bash
command -v reader

If not installed:

bash
npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli

Option 2: trafilatura (Python-based, very good)

bash
command -v trafilatura

If not installed:

bash
pip3 install trafilatura

Option 3: Fallback (curl + simple parsing)

If neither tool is available, fall back to curl plus basic HTML parsing. This is less reliable but needs no extra installation.

Extraction Methods

Method 1: Using reader (Best for most articles)

bash
# Extract article
reader "URL" > article.txt

Pros:

  • Based on Mozilla's Readability algorithm
  • Excellent at removing clutter
  • Preserves article structure

Method 2: Using trafilatura (Best for blogs/news)

bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt

Pros:

  • Very accurate extraction
  • Good with various site structures
  • Handles multiple languages

Options:

  • --no-comments: Skip comment sections
  • --no-tables: Skip data tables
  • --precision: Favor precision over recall
  • --recall: Extract more content (may include some noise)

Method 3: Fallback (curl + basic parsing)

bash
# Download (-L follows redirects) and extract basic content
curl -sL "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
            self.in_content = True

    def handle_endtag(self, tag):
        # Leave a skipped element once its closing tag is seen
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only after the first content tag and outside skipped elements
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt

Note: This is less reliable, but it needs only curl and python3, with no extra packages.

Getting Article Title

Extract title for filename:

Using reader:

bash
# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')

Using trafilatura:

bash
# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")

Using curl (fallback):

bash
# Note: grep -P needs GNU grep with PCRE support (BSD/macOS grep lacks it)
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | head -n 1 | sed 's/ - .*//' | sed 's/ | .*//')
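
Where `grep -P` is unavailable, a sed-based variant may be more portable. The following is a minimal sketch rather than part of the original skill: `extract_title` and `strip_site_name` are hypothetical helper names, and the pattern assumes a lowercase `<title>` element that sits on a single line.

```shell
# Hypothetical helper: print the text of the first single-line <title>
# element found on stdin (assumes lowercase tag names).
extract_title() {
    sed -n 's:.*<title[^>]*>\([^<]*\)</title>.*:\1:p' | head -n 1
}

# Hypothetical helper: drop a trailing site name such as " - Site" or " | Site".
strip_site_name() {
    sed -e 's/ - .*//' -e 's/ | .*//'
}

# Usage (needs network access):
#   TITLE=$(curl -sL "URL" | extract_title | strip_site_name)
```

Like the grep version, this cuts at the first " - " or " | ", so a title that itself contains those separators will be shortened.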

Filename Creation

Clean title for filesystem:

bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem: replace / : | with -, delete ? " < >, limit length, trim trailing spaces
# (tr needs -d to delete characters; an empty replacement set is an error)
FILENAME=$(printf '%s\n' "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
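
The cleanup rules can also be wrapped in a function so every extraction path shares them. This is a sketch under assumptions: `safe_filename` is a hypothetical name, and the `article` default for an empty title is an illustrative choice, not something the skill specifies.

```shell
# Hypothetical helper: turn an article title into a filesystem-safe .txt name.
safe_filename() {
    name=$(printf '%s' "$1" \
        | tr '/:|' '---' \
        | tr -d '?"<>' \
        | cut -c 1-100 \
        | sed -e 's/^ *//' -e 's/ *$//')
    # Assumed default so an empty title never yields a bare ".txt"
    [ -n "$name" ] || name="article"
    printf '%s.txt\n' "$name"
}

safe_filename 'What is HTTP/2? A "Quick" Guide'
# prints: What is HTTP-2 A Quick Guide.txt
```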

Complete Workflow

bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt

        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;

    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")

        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;

    fallback)
        # Get title (grep -P needs GNU grep; -L makes curl follow redirects)
        TITLE=$(curl -sL "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate)

        # Get content (basic extraction)
        curl -sL "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
            return
        if tag in {'p', 'article', 'main'}:
            self.in_content = True
        if tag in {'h1', 'h2', 'h3'} and self.skip_depth == 0:
            self.content.append('\n')

    def handle_endtag(self, tag):
        # Leave a skipped element once its closing tag is seen
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only after the first content tag and outside skipped elements
        if self.in_content and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

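# Stop if nothing was extracted
if [ ! -s temp_article.txt ]; then
    echo "Error: no content extracted from $ARTICLE_URL"
    rm -f temp_article.txt
    exit 1
fi

# Clean filename: replace / : | with -, delete ? " < >, limit length, trim spaces
FILENAME=$(printf '%s\n' "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME:-article}.txt"  # Fall back to a generic name if the title is empty

# Move to final filename
mv temp_article.txt "$FILENAME"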

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"

Error Handling

Common Issues

1. Tool not installed

  • Try alternate tool (reader → trafilatura → fallback)
  • Offer to install: "Install reader with: npm install -g reader-cli"

2. Paywall or login required

  • Extraction tools may fail
  • Inform user: "This article requires authentication. Cannot extract."

3. Invalid URL

  • Check URL format
  • Retry with redirects followed (for example, curl -L) and without

4. No content extracted

  • Site may use heavy JavaScript
  • Try another tool or the fallback method (static extraction may still return nothing on JavaScript-rendered pages)
  • Inform user if extraction fails

5. Special characters in title

  • Clean title for filesystem
  • Remove: /, :, ?, ", <, >, |
  • Replace with - or remove
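
Two of the checks above (URL format and empty extraction) can be sketched as small shell helpers. Both names are hypothetical, and the 50-word default threshold is an assumption chosen for illustration, not a value from the skill.

```shell
# Hypothetical helper: accept only http:// or https:// URLs with something after the scheme.
is_http_url() {
    case "$1" in
        http://?*|https://?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed if FILE exists and holds at least MIN_WORDS words.
# Usage: has_enough_text FILE [MIN_WORDS]   (default 50 is an assumed threshold)
has_enough_text() {
    min_words=${2:-50}
    [ -f "$1" ] || return 1
    # tr strips the padding some wc implementations add around the count
    words=$(wc -w < "$1" | tr -d ' ')
    [ "$words" -ge "$min_words" ]
}

# Usage:
#   is_http_url "$ARTICLE_URL" || echo "Invalid URL: $ARTICLE_URL"
#   has_enough_text temp_article.txt || echo "Little or no content extracted"
```

A word-count check is only a heuristic: a paywall notice or cookie banner can easily exceed the threshold, so the preview shown to the user remains the real quality check.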

Output Format

Saved File Contains:

  • Article title (if available)
  • Author (if available from tool)
  • Main article text
  • Section headings
  • No navigation, ads, or clutter

What Gets Removed:

  • Navigation menus
  • Ads and promotional content
  • Newsletter signup forms
  • Related articles sidebars
  • Comment sections (optional)
  • Social media buttons
  • Cookie notices

Tips for Best Results

1. Use reader for most articles

  • Best all-around tool
  • Based on Firefox Reader View
  • Works on most news sites and blogs

2. Use trafilatura for:

  • Academic articles
  • News sites
  • Blogs with complex layouts
  • Non-English content

3. Fallback method limitations:

  • May include some noise
  • Less accurate paragraph detection
  • Better than nothing for simple sites

4. Check extraction quality:

  • Always show preview to user
  • Ask if it looks correct
  • Offer to try different tool if needed

Example Usage

Simple extraction:

bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"

With error handling:

bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
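
The same idea can be generalized so that a tool counts as successful only when it exits cleanly and actually writes output. This is one possible sketch: `try_extract` and `extract_article` are hypothetical names, and the tool invocations are the ones used elsewhere in this skill.

```shell
# Hypothetical helper: run an extractor command, capture stdout in OUTFILE,
# and succeed only if the command exits 0 and wrote at least one byte.
# Usage: try_extract OUTFILE COMMAND [ARGS...]
try_extract() {
    outfile=$1
    shift
    "$@" > "$outfile" 2>/dev/null && [ -s "$outfile" ]
}

# Hypothetical wrapper: try reader first, then trafilatura.
# Prints the name of the tool that worked; returns 1 if both fail.
extract_article() {
    url=$1
    outfile=$2
    try_extract "$outfile" reader "$url" && { echo "reader"; return 0; }
    try_extract "$outfile" trafilatura --URL "$url" --output-format txt \
        && { echo "trafilatura"; return 0; }
    return 1
}

# Usage:
#   TOOL=$(extract_article "$URL" temp.txt) || {
#       echo "Error: Could not extract article. Install reader or trafilatura."
#       exit 1
#   }
```

A missing tool is treated like any other failure, so no separate `command -v` check is needed before each attempt.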

Best Practices

  • ✅ Always show preview after extraction (first 10 lines)
  • ✅ Verify extraction succeeded before saving
  • ✅ Clean filename for filesystem compatibility
  • ✅ Try fallback method if primary fails
  • ✅ Inform user which tool was used
  • ✅ Keep filename length reasonable (< 100 chars)

After Extraction

Display to user:

  1. "✓ Extracted: [Article Title]"
  2. "✓ Saved to: [filename]"
  3. Show preview (first 10-15 lines)
  4. File size and location
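
The four items above could be printed by a single helper. This is a sketch only: `report_extraction` is a hypothetical name, and the exact wording of the size and location lines is an assumption.

```shell
# Hypothetical helper: print confirmation lines, file size, location, and a preview.
# Usage: report_extraction TITLE FILE [PREVIEW_LINES]
report_extraction() {
    title=$1
    file=$2
    lines=${3:-10}  # preview length; the skill suggests 10-15 lines
    size=$(wc -c < "$file" | tr -d ' ')
    dir=$(cd "$(dirname "$file")" && pwd)
    echo "✓ Extracted: $title"
    echo "✓ Saved to: $file"
    echo "Size: $size bytes"
    echo "Location: $dir"
    echo ""
    echo "Preview (first $lines lines):"
    head -n "$lines" "$file"
}

# Usage:
#   report_extraction "$TITLE" "$FILENAME" 10
```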

Ask if needed:

  • "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
  • "Should I extract another article?"
