Command-Line Image Retrieval Tool for Claude Code - Technical Analysis

Generated: 2025-07-26 18:45 UTC
Status: Complete

Overview

This document presents an analysis and design for a command-line tool that enables Claude Code to reliably retrieve images from a variety of sources. It examines the technical challenges involved, including authentication requirements and anti-scraping measures, and proposes a robust design for image acquisition.

Executive Summary

The proposed tool, tentatively named imgfetch, is a Python-based CLI application that provides multiple strategies for image retrieval, including direct downloads, browser automation, and API integrations. It features intelligent fallback mechanisms, respects rate limits, handles authentication, and provides clear feedback to Claude Code.

Technical Architecture

Core Components

graph TD
    A[Claude Code] --> B[imgfetch CLI]
    B --> C{Strategy Selector}
    C --> D[Direct Download]
    C --> E[Browser Automation]
    C --> F[API Integration]
    C --> G[Wayback Machine]
    D --> H[Image Processing]
    E --> H
    F --> H
    G --> H
    H --> I[Output Handler]
    I --> J[Local File]
    I --> K[Metadata JSON]
    B --> L[Cache Manager]
    B --> M[Auth Manager]
    B --> N[Config Manager]

Technology Stack

| Component | Technology | Justification |
|---|---|---|
| Core Language | Python 3.9+ | Rich ecosystem, excellent libraries |
| CLI Framework | Click | Intuitive API, good testing support |
| HTTP Client | httpx + requests | Async support, retry mechanisms |
| Browser Automation | Playwright | Headless, handles JS, better than Selenium |
| HTML Parsing | BeautifulSoup4 | Robust, handles malformed HTML |
| Image Processing | Pillow | Format conversion, validation |
| Caching | diskcache | Simple, efficient local caching |
| Configuration | TOML/YAML | Human-readable, structured |

Architecture Decisions

  1. Multi-Strategy Approach: Different methods for different scenarios
  2. Async-First Design: Better performance for concurrent operations
  3. Plugin Architecture: Extensible for new strategies (see the registry sketch after this list)
  4. Structured Logging: Clear feedback for Claude Code
  5. Fail-Safe Mechanisms: Graceful degradation
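
Decision 3 deserves a concrete shape. A minimal sketch of what the strategy registry could look like; the class and decorator names here are illustrative, not a fixed API:

# Hypothetical strategy plugin interface (illustrative sketch)
from abc import ABC, abstractmethod

class FetchStrategy(ABC):
    """Base class that every retrieval strategy implements."""

    name: str = "base"

    @abstractmethod
    async def download(self, url: str) -> bytes:
        """Return raw image bytes or raise on failure."""

STRATEGY_REGISTRY: dict = {}

def register_strategy(cls):
    """Class decorator that makes a strategy discoverable by name."""
    STRATEGY_REGISTRY[cls.name] = cls
    return cls

New strategies then register themselves at import time, which keeps the strategy selector ignorant of concrete implementations.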

Authentication and Anti-Scraping Strategies

Authentication Handling

# Example authentication configuration structure
auth_config = {
    "sites": {
        "example.com": {
            "method": "cookie",
            "cookies": {"session": "xxx"},
            "headers": {"User-Agent": "Mozilla/5.0..."}
        },
        "api.service.com": {
            "method": "bearer",
            "token": "Bearer xxx",
            "rate_limit": "100/hour"
        }
    }
}
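
A small helper can turn this structure into per-request headers. The sketch below assumes the configuration dict above; the resolve_auth() name is illustrative:

# Sketch: resolving per-site auth config into request headers
from urllib.parse import urlparse

def resolve_auth(url: str, auth_config: dict) -> dict:
    """Return extra headers for the host that owns this URL, if configured."""
    host = urlparse(url).netloc
    site = auth_config.get("sites", {}).get(host, {})
    headers = dict(site.get("headers", {}))
    if site.get("method") == "bearer":
        headers["Authorization"] = site["token"]
    elif site.get("method") == "cookie":
        # Serialize the cookie jar into a single Cookie header
        headers["Cookie"] = "; ".join(
            f"{k}={v}" for k, v in site.get("cookies", {}).items()
        )
    return headers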

Anti-Scraping Countermeasures

| Challenge | Solution | Implementation |
|---|---|---|
| User-Agent Detection | Rotate realistic user agents | User agent pool with browser profiles |
| Rate Limiting | Intelligent throttling | Token bucket algorithm |
| JavaScript Rendering | Browser automation | Playwright with stealth mode |
| CAPTCHAs | Detection and notification | Return clear error for manual intervention |
| IP Blocking | Proxy support | SOCKS5/HTTP proxy configuration |
| Fingerprinting | Browser profile randomization | Canvas, WebGL spoofing |
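
The token bucket named in the table is compact enough to sketch directly. A minimal per-domain version (the class name is illustrative, not the final implementation) refills continuously and sleeps until a token is available:

# Minimal async token bucket for per-domain throttling (sketch)
import asyncio
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int = 1):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accrue
            await asyncio.sleep((1 - self.tokens) / self.rate)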

Stealth Techniques

# Playwright stealth configuration example
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async  # third-party playwright-stealth package

async def create_stealth_browser():
    playwright = await async_playwright().start()
    browser = await playwright.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-dev-shm-usage',
            '--no-sandbox',
            '--disable-web-security',
        ]
    )
    context = await browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent=get_random_user_agent(),  # helper that picks from the user-agent pool
        locale='en-US',
        timezone_id='America/New_York',
    )
    # playwright-stealth patches individual pages rather than contexts, so
    # callers apply stealth_async() to each page opened from this context
    return browser, context

CLI Interface Design

Command Structure

# Basic usage
imgfetch <url> [options]

# Advanced usage with authentication
imgfetch <url> --auth-config auth.toml --output image.jpg

# Batch processing
imgfetch --batch urls.txt --output-dir ./images/

# With specific strategy
imgfetch <url> --strategy browser --wait 5

# Using cache
imgfetch <url> --cache-dir ~/.imgfetch/cache --cache-ttl 3600
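
With Click, the framework chosen in the technology stack, the entry point for these commands could look roughly like the following sketch (option set abridged, names illustrative):

# Sketch of the imgfetch entry point using Click
import click

@click.command()
@click.argument("url", required=False)  # optional because --batch supplies URLs
@click.option("--output", "-o", type=click.Path(), help="Output file path")
@click.option("--strategy", "-s", default="auto", show_default=True,
              help="Force a specific strategy")
@click.option("--timeout", default=30, show_default=True,
              help="Request timeout in seconds")
@click.option("--json", "as_json", is_flag=True, help="Emit metadata as JSON")
def main(url, output, strategy, timeout, as_json):
    """Retrieve an image from URL with automatic strategy selection."""
    ...  # dispatch to the strategy selector

if __name__ == "__main__":
    main()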

Options and Arguments

| Option | Description | Default |
|---|---|---|
| --output, -o | Output file path | Generated from URL |
| --strategy, -s | Force specific strategy | auto |
| --timeout | Request timeout (seconds) | 30 |
| --retries | Number of retry attempts | 3 |
| --auth-config | Authentication config file | ~/.imgfetch/auth.toml |
| --proxy | Proxy URL | None |
| --user-agent | Custom user agent | Random |
| --wait | Wait time for JS rendering (seconds) | 2 |
| --format | Convert image format | Keep original |
| --max-size | Maximum file size (MB) | 100 |
| --quiet, -q | Suppress progress output | False |
| --json | Output metadata as JSON | False |
| --cache-dir | Cache directory | ~/.imgfetch/cache |
| --no-cache | Disable caching | False |
| --headers | Additional HTTP headers | None |

Output Formats

// JSON output structure (--json flag)
{
  "success": true,
  "url": "https://example.com/image.jpg",
  "file_path": "/path/to/saved/image.jpg",
  "file_size": 1048576,
  "content_type": "image/jpeg",
  "dimensions": {"width": 1920, "height": 1080},
  "strategy_used": "direct",
  "download_time": 1.234,
  "cache_hit": false,
  "metadata": {
    "title": "Image Title",
    "alt_text": "Description",
    "source_page": "https://example.com/page"
  }
}

Implementation Strategies

1. Direct Download Strategy

import httpx

class DirectDownloadStrategy:
    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout

    async def download(self, url: str, headers: dict) -> bytes:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                url,
                headers=headers,
                follow_redirects=True,
                timeout=self.timeout,
            )
            response.raise_for_status()  # surface 4xx/5xx to the fallback chain
            return response.content

When to use:

  • Direct image URLs
  • No authentication required
  • No JavaScript rendering needed

2. Browser Automation Strategy

class BrowserAutomationStrategy:
    async def download(self, url: str) -> bytes:
        browser, context = await create_stealth_browser()
        try:
            page = await context.new_page()
            await stealth_async(page)  # playwright-stealth patches pages, not contexts

            # Navigate and wait for content
            await page.goto(url, wait_until='networkidle')
            await page.wait_for_timeout(self.wait_time)

            # Find the image element on the rendered page
            img_element = await page.query_selector('img[src*=".jpg"]')
            if img_element is None:
                raise ImageNotFoundError(url)
            img_url = await img_element.get_attribute('src')

            # Download through the browser context so cookies and headers carry over
            response = await page.request.get(img_url)
            return await response.body()
        finally:
            await browser.close()

When to use:

  • JavaScript-rendered content
  • Complex authentication flows
  • Dynamic image loading

3. API Integration Strategy

class APIStrategy:
    def __init__(self):
        self.apis = {
            'instagram': InstagramAPI(),
            'twitter': TwitterAPI(),
            'flickr': FlickrAPI(),
        }

    def detect_api(self, url: str):
        # Naive host match; a real implementation would parse the URL properly
        for name, api in self.apis.items():
            if name in url:
                return api
        return None

    async def download(self, url: str) -> bytes:
        api = self.detect_api(url)
        if api:
            return await api.get_image(url)
        raise UnsupportedAPIError(url)

When to use:

  • Supported platforms with APIs
  • Better reliability than scraping
  • Respects platform terms of service

4. Wayback Machine Fallback

class WaybackMachineStrategy:
    async def download(self, url: str) -> bytes:
        # The availability API returns the closest archived snapshot as JSON;
        # the /web/*/ form is the browsing UI, not a programmatic endpoint
        api_url = f"https://archive.org/wayback/available?url={url}"
        snapshot = await self.get_latest_snapshot(api_url)
        if snapshot:
            return await self.download_from_snapshot(snapshot)
        raise NoSnapshotError(url)

When to use:

  • Original source unavailable
  • Historical versions needed
  • Last resort fallback

Reliability and Error Handling

Retry Mechanism

# Retry with exponential backoff via the tenacity library
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.ConnectError))
)
async def download_with_retry(url: str) -> bytes:
    return await download(url)

Error Classification

| Error Type | Handling Strategy | User Feedback |
|---|---|---|
| Network Timeout | Retry with backoff | "Retrying download (attempt 2/3)..." |
| 404 Not Found | Try Wayback Machine | "Image not found, checking archives..." |
| 403 Forbidden | Try browser strategy | "Access denied, attempting browser method..." |
| CAPTCHA | Notify user | "CAPTCHA detected, manual intervention required" |
| Rate Limited | Wait and retry | "Rate limited, waiting 60 seconds..." |
| Invalid Format | Convert if possible | "Converting image format..." |
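
As a sketch, the table above can map onto a small classifier over httpx exceptions, assuming the raise_for_status() call from the direct strategy; the action names are illustrative:

# Sketch: mapping a failed attempt to the next action in the table
import httpx

def classify_error(exc: Exception) -> str:
    """Return the handling action for a failed download attempt."""
    if isinstance(exc, httpx.TimeoutException):
        return "retry_with_backoff"
    if isinstance(exc, httpx.HTTPStatusError):
        code = exc.response.status_code
        if code == 404:
            return "try_wayback"
        if code == 403:
            return "try_browser"
        if code == 429:
            return "wait_and_retry"
    return "report_error"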

Fallback Chain

graph LR
    A[Direct Download] -->|Fails| B[API Strategy]
    B -->|Fails| C[Browser Automation]
    C -->|Fails| D[Wayback Machine]
    D -->|Fails| E[Error Report]
    A -->|Success| F[Save Image]
    B -->|Success| F
    C -->|Success| F
    D -->|Success| F
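
In code, the chain reduces to an ordered loop. A minimal sketch, assuming each strategy object exposes the name and download() members from the earlier registry sketch:

# Sketch of the fallback chain as an ordered loop over strategies
async def fetch_with_fallback(url: str, strategies: list) -> bytes:
    errors = []
    for strategy in strategies:  # e.g. [direct, api, browser, wayback]
        try:
            return await strategy.download(url)
        except Exception as exc:  # each failure feeds the final error report
            errors.append((strategy.name, exc))
    raise RuntimeError(f"All strategies failed for {url}: {errors}")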

Legal and Ethical Considerations

Built-in Safeguards

  1. Robots.txt Compliance

    async def check_robots_txt(url: str) -> bool:
        # urllib's RobotFileParser is synchronous, so fetch robots.txt with
        # httpx and feed the parser its lines instead of calling rp.read()
        robots_url = get_robots_url(url)
        rp = RobotFileParser()
        async with httpx.AsyncClient() as client:
            response = await client.get(robots_url)
        rp.parse(response.text.splitlines())
        return rp.can_fetch("imgfetch", url)
    
  2. Rate Limiting

    • Default: 1 request per second per domain
    • Configurable per-site limits
    • Automatic backoff on 429 responses
  3. Terms of Service Warnings

    TOS_WARNINGS = {
        "instagram.com": "Instagram prohibits automated access. Use at your own risk.",
        "facebook.com": "Facebook requires API access for automation.",
    }
    
  4. Copyright Notice

    • Display copyright warnings in output
    • Include source URL in metadata
    • Option to check Creative Commons licenses

Ethical Usage Guidelines

## Responsible Use Guidelines

1. **Always respect copyright** - Downloaded images may be protected
2. **Follow rate limits** - Don't overwhelm servers
3. **Check robots.txt** - Respect site preferences
4. **Personal use only** - Unless you have explicit permission
5. **Attribution** - Credit original sources when sharing

Performance Optimizations

Caching Strategy

import hashlib
from pathlib import Path
from diskcache import Cache

class CacheManager:
    def __init__(self, cache_dir: Path, ttl: int = 3600):
        self.cache = Cache(str(cache_dir))
        self.ttl = ttl

    def get_cache_key(self, url: str) -> str:
        return hashlib.sha256(url.encode()).hexdigest()

    async def get_or_download(self, url: str) -> bytes:
        # diskcache's memoize() can't see instance state (self.ttl) at class
        # definition time and doesn't await coroutines, so use explicit get/set
        key = self.get_cache_key(url)
        data = self.cache.get(key)
        if data is None:
            data = await self.download_func(url)
            self.cache.set(key, data, expire=self.ttl)
        return data

Concurrent Downloads

import asyncio
from typing import List

async def batch_download(urls: List[str], max_concurrent: int = 5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def download_with_semaphore(url):
        async with semaphore:
            return await download_image(url)

    tasks = [download_with_semaphore(url) for url in urls]
    # return_exceptions=True keeps one failed URL from aborting the whole batch
    return await asyncio.gather(*tasks, return_exceptions=True)

Integration with Claude Code

Usage Examples

# Simple download
imgfetch "https://example.com/image.jpg" -o "local_image.jpg"

# Complex scenario with authentication
imgfetch "https://protected-site.com/image.png" \
  --auth-config ~/.imgfetch/auth.toml \
  --strategy browser \
  --wait 5 \
  --output "protected_image.png"

# Batch processing with JSON output
imgfetch --batch image_urls.txt \
  --output-dir ./downloaded_images/ \
  --json > download_results.json

# Using with Claude Code's Bash tool
result=$(imgfetch "https://example.com/img.jpg" --json)
image_path=$(echo "$result" | jq -r '.file_path')

Error Handling in Claude Code

# Claude Code can check exit codes
if imgfetch "$url" -o "$output"; then
    echo "Download successful"
else
    echo "Download failed, trying alternative method"
    imgfetch "$url" --strategy browser -o "$output"
fi

Configuration File Format

~/.imgfetch/config.toml

[general]
default_strategy = "auto"
timeout = 30
retries = 3
cache_ttl = 3600
max_file_size = 104857600  # 100MB

[rate_limits]
default = 1.0  # requests per second
"api.example.com" = 10.0
"slow-site.com" = 0.5

[strategies.browser]
headless = true
wait_time = 2
viewport_width = 1920
viewport_height = 1080

[strategies.direct]
chunk_size = 8192
verify_ssl = true

[logging]
level = "INFO"
file = "~/.imgfetch/logs/imgfetch.log"

Security Considerations

Input Validation

from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    # Prevent file:// URLs
    if url.startswith('file://'):
        raise SecurityError("Local file access not allowed")

    # Validate URL format
    parsed = urlparse(url)
    if not all([parsed.scheme, parsed.netloc]):
        raise ValueError("Invalid URL format")

    # Check for SSRF attacks (hostname strips any port and userinfo)
    if is_internal_ip(parsed.hostname):
        raise SecurityError("Internal IP addresses not allowed")

    return True
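
The is_internal_ip() helper referenced above is not shown. One plausible sketch resolves the hostname and rejects private, loopback, and link-local results:

# Sketch of the is_internal_ip() helper (illustrative)
import ipaddress
import socket

def is_internal_ip(hostname: str) -> bool:
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False  # unresolvable; let the request fail on its own
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return True
    return False

A hardened version would also pin the resolved address for the actual request, so a DNS rebinding attack cannot swap in an internal address after the check.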

Secure Storage

  • Credentials stored in encrypted format (see the keyring sketch after this list)
  • No hardcoded secrets
  • Environment variable support
  • Secure deletion of temporary files
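
For the encrypted credential store, one option (an assumption, not a committed dependency) is the keyring package, which delegates to the operating system keychain:

# Sketch: credentials via the system keychain using the keyring package
from typing import Optional

import keyring

def save_token(site: str, token: str) -> None:
    keyring.set_password("imgfetch", site, token)

def load_token(site: str) -> Optional[str]:
    return keyring.get_password("imgfetch", site)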

Testing Strategy

Unit Tests

import httpx
import pytest
import respx

@pytest.mark.asyncio
async def test_direct_download_strategy():
    strategy = DirectDownloadStrategy()
    # respx intercepts httpx traffic (aioresponses only mocks aiohttp)
    with respx.mock:
        respx.get('https://example.com/test.jpg').mock(
            return_value=httpx.Response(200, content=b'fake_image_data')
        )
        result = await strategy.download('https://example.com/test.jpg', headers={})
        assert result == b'fake_image_data'

Integration Tests

  • Mock server for various scenarios
  • Real browser testing for automation
  • Rate limit testing
  • Error scenario coverage

Deployment and Distribution

Installation Methods

# Via pip
pip install imgfetch

# Via pipx (recommended for CLI tools)
pipx install imgfetch

# From source
git clone https://github.com/user/imgfetch
cd imgfetch
pip install -e .

Docker Support

FROM python:3.9-slim
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
# Playwright downloads its own Chromium build plus the system packages it needs
RUN playwright install --with-deps chromium
ENTRYPOINT ["imgfetch"]

Future Enhancements

Planned Features

  1. OCR Support - Extract text from images
  2. Image Search - Find similar images
  3. Batch Processing UI - Web interface for bulk operations
  4. Cloud Storage Integration - Direct upload to S3/GCS
  5. Plugin System - Custom strategies via plugins
  6. AI Enhancement - Image upscaling/enhancement
  7. Metadata Extraction - EXIF data parsing (sketched below)
  8. Format Detection - Better format handling
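
Feature 7 is already within reach of the chosen stack. A sketch using Pillow's getexif():

# Sketch for planned feature 7: EXIF parsing with Pillow
from PIL import ExifTags, Image

def extract_exif(path: str) -> dict:
    with Image.open(path) as img:
        exif = img.getexif()
        # Map numeric tag IDs to human-readable names where known
        return {ExifTags.TAGS.get(tag_id, tag_id): value
                for tag_id, value in exif.items()}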

API Extension

# Potential Python API for programmatic use
import asyncio
from imgfetch import ImageFetcher

async def main():
    fetcher = ImageFetcher()
    image_data = await fetcher.fetch(
        url="https://example.com/image.jpg",
        strategy="auto",
        auth_config="path/to/auth.toml",
    )

asyncio.run(main())

Conclusion

The proposed imgfetch tool provides a comprehensive solution for Claude Code to reliably retrieve images from various sources. Its multi-strategy approach, robust error handling, and ethical safeguards make it suitable for production use while respecting legal and technical constraints.

Key strengths:

  • Multiple fallback strategies ensure high success rates
  • Handles modern web challenges (JS rendering, authentication)
  • Respects rate limits and robots.txt
  • Clear CLI interface for Claude Code integration
  • Extensible architecture for future enhancements

The tool balances technical capability with responsible use, providing Claude Code with a powerful yet ethical image retrieval solution.