---
name: doc-decomposer
description: Decompose documents (PDF, Markdown, text) into structured content for visual generators. Use this skill when preparing content for sketchnotes or infographics, or when you need to extract key concepts from documents.
---

# Document Decomposer Skill

This skill parses documents and extracts structured content for visual generation (sketchnotes, infographics). It keeps every text element short, because AI image generators render short text far more accurately than long text.

## Quick Start

```python
from doc_decomposer import DocumentDecomposer

decomposer = DocumentDecomposer()

# From a file
result = decomposer.decompose_file("/path/to/document.pdf")

# From raw text
result = decomposer.decompose_text(raw_text)

# Save for generators
result.save("/path/to/output.json")

print(result.title)
print(result.sections)
print(result.key_concepts)
```

## Output Format

The decomposer outputs a standardized JSON structure:

```json
{
  "title": "Short Title (max 40 chars)",
  "subtitle": "Optional subtitle",
  "sections": [
    {
      "header": "Section Name (max 5 words)",
      "key_points": ["Point 1 (3-6 words)", "Point 2"],
      "icon_hint": "suggested icon type"
    }
  ],
  "key_concepts": [
    {
      "term": "Concept (1-3 words)",
      "description": "Brief description (max 10 words)"
    }
  ],
  "relationships": [
    {"from": "Concept A", "to": "Concept B", "type": "leads_to"}
  ],
  "quotes": ["Notable quote (short)"],
  "metadata": {
    "source_type": "pdf|markdown|text",
    "word_count": 1234,
    "complexity": "low|medium|high"
  }
}
```
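
Before handing a saved file to a generator, it can help to sanity-check its shape. The following is a minimal sketch, not part of the skill: `check_decomposed` is a hypothetical helper, the required keys are simply the top-level keys in the example above (with `subtitle` treated as optional), and the rule that relationship endpoints must name a listed concept is an assumption.

```python
# Top-level keys taken from the example output; "subtitle" is optional.
REQUIRED_KEYS = {"title", "sections", "key_concepts", "relationships", "quotes", "metadata"}


def check_decomposed(data: dict) -> list[str]:
    """Return a list of problems found in a decomposed-document dict (hypothetical helper)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - data.keys())]
    for i, section in enumerate(data.get("sections", [])):
        if "header" not in section:
            problems.append(f"sections[{i}] has no header")
        if not isinstance(section.get("key_points", []), list):
            problems.append(f"sections[{i}].key_points is not a list")
    # Assumption: each relationship endpoint should match a key_concepts term.
    terms = {c.get("term") for c in data.get("key_concepts", [])}
    for rel in data.get("relationships", []):
        for end in ("from", "to"):
            if rel.get(end) not in terms:
                problems.append(f"relationship endpoint not in key_concepts: {rel.get(end)!r}")
    return problems
```

An empty list means the dict has the expected shape; load the saved file with `json.load` and pass the result in.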

## Text Length Constraints

The decomposer enforces text length limits optimized for AI image generation:

| Element | Max Length | Rationale |
|---------|------------|-----------|
| Title | 40 characters | Fits in header banners |
| Section headers | 5 words | Clear visual hierarchy |
| Key points | 6 words | Readable in boxes |
| Concepts | 3 words | Icon labels |
| Descriptions | 10 words | Supporting text |
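
To make the limits concrete, here is a rough sketch of how text could be clipped to them. It is an illustration only: `clip_words` and `clip_title` are hypothetical names, and the decomposer may shorten text differently (for example by rephrasing instead of truncating).

```python
# Limits copied from the table above.
MAX_TITLE_CHARS = 40
WORD_LIMITS = {"header": 5, "key_point": 6, "concept": 3, "description": 10}


def clip_words(text: str, max_words: int) -> str:
    """Keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def clip_title(title: str, max_chars: int = MAX_TITLE_CHARS) -> str:
    """Trim a title to max_chars, cutting at a word boundary when possible."""
    title = " ".join(title.split())  # normalize whitespace
    if len(title) <= max_chars:
        return title
    # Look one character past the limit so a word ending exactly at the limit survives.
    cut = title[: max_chars + 1].rsplit(" ", 1)[0]
    return cut[:max_chars].rstrip()


header = clip_words("A very long section header that keeps going", WORD_LIMITS["header"])
```

Here `header` ends up as the first five words of the input.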

## Parsing Modes

### Heuristic Mode (Default)
Fast regex-based parsing. Works best on well-structured Markdown files.

```bash
python doc_decomposer.py document.md
```
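
As an illustration of what regex-based parsing of Markdown can look like, the sketch below pulls a title, section headers, and bullet points out of a string. It is a simplified stand-in, not the skill's actual parser: `parse_markdown` and both patterns are assumptions, and it ignores fenced code blocks and nested lists.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$")        # H1-H3 only
BULLET_RE = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+(.+)$")  # bulleted or numbered items


def parse_markdown(text: str) -> dict:
    """Extract a title plus sections with their bullet points (hypothetical helper)."""
    title = ""
    sections = []
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            level, name = len(header.group(1)), header.group(2)
            if level == 1 and not title:
                title = name  # the first H1 becomes the title
            else:
                sections.append({"header": name, "key_points": []})
            continue
        bullet = BULLET_RE.match(line)
        if bullet and sections:
            # Attach the item to the most recent section.
            sections[-1]["key_points"].append(bullet.group(1).strip())
    return {"title": title, "sections": sections}
```

Documents without `#` headers give this approach nothing to latch onto, which is the gap the LLM mode below is meant to cover.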

### LLM Mode (Recommended for Complex Documents)
Uses Google Gemini for semantic parsing. Prefer it over heuristic mode for:
- Slide-based PDFs
- Unstructured text
- Complex layouts
- Documents without clear headers

```bash
python doc_decomposer.py document.pdf --llm
```

Requires the `GOOGLE_API_KEY` environment variable to be set.
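
When scripting the CLI, one option is to fall back to heuristic mode whenever the key is missing. A small sketch, assuming a hypothetical `build_command` wrapper; only the `--llm` flag and the `GOOGLE_API_KEY` variable come from this document.

```python
import os


def build_command(path: str, env=None) -> list[str]:
    """Return the CLI argv, adding --llm only when GOOGLE_API_KEY is set (hypothetical helper)."""
    env = os.environ if env is None else env
    cmd = ["python", "doc_decomposer.py", path]
    if env.get("GOOGLE_API_KEY"):
        cmd.append("--llm")
    return cmd
```

The resulting list can be passed to `subprocess.run` from the directory that contains `doc_decomposer.py`.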

## Supported Formats

- **PDF**: Extracts text, detects headers by font size/weight
- **Markdown**: Parses H1-H3, lists, bold/emphasis
- **Plain Text**: Uses heuristics for structure detection
- **DOCX**: Extracts paragraphs and headings (requires python-docx)
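
If you call `decompose_text` yourself and need a `source_type` value, a file extension is usually enough to pick one. The mapping below is a sketch under assumptions: `guess_source_type` is a hypothetical helper, the `"docx"` label and the plain-text fallback are guesses, and the decomposer's own detection may differ.

```python
from pathlib import Path

# Assumed mapping from file extension to source_type label.
SOURCE_TYPES = {
    ".pdf": "pdf",
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "text",
    ".docx": "docx",  # assumption: not listed in the metadata example
}


def guess_source_type(path: str) -> str:
    """Map a path to a source_type label, defaulting to plain text."""
    return SOURCE_TYPES.get(Path(path).suffix.lower(), "text")
```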

## API Methods

### `DocumentDecomposer()`
Initialize the decomposer.

### `decompose_file(path) -> DecomposedDocument`
Parse a file and extract structure.

### `decompose_text(text, source_type="text") -> DecomposedDocument`
Parse raw text content.

### `DecomposedDocument.save(path)`
Save to a JSON file for use with generators.

### `DecomposedDocument.to_dict() -> dict`
Get the structured content as a dictionary.

### `DecomposedDocument.for_sketchnote() -> dict`
Get content formatted specifically for the sketchnote generator.

### `DecomposedDocument.for_infographic() -> dict`
Get content formatted specifically for the infographic generator.

## Integration with Generators

### With Sketchnote Generator
```python
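import os

from doc_decomposer import DocumentDecomposer
from sketchnote_generator import SketchnoteGenerator

# Assumption: the generator accepts the same key that LLM mode reads from
# GOOGLE_API_KEY; substitute whichever key your generator expects.
key = os.environ["GOOGLE_API_KEY"]

# Decompose
decomposer = DocumentDecomposer()
doc = decomposer.decompose_file("meeting_notes.pdf")

# Generate sketchnote
generator = SketchnoteGenerator(api_key=key)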
result = generator.generate_from_decomposed(doc.for_sketchnote())
```

### With Infographic Generator
```python
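import os

from doc_decomposer import DocumentDecomposer
from infographic_generator import InfographicGenerator

# Assumption: the generator accepts the same key that LLM mode reads from
# GOOGLE_API_KEY; substitute whichever key your generator expects.
key = os.environ["GOOGLE_API_KEY"]

# Decompose
decomposer = DocumentDecomposer()
doc = decomposer.decompose_file("article.md")

# Generate infographic
generator = InfographicGenerator(api_key=key)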
result = generator.generate_from_decomposed(doc.for_infographic())
```

## CLI Usage

```bash
# Decompose a document
python doc_decomposer.py /path/to/document.pdf

# Specify output file
python doc_decomposer.py /path/to/document.pdf -o output.json

# Output for specific generator
python doc_decomposer.py /path/to/document.pdf --for sketchnote
python doc_decomposer.py /path/to/document.pdf --for infographic
```
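
For reference, the flags used in this document (`-o`, `--for`, `--llm`) could be declared with `argparse` roughly as follows. This is a sketch of one possible declaration, not the script's real parser; `build_parser` and the `dest` names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI flags shown in this document (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="doc_decomposer.py")
    parser.add_argument("input", help="document to decompose")
    parser.add_argument("-o", dest="output", help="output JSON file")
    parser.add_argument("--llm", action="store_true", help="use LLM mode")
    # "for" is a Python keyword, so the value is stored under a different name.
    parser.add_argument("--for", dest="target", choices=["sketchnote", "infographic"])
    return parser


args = build_parser().parse_args(["document.pdf", "--for", "sketchnote", "-o", "output.json"])
```

Restricting `--for` with `choices` makes a mistyped generator name fail at parse time instead of producing an unexpected output shape.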

## Configuration

### Environment Variables
```bash
DOC_DECOMPOSER_MAX_SECTIONS=6      # Max sections to extract (default: 6)
DOC_DECOMPOSER_MAX_CONCEPTS=8      # Max key concepts (default: 8)
DOC_DECOMPOSER_MAX_QUOTES=3        # Max quotes to include (default: 3)
```
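
A sketch of how these variables could be read with their documented defaults. `read_limits` is a hypothetical helper, and falling back to the default on a missing, non-numeric, or non-positive value is an assumption about sensible behavior, not a description of the implementation.

```python
import os

# Defaults as documented above.
DEFAULTS = {
    "DOC_DECOMPOSER_MAX_SECTIONS": 6,
    "DOC_DECOMPOSER_MAX_CONCEPTS": 8,
    "DOC_DECOMPOSER_MAX_QUOTES": 3,
}


def read_limits(env=None) -> dict[str, int]:
    """Read the limits from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    limits = {}
    for name, default in DEFAULTS.items():
        try:
            value = int(env.get(name, ""))
        except ValueError:
            value = default  # unset or not an integer
        limits[name] = value if value > 0 else default
    return limits
```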

## Troubleshooting

### PDF extraction issues
- Ensure PyPDF2 or pdfplumber is installed
- Scanned PDFs have no text layer to extract. Run them through an OCR tool first; the decomposer does not perform OCR itself.
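
A quick way to see which of the two PDF libraries is importable in the current environment, using only the standard library. `available_pdf_backends` is a hypothetical helper; the module names are the import names of the two packages mentioned above.

```python
from importlib.util import find_spec


def available_pdf_backends() -> list[str]:
    """Return the PDF libraries that can be imported in this environment."""
    return [name for name in ("PyPDF2", "pdfplumber") if find_spec(name) is not None]


backends = available_pdf_backends()
if not backends:
    print("No PDF backend found; install PyPDF2 or pdfplumber.")
```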

### Structure not detected
- Use Markdown format for best results
- Add clear headers (##) to improve parsing

### Output too verbose
- Lower the limits with the `DOC_DECOMPOSER_MAX_*` environment variables (see Configuration)
- The decomposer prioritizes the most important content

## File Locations

- Skill: `~/.claude/skills/doc-decomposer/SKILL.md`
- Implementation: `~/.claude/skills/doc-decomposer/doc_decomposer.py`
