Data Extractor

Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.

Overview

The Data Extractor is a skill that lets AI agents process unstructured and semi-structured information from a variety of file formats. It is available in the TerminalSkills/skills repository and enables agents such as Claude, Gemini, and Codex to parse content from PDFs, DOCX files, HTML, and images. It focuses on transforming raw text, invoices, and reports into organized formats such as JSON or CSV for further analysis, so users can automate the conversion of free-text documents into machine-readable data structures. The TerminalSkills collection, which hosts this tool, currently has 72 stars on GitHub, reflecting its utility for developers building data-driven agentic workflows and automated document processing pipelines.

Use Cases

Converting scanned invoice images into structured JSON for accounting software integration.
Parsing complex PDF reports to extract specific data points for pandas-based analysis.
Transforming unstructured HTML or text files into CSV format for database ingestion.
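
To make the target output of these use cases concrete, here is a minimal Python sketch of turning free text into a JSON record and a CSV row. It is not the skill's own implementation: the sample invoice text, the field names, and the regular expressions are all hypothetical, and a real document would first need text extraction (for example OCR for scanned images), which is not shown.

```python
import csv
import io
import json
import re

# Hypothetical sample input; not taken from the skill's documentation.
SAMPLE_INVOICE = """\
Invoice Number: INV-1042
Date: 2024-03-15
Bill To: Acme Corp
Total Due: $1,250.50
"""

# Hypothetical field patterns, one regular expression per field to extract.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice Number:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "customer": r"Bill To:\s*(.+)",
    "total": r"Total Due:\s*\$([\d,]+\.\d{2})",
}


def extract_fields(text):
    """Return a dict of extracted fields; fields that are not found are None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        record[name] = match.group(1).strip() if match else None
    # Normalise the amount to a number so downstream tools can do arithmetic.
    if record["total"] is not None:
        record["total"] = float(record["total"].replace(",", ""))
    return record


def to_csv(records):
    """Serialise a list of records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELD_PATTERNS), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = extract_fields(SAMPLE_INVOICE)
print(json.dumps(record, indent=2))  # JSON form, e.g. for an accounting system
print(to_csv([record]))              # CSV form, e.g. for database ingestion
```

Fields that cannot be found are left as None rather than guessed, which keeps missing data visible to whatever consumes the output.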

Install Notes

# Review source first
open https://github.com/TerminalSkills/skills/blob/main/skills/data-extractor/SKILL.md

Copy or clone the skill folder into your agent skills directory after reviewing its instructions and scripts.

Security Notes

This skill processes document content to generate structured outputs. Users should ensure that sensitive information within PDFs, images, or text files is handled according to their specific privacy requirements. The skill runs within the execution environment of the compatible AI agent, so data handling is subject to the permissions granted to that agent.

Related Skills