Enterprise GraphRAG Infrastructure

AutoData by Kiwi AI

Enterprise GraphRAG Infrastructure Platform

📅 Whitepaper 📆 April 2026 🏢 Kiwi AI 🔗 meetkiwi.ai
Executive Summary

Overview

Graph-augmented retrieval-augmented generation — GraphRAG — is rapidly becoming the defining architecture for enterprise AI that must reason over connected knowledge. Where standard vector retrieval returns isolated text snippets, GraphRAG adds relational context: who is connected to what, what depends on what, what concepts co-occur across documents and systems. The improvement in answer quality for relationship-heavy enterprise queries is substantial, and adoption is accelerating.

Yet the market remains structurally incomplete. Most vendors are strong in one layer — graph traversal, vector retrieval, ontology, governance, orchestration, or cloud infrastructure — but few productize the operational middle layer that makes GraphRAG work in practice.

AutoData is the only developer-accessible GraphRAG platform in the reviewed set that productizes the entire ingestion-to-graph pipeline — covering multimedia, all major document formats, and structured data, with production safety hardening, per-file job tracking, and application-layer RBAC that competitors leave to the buyer to build.
📄

Broad File Ingestion

30+ file extensions across documents, spreadsheets, code, and multimedia — all in one system.

🎥

Multimedia GraphRAG

Audio, video, and image files processed via Gemini — a first-mover capability in the market.

🔐

App-Layer RBAC

Four-level KB role hierarchy with personal, shared, and fan-out multi-KB search.

⚙️

Production Hardening

ZIP-bomb detection, prompt-injection scrubbing, cgroup-aware RAM scaling, and POSIX locks.

🕸️

Dual-Store Indexing

Every document lands in both Weaviate (vectors) and Neo4j (graph) with graceful degradation.

🔄

Async Job System

Redis priority queuing, MongoDB state machine, UUID readiness probes — no fire-and-forget.

🎯

AutoData's wedge is more specific, and more commercially durable, than any single incumbent's position. It fills the gap between all four market camps — infrastructure, graph-native, framework, and enterprise ontology — with a productized ingestion-to-graph pipeline.

Part I

The Market Opportunity

GraphRAG Is Not a Product — It Is a Stack

Enterprise buyers who want GraphRAG quickly discover that they are not buying a single product. They are assembling at minimum seven distinct layers:

1️⃣

Document Parsing

Format handling across PDFs, spreadsheets, HTML, code, and media.

2️⃣

Chunking

Structural awareness preserving context and relationships between content blocks.

3️⃣

Vector Indexing

Embedding and storing chunks for semantic similarity search.

4️⃣

Graph Extraction

Identifying entities, relationships, and communities from indexed content.

5️⃣

Graph Storage & Traversal

Multi-hop Cypher queries over a production knowledge graph.

6️⃣

Access Control

Multi-tenancy and per-user scoping across shared knowledge bases.

7️⃣

Job Orchestration

Async queue, per-file status, retry, and observability in production.

Each layer has its own failure modes, size limits, format gaps, and operational requirements. Most vendors productize one or two layers well. The rest is left to the buyer's engineering team. That assembly cost — in engineering time, operational risk, and ongoing maintenance — is often the hidden budget line item that derails enterprise GraphRAG programs.

Market Structure

The competitive landscape fragments into four camps, none of which delivers a complete end-to-end solution.

🏗️

Infrastructure-Led GraphRAG

Microsoft GraphRAG and AWS Bedrock attract attention through ecosystem scale, but impose significant ingestion constraints (e.g. only .txt, .csv, .json for Microsoft; S3-only with Claude 3 Haiku locked for AWS).

🕸️

Graph-Native Platforms

Neo4j dominates graph traversal credibility, but expects external parsers for all formats beyond an experimental PDF loader.

🔧

Framework & Toolchain

LlamaIndex, LangChain, and Weaviate are composable building blocks. Assembly inherits the limitations of each component. Weaviate's Verba has had an unresolved GraphRAG support request since 2024.

🏛️

Enterprise Ontology Platforms

Palantir, Graphwise, and Stardog lead in RDF/OWL compliance, but require professional services and widely reported minimum contracts of approximately $1M+.

💡

The scale of these incumbents is a distribution advantage, not a product advantage at the ingestion layer. AutoData's differentiation is precisely in the parts of the stack that competitors leave undocumented because they leave them unbuilt.

Part II

AutoData — What It Is and What It Does

Product Definition

AutoData is an end-to-end GraphRAG ingestion and retrieval infrastructure platform. In a single operational system, it provides:

  • Broad enterprise file ingestion across documents, spreadsheets, JSON, code and configuration files, and multimedia
  • Dual-store indexing into Weaviate for vector retrieval and Neo4j for graph traversal
  • Multimedia ingestion with adaptive vectorization or Gemini-powered transcription and summarization
  • A dedicated spreadsheet ingestion engine with structural awareness and safety controls
  • An application-layer knowledge base model with role-based access control and multi-KB fan-out search
  • An asynchronous job system with per-file status tracking and readiness probes
  • Production safety hardening including ZIP-bomb detection, prompt-injection scrubbing, and container-aware memory scaling

The Positioning

AutoData is not a graph database, a vector database, a semantic web platform, or a generic RAG framework. It is the operational layer that makes enterprise GraphRAG deployable over real enterprise content.

🎯

Most enterprise buyers think they are buying "GraphRAG." In practice, they are buying — or building — an entire operating layer. AutoData's commercial positioning is strongest when framed as the system that productizes the hard, neglected part of the stack that buyers consistently struggle to operationalize themselves.

Part III

Format and Size Support

What AutoData Ingests

AutoData routes files through purpose-built handlers across more than thirty file extensions in six categories.

📝

Documents

PDF, DOC, DOCX, ODT, RTF, RST, TXT, MD, EPUB, EML, MSG, P7S — via Unstructured with optional Gemini pre-parsing for OCR and layout extraction.

📊

Spreadsheets

CSV, TSV, XLS, XLSX, XLSM, XLSB — via a dedicated spreadsheet engine entirely separate from Unstructured, with documented safety bounds and graph output at zero LLM cost.

📋

Presentations

PPT, PPTX — processed through Unstructured.

🌐

Web & Markup

HTML, HTM, XHTML, XML — processed through Unstructured.

🔗

Structured Data

JSON — via a custom leaf-walk chunker that creates one chunk per leaf node with its full JSON path as context.
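The leaf-walk idea can be sketched in a few lines of Python; `leaf_walk` is a hypothetical name for illustration, not AutoData's actual API:

```python
import json

def leaf_walk(node, path="$"):
    """Yield (json_path, value) pairs, one per leaf node."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_walk(child, f"{path}.{key}")
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from leaf_walk(child, f"{path}[{i}]")
    else:
        # Leaf node: the full JSON path travels with the value as context
        yield path, node

doc = json.loads('{"invoice": {"id": 42, "lines": [{"sku": "A1"}]}}')
chunks = [f"{p} = {v}" for p, v in leaf_walk(doc)]
# chunks → ["$.invoice.id = 42", "$.invoice.lines[0].sku = A1"]
```

Because each chunk carries its full path, a retrieved leaf like `$.invoice.lines[0].sku` remains interpretable even when the surrounding object is not in the context window.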

💻

Code & Config

Python, JavaScript, TypeScript, Java, Go, Rust, Ruby, PHP, C, C++, SQL, YAML, TOML, INI, env, Terraform, and 20+ more — via language-aware splitting.

🎵

Audio

MP3, WAV, M4A, OGG, FLAC, AAC, Opus — Gemini File API transcription and summarization.

🎬

Video

MP4, MOV, AVI, WebM, MKV, 3GP, HEVC — Gemini multimodal transcription and summarization.

🖼️

Images

PNG, JPG, JPEG, WEBP, HEIC, HEIF — adaptive CLIP vectorization or Gemini OCR summarization.

🗓️

Planned

Archives (ZIP, TAR, GZ, 7Z, RAR) and big-data formats (Parquet) — support is planned. Archives will be expanded recursively; Parquet is a columnar big-data format, not an executable.

🚫

Hard Blocked

Executable binaries (EXE, DLL, PKL, SQLite) are rejected with clear error messages.

Format Coverage Across Competitors

| Format | AutoData | MS GraphRAG | AWS Bedrock | Unstructured API | Neo4j lib | LlamaIndex | Vectara | Glean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PDF / DOCX / HTML | ✅ | Preprocessing req. ¹ | ✅ ² | ✅ ³ | Exp. PDF only ⁴ | ✅ via LlamaParse | ✅ ⁵ | — |
| CSV / TSV | ✅ structured | ✅ flat ¹ | ✅ ² | ✅ ³ | BYO parsing | ✅ flat | Not in core API ⁵ | ✅ via M365 |
| XLSX | ✅ | Not documented ¹ | ✅ .xls/.xlsx ² | ✅ .xls/.xlsx ³ | BYO parsing | ✅ via LlamaParse | Not listed ⁵ | ✅ via M365 |
| XLSM (macro-enabled) | ✅ | Not documented ¹ | Not documented ² | Not supported ³ ⁶ | Not documented | ☁️ Cloud API ⁷ᵃ | Not listed ⁵ | Not documented |
| XLSB (binary Excel) | ✅ | Not documented ¹ | Not documented ² | Not listed ³ | Not documented | ☁️ Cloud API ⁷ᵃ | Not listed ⁵ | Not documented |
| JSON (deep structure) | ✅ leaf-walk | ✅ flat ¹ | Not documented ² | Not listed ³ | Not documented | ✅ basic | Not listed ⁵ | — |
| Code / config (30+ ext.) | ✅ lang-aware | Not documented | Not documented | Not documented | Not documented | Some support | Not documented | ✅ via GitHub |
| Audio (MP3/WAV/Opus…) | ✅ Gemini | Not documented | Not documented | Not listed ³ | Not documented | Not documented | Not documented | Via Zoom transcripts only |
| Video (MP4/MOV/MKV…) | ✅ Gemini | Not documented | Not documented | Not listed ³ | Not documented | Not documented | Not documented | Via Teams transcripts only |
| Images (inline + standalone) | ✅ vectorize or summarize | Not documented | JPEG/PNG in S3 (3.75 MB limit) ² | ✅ ³ | Not documented | Via LlamaParse | Not documented | — |
| Archives / big-data (Parquet) | 🗓️ planned | Not listed ¹ᵃ | Not listed ² | Not listed ³ | No parser layer | Not listed ⁷ | Not listed ⁵ | Not mentioned ⁸ |
| Executables / binaries | 🚫 hard blocked | Not listed ¹ | Not listed ² | Not listed ³ | No parser layer | Not listed ⁷ | Not listed ⁵ | Not mentioned ⁸ |

Sources: ¹ microsoft.github.io/graphrag/index/inputs/ and GitHub Discussion #375 · ¹ᵃ MS GraphRAG outputs pipeline results as Parquet files but does not ingest Parquet as input · ² docs.aws.amazon.com/bedrock (JPEG/PNG supported for multimodal S3 sources at a separate 3.75 MB limit) · ³ docs.unstructured.io · ⁴ graphacademy.neo4j.com · ⁵ docs.vectara.com · ⁶ github.com/Unstructured-IO/unstructured/issues/4047 — XLSM confirmed "not recognized" · ⁷ developers.llamaindex.ai/python/cloud/general/supported_document_types/ · ⁷ᵃ LlamaParse lists XLSM/XLSB as "Also supported" on its canonical types page; it is a paid cloud API ($0.003/page) with no published ZIP-bomb, cgroup, or POSIX lock documentation — raises data sovereignty concerns for enterprise financial workbooks · ⁸ docs.glean.com/connectors/crawler-and-indexing-limits (size-limits page only; no supported-format matrix found in reviewed documentation)

File Size Handling

AutoData enforces explicit size controls at multiple ingestion points. Remote document downloads are capped at 200 MB. Media files are capped at 200 MB during streaming download.

⚠️

For comparison: Unstructured API enforces 10 MB per file, 10 files per job, and a maximum of 5 on-demand jobs running simultaneously per official documentation. AWS Bedrock enforces a 50 MB per-file limit for documents and a separate 3.75 MB limit for JPEG/PNG images. Vectara's officially documented upload limit is 10 MB per file. Glean items above 64 MB receive metadata-only indexing, and text extraction is capped at ~16.875 MB with content beyond that limit silently truncated.

ℹ️

AutoData uses Unstructured for document parsing (PDF, DOCX, HTML). For spreadsheets, AutoData bypasses Unstructured entirely and uses its own dedicated engine — not constrained by Unstructured's per-file limits. Spreadsheet size is constrained only by server RAM, and scales linearly with infrastructure: adding RAM directly increases the maximum processable workbook size, making this limit fully elastic in cloud and containerized deployments.

Part IV

Multimedia GraphRAG — A First-Mover Capability

Enterprise knowledge increasingly lives in non-text formats: recorded meetings, training videos, customer call recordings, product demonstrations, embedded charts, and scanned documents with annotations. Most GraphRAG platforms treat these as out of scope. AutoData treats them as first-class ingestion targets.

The Multimedia Pipeline

🖼️

Image Processing

Strategy is selected automatically at store creation. With multi2vec-clip available: base64 stored, CLIP cross-modal embeddings. Without it: Gemini OCR & description → text chunk through the same pipeline. Images embedded in PDFs and DOCX are also extracted and processed.

🎵

Audio Processing

MP3, WAV, M4A, OGG, FLAC, AAC, Opus files are streamed, uploaded to Gemini File API, transcribed, and ingested as standard chunks eligible for both vector indexing and graph extraction. Temp files and Gemini copies are always deleted in finally blocks.

🎬

Video Processing

MP4, MOV, AVI, WebM, MKV, and more follow the same pipeline as audio via Gemini's multimodal capabilities. The resulting text flows into the same vector and graph indexing path as any document.

📄

Embedded Images

Images extracted from PDFs and DOCX documents are processed through the same adaptive image pipeline — a financial report with embedded charts produces both text chunks and image description chunks, all linked in the knowledge graph.

🏆

None of the ten benchmark competitors — Microsoft GraphRAG, AWS Bedrock, Neo4j, LlamaIndex, LangChain, Weaviate, Palantir, Databricks, GraphAware Hume, or Graphwise — publicly document an end-to-end pipeline from audio or video files to knowledge graph nodes. This is a first-mover capability in the enterprise GraphRAG market.

Part V

Spreadsheet Ingestion

Spreadsheets are among the most common enterprise knowledge sources — used by finance, operations, procurement, and back-office functions to store critical business knowledge. They are also structurally complex: merged headers, multiple tables per sheet, wide columns, macro-enabled and binary formats. AutoData uses a dedicated spreadsheet engine.

Format Coverage and Safety Bounds

AutoData's differentiation on spreadsheets is not simply which format names are listed — it is what happens inside: documented safety bounds enforced before any decompression, graph output at zero LLM cost, and no dependency on a third-party cloud parsing API.

📁

CSV & TSV

Streaming CSV parsing — no memory cliff for large flat files.

📗

XLS

Handled via xlrd for legacy Excel workbooks.

📘

XLSX & XLSM

openpyxl in read-only mode. Unstructured GitHub Issue #4047 confirms XLSM is "not recognized" by Unstructured. LlamaParse lists XLSM as supported via its paid cloud API but documents no safety controls — sending enterprise financial workbooks to a third-party cloud API raises data sovereignty concerns for most enterprise buyers.

💾

XLSB

Binary Excel format handled via pyxlsb. Not documented in AWS Bedrock, Unstructured, or Microsoft GraphRAG. LlamaParse lists XLSB as supported via its paid cloud API ($0.003/page) but documents no ZIP-bomb protection, cgroup-aware memory scaling, or POSIX locking.

Header Detection, Cell Propagation, and Table Segmentation

  • Identifies header rows using a density + string/numeric ratio heuristic; detects up to 3 stacked header rows and flattens them into compound labels like Revenue / Q1 / Actual
  • Propagates merged cell anchor values to every dependent row — no silent blank cells in any returned chunk
  • Detects multiple independent tables within a single sheet and segments them independently, preventing cross-table entity contamination in the knowledge graph
  • Captures side annotations, footnotes, and sidebar data as separate notes chunks — no content is silently dropped
  • Enforces an 80× ZIP expansion ratio cap, a 50,000 ZIP member cap, and a specific guard on xl/sharedStrings.xml — the single most common cause of openpyxl memory exhaustion on large workbooks
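A minimal sketch of the stacked-header flattening described above; the exact density and string-ratio thresholds used here are illustrative assumptions, not AutoData's documented values:

```python
def is_header_row(row):
    """Heuristic: a header row is dense and mostly non-numeric strings."""
    cells = [c for c in row if c not in (None, "")]
    if len(cells) < len(row) * 0.5:          # density check
        return False
    strings = [c for c in cells if isinstance(c, str)]
    return len(strings) / len(cells) >= 0.7  # string/numeric ratio check

def flatten_headers(rows, max_stacked=3):
    """Flatten up to `max_stacked` stacked header rows into compound labels."""
    headers = []
    for row in rows[:max_stacked]:
        if not is_header_row(row):
            break
        headers.append(row)
    labels = []
    for col in zip(*headers):
        parts = [str(c) for c in col if c not in (None, "")]
        labels.append(" / ".join(parts))
    return labels, len(headers)

rows = [
    ["Revenue", "Revenue", "Costs"],
    ["Q1", "Q2", "Q1"],
    ["Actual", "Actual", "Plan"],
    [1200, 1350, 900],
]
labels, n = flatten_headers(rows)
# labels → ["Revenue / Q1 / Actual", "Revenue / Q2 / Actual", "Costs / Q1 / Plan"]
```

The numeric row fails the string-ratio check, so header detection stops at three stacked rows and each column gets a compound label in the `Revenue / Q1 / Actual` style.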

Five Chunk Types Per Workbook

📋

Workbook Summary

Lists all sheets and their detected structure.

📄

Sheet Summary

Per-sheet summary listing detected columns.

📊

Row Data Chunks

Header-aware key-value rendering of data rows.

📝

Notes Chunks

Content above, below, or between tables.

📌

Side-Notes Chunks

Content outside the main table column span.

Safety Controls

  • ZIP container inspected before any decompression: 80× expansion ratio cap, 50,000 ZIP member cap, specific guard on xl/sharedStrings.xml
  • Cross-process POSIX lock serializes heavy spreadsheet operations across worker processes — prevents memory spikes in multi-worker deployments
  • Container memory limits read from Linux cgroup filesystem at runtime (v1 and v2) — processing caps self-adjust to available RAM in ECS Fargate and Kubernetes environments
  • None of these controls are documented for any competitor in the reviewed set

Deterministic Graph at Zero LLM Cost

💰

For spreadsheets, AutoData builds a Neo4j knowledge graph using rule-based logic — sheet concept nodes, chunk adjacency edges, and document-to-sheet relationships — without any LLM call. Microsoft GraphRAG's official cost documentation shows $0.34 per 30,000-word corpus with GPT-4o-mini, with GPT-4o substantially more expensive per the same source. AutoData's deterministic path costs only compute.
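The rule-based construction can be sketched as follows; the `HAS_SHEET` relationship name and the dict shapes are assumptions for illustration (the document names `NEXT` edges and sheet concept nodes), not AutoData's actual schema:

```python
def spreadsheet_graph(doc_id, sheets):
    """Build rule-based graph nodes and edges for a workbook: no LLM calls.

    `sheets` maps sheet name -> ordered list of chunk ids.
    Returns (nodes, edges) ready to be written to a graph store.
    """
    nodes, edges = [{"id": doc_id, "label": "Document"}], []
    for sheet, chunk_ids in sheets.items():
        sheet_id = f"{doc_id}/{sheet}"
        nodes.append({"id": sheet_id, "label": "SheetConcept"})
        edges.append((doc_id, "HAS_SHEET", sheet_id))     # document-to-sheet link
        for a, b in zip(chunk_ids, chunk_ids[1:]):
            edges.append((a, "NEXT", b))                  # chunk adjacency edge
    return nodes, edges

nodes, edges = spreadsheet_graph("wb1", {"Sales": ["c1", "c2", "c3"]})
# edges → [("wb1", "HAS_SHEET", "wb1/Sales"), ("c1", "NEXT", "c2"), ("c2", "NEXT", "c3")]
```

Everything here is deterministic string and list manipulation, which is why the graph-extraction step for spreadsheets costs only compute.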

Part VI

Knowledge Base Model and Access Control

🔒

Among developer-accessible GraphRAG platforms, application-layer knowledge base access control is essentially absent. AutoData is the only developer-accessible platform in the reviewed set with full application-layer KB RBAC built in.

AutoData's KB Architecture

🎭

Role Hierarchy

Four roles: READ, READ_WRITE, READ_WRITE_UPDATE, ADMIN. Role checks enforced on every ingestion and search operation. KB owner has implicit ADMIN.
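The four-level hierarchy with implicit owner ADMIN can be sketched as an ordered enum; `require_role` is a hypothetical helper, not AutoData's actual API:

```python
from enum import IntEnum

class KBRole(IntEnum):
    READ = 1
    READ_WRITE = 2
    READ_WRITE_UPDATE = 3
    ADMIN = 4

def require_role(user_role, needed, *, is_owner=False):
    """Owner gets implicit ADMIN; otherwise the role must rank high enough."""
    effective = KBRole.ADMIN if is_owner else user_role
    if effective < needed:
        raise PermissionError(f"needs {needed.name}, has {effective.name}")
    return True

require_role(KBRole.READ_WRITE, KBRole.READ)            # ok: a writer can read
require_role(KBRole.READ, KBRole.ADMIN, is_owner=True)  # ok: owner is implicit ADMIN
```

Enforcing the check on every ingestion and search call, rather than at login, is what makes per-operation KB scoping possible.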

🗂️

KB Scoping

Three modes: personal (private KB), kb (specific shared KB), all (fan-out search across all accessible KBs). Ingestion cannot target all KBs simultaneously.

🔍

Fuzzy KB Name Resolution

Two-stage normalization pipeline with ranked disambiguation. Detects collisions between personal KB aliases and similarly named shared KBs — prevents silent mis-scoping.

🌐

Multi-KB Fan-Out Search

Searches personal and all shared KBs concurrently, then runs a second LLM call to synthesize a coherent answer, explicitly reconciling conflicts across sources.

👥

Bulk Membership Management

Add up to 500 members in a single API call with comma/semicolon splitting, deduplication, format validation, partial success reporting, and redundant-update detection.

RBAC Comparison

| Vendor | Per-user KB Scoping | Role Hierarchy | Multi-KB Fan-Out | Self-Serve |
| --- | --- | --- | --- | --- |
| AutoData | ✅ personal/kb/all | ✅ four-level | ✅ concurrent LLM synthesis | ✅ |
| Microsoft GraphRAG | Not documented | Not documented | Not documented | N/A |
| AWS Bedrock GraphRAG | IAM KB-level only | Not documented | Not documented | — |
| Neo4j GraphRAG lib | Not built in | Not built in | Not built in | High effort |
| LlamaIndex / LangChain | Not built in | Not built in | Not built in | — |
| Weaviate | DB shard-level only | No app-layer roles | Not documented | — |
| Glean | Source-system ACLs; custom datasource APIs exist | Source-inherited | Not documented as multi-KB GraphRAG model | Closed ecosystem |
| Palantir | ✅ object-level | ✅ fine-grained | — | Requires team; ~$1M+ min. |
Part VII

GraphRAG Architecture

Dual-Store Architecture with Graceful Degradation

AutoData loads every ingested document into both Weaviate for vector retrieval and Neo4j for graph traversal. Backend availability is tracked independently per job. If vector indexing succeeds but graph extraction fails, the job is marked PARTIAL_AVAILABLE and vector retrieval continues to work — materially better than treating any backend failure as a complete job failure.
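The graceful-degradation logic reduces to tracking each backend independently and deriving the job state from the pair; this is a minimal sketch with stand-in callables, not AutoData's actual implementation:

```python
def index_document(doc, vector_store, graph_store):
    """Index into both stores; degrade gracefully instead of failing the job."""
    status = {}
    for name, store in (("vector", vector_store), ("graph", graph_store)):
        try:
            store(doc)
            status[name] = "AVAILABLE"
        except Exception as exc:
            status[name] = f"FAILED: {exc}"
    ok = [s for s in status.values() if s == "AVAILABLE"]
    if len(ok) == 2:
        job_state = "AVAILABLE"
    elif ok:
        job_state = "PARTIAL_AVAILABLE"   # e.g. vector search still works
    else:
        job_state = "FAILED"
    return job_state, status

state, _ = index_document({"id": 1}, lambda d: None, lambda d: 1 / 0)
# state → "PARTIAL_AVAILABLE": vector indexing succeeded, graph extraction failed
```

The per-backend status dict is what lets an administrator distinguish "retrievable but not traversable" from a true failure.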

Three-Context Answer Synthesis

🔎

Semantic Search Context

Top K chunks from Weaviate vector search with 4× over-fetching for diversity. Up to 3 retry attempts with exponential backoff for eventual consistency after recent ingestion.

🕸️

Knowledge Graph Context

Bounded graph traversal from retrieved chunk nodes in Neo4j. Structured entities and relationships formatted as human-readable context. Depth configurable from 1–5 hops.

📖

Full-Document Context

Complete content of top-ranked source documents (70% max chunk score + 30% avg score), token-aware truncation at 8,000 tokens with an explicit notice when content is cut.
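The document-ranking rule above (70% of the best chunk score plus 30% of the average) works out to:

```python
def document_score(chunk_scores):
    """Rank a source document by 70% of its best chunk + 30% of its average."""
    best = max(chunk_scores)
    avg = sum(chunk_scores) / len(chunk_scores)
    return 0.7 * best + 0.3 * avg

docs = {"report.pdf": [0.9, 0.4], "memo.docx": [0.7, 0.7]}
ranked = sorted(docs, key=lambda d: document_score(docs[d]), reverse=True)
# report.pdf: 0.7*0.9 + 0.3*0.65 = 0.825; memo.docx: 0.7*0.7 + 0.3*0.7 = 0.70
```

Weighting the maximum keeps a document with one highly relevant chunk competitive against a document with many mediocre ones, while the average term penalizes lucky single-chunk outliers.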

LLM Provider Flexibility

AutoData supports Azure OpenAI, standard OpenAI, Anthropic Claude, Google Gemini, Perplexity, DeepSeek, and Grok — switchable via a single configuration string, with new models continuously added as the ecosystem evolves. AWS Bedrock locks graph extraction to Claude 3 Haiku with no override supported per official documentation.

GraphRAG Capability Comparison

| Capability | AutoData | MS GraphRAG | AWS Bedrock | Neo4j lib | LlamaIndex |
| --- | --- | --- | --- | --- | --- |
| LLM provider flexibility | ✅ multiple providers; continuously expanding | OpenAI or Azure OpenAI | Claude 3 Haiku locked | ✅ any | ✅ any |
| Multimedia to graph nodes | ✅ audio/video/image | Not documented | Not documented | Not documented | Not documented |
| Deterministic graph at zero LLM cost | ✅ for spreadsheets | Not documented | Not documented | Not documented | Not documented |
| Partial availability state | ✅ PARTIAL_AVAILABLE | Not documented | Not documented | Not documented | Not documented |
| Two-stage graph validation | ✅ logical + schema | Not documented | Not documented | Not documented | Not documented |
| Async job queue with per-file status | ✅ full state machine | Not documented | Not documented | Not documented | Not documented |
| Global community detection | Not yet implemented | ✅ best-in-class Leiden | Partial | Not documented | Not documented |
Part VIII

Production Safety Hardening

Open-source GraphRAG frameworks assume controlled environments with trusted inputs. Enterprise production deployments encounter adversarial inputs, resource-constrained containers, and concurrent workers. AutoData addresses each attack surface as a built-in platform capability.

💣

ZIP-Bomb Detection

XLSX and XLSM files inspected as ZIP containers before decompression. Rejects if uncompressed/compressed ratio exceeds 80× or if more than 50,000 ZIP members. sharedStrings.xml specifically capped.
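The pre-decompression inspection relies on the fact that ZIP central-directory metadata (member count, compressed and uncompressed sizes) can be read without inflating any member. A minimal sketch, with the `sharedStrings.xml` cap as an assumed value:

```python
import io
import zipfile

MAX_RATIO, MAX_MEMBERS = 80, 50_000
MAX_SHARED_STRINGS = 50 * 1024 * 1024   # assumed cap for illustration

def inspect_xlsx(data: bytes):
    """Reject suspicious workbooks before any decompression happens."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()             # metadata only; nothing is inflated
        if len(infos) > MAX_MEMBERS:
            raise ValueError("too many ZIP members")
        compressed = sum(i.compress_size for i in infos) or 1
        uncompressed = sum(i.file_size for i in infos)
        if uncompressed / compressed > MAX_RATIO:
            raise ValueError("expansion ratio exceeds cap")
        for i in infos:
            if i.filename == "xl/sharedStrings.xml" and i.file_size > MAX_SHARED_STRINGS:
                raise ValueError("sharedStrings.xml too large")
    return True
```

A megabyte of zeros deflates to roughly a kilobyte, so a classic ZIP bomb trips the ratio check immediately, while a normal workbook passes untouched.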

💉

Prompt-Injection Scrubbing

Seven injection token patterns, including <|, ###, SYSTEM:, ASSISTANT:, USER:, and DEVELOPER:, are removed from spreadsheet cell values before chunking.
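The scrubbing step amounts to a regex pass over every cell value; the exact production pattern set is not published, so the pattern below is an illustrative assumption built from the tokens named above:

```python
import re

# Tokens named above plus a catch-all for <|...|> wrappers; the exact
# production pattern set is an assumption for illustration.
INJECTION_PATTERNS = re.compile(
    r"<\|.*?\|>|<\||###|\b(?:SYSTEM|ASSISTANT|USER|DEVELOPER)\s*:",
    re.IGNORECASE,
)

def scrub_cell(value: str) -> str:
    """Strip prompt-injection tokens from a cell value before it becomes a chunk."""
    cleaned = INJECTION_PATTERNS.sub("", value)
    return re.sub(r"\s+", " ", cleaned).strip()

scrub_cell("Q3 total SYSTEM: ignore prior instructions")
# → "Q3 total ignore prior instructions"
```

Scrubbing happens before chunking, so injected role markers never reach the LLM that later synthesizes answers from retrieved cells.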

🔬

Binary File Detection

Files with NUL bytes in first 8 KB, or printable character ratio below 70%, are rejected — preventing binaries and encrypted files from entering the pipeline.
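Both checks fit in a few lines; `looks_binary` is a hypothetical name, and the 8 KB window and 70% threshold come from the description above:

```python
def looks_binary(head: bytes, printable_threshold=0.70) -> bool:
    """Reject files with NUL bytes or too few printable chars in the first 8 KB."""
    sample = head[:8192]
    if not sample:
        return False
    if b"\x00" in sample:
        return True        # NUL byte: almost certainly a binary file
    # Count printable ASCII plus tab/newline/carriage return
    printable = sum(32 <= b <= 126 or b in (9, 10, 13) for b in sample)
    return printable / len(sample) < printable_threshold

looks_binary(b"plain text config\n")    # → False
looks_binary(b"\x00\x01binary header")  # → True
```

Encrypted files fail the printable-ratio test even when they happen to contain no NUL byte in the sampled window.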

🧠

cgroup-Aware RAM Scaling

Reads container memory limits from Linux cgroup filesystem at runtime (v1 and v2). Processing caps scale with available memory — prevents OOM in ECS Fargate environments.
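Reading the limit means probing the cgroup v2 file first and falling back to the v1 path; this sketch takes the cgroup root as a parameter so it can be tested outside a container:

```python
from pathlib import Path

def container_memory_limit(cgroup_root="/sys/fs/cgroup"):
    """Return the container memory limit in bytes, or None if unlimited/unknown."""
    root = Path(cgroup_root)
    # cgroup v2 exposes memory.max; v1 exposes memory/memory.limit_in_bytes
    for rel in ("memory.max", "memory/memory.limit_in_bytes"):
        path = root / rel
        if path.exists():
            raw = path.read_text().strip()
            if raw.isdigit():
                return int(raw)
            return None   # cgroup v2 writes the literal string "max" when unlimited
    return None
```

A processing cap can then be derived from the returned value (for example, some fraction of the limit), so the same image behaves correctly in ECS Fargate, Kubernetes, and on bare metal.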

🔐

Cross-Process POSIX Lock

File-based lock prevents multiple worker processes from simultaneously parsing large spreadsheets, avoiding memory spikes in multi-worker deployments.
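A file-based advisory lock of this kind is a few lines of `fcntl` on POSIX systems; the lock-file path here is an assumption for illustration:

```python
import fcntl
import os

class SpreadsheetLock:
    """Cross-process advisory lock: one worker parses large files at a time."""

    def __init__(self, path="/tmp/autodata-spreadsheet.lock"):
        self.path = path
        self.fd = None

    def __enter__(self):
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR, 0o644)
        fcntl.flock(self.fd, fcntl.LOCK_EX)   # blocks until the lock is free
        return self

    def __exit__(self, *exc):
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)

with SpreadsheetLock():
    pass  # parse the workbook while holding the lock
```

Because `flock` is advisory and tied to the open file description, every worker process that opens the same path contends on the same lock, and the kernel releases it automatically if a worker crashes.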

👻

Ghost Job Prevention

Job status record only created in MongoDB after S3 staging succeeds — prevents stuck QUEUED jobs that can never be processed.

🧹

Staging Cleanup

Files copied to job-specific S3 staging prefix before processing. Staging copy always deleted after completion or failure. Permanent user copy never deleted.

Transient vs. Permanent Failures

S3 eventual consistency, network timeouts → exponential backoff retry. Unsupported formats, invalid JSON → immediate failure with clear error messages.
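The transient/permanent split can be sketched as a retry wrapper that backs off exponentially for one exception class and fails fast for everything else; the class names and delays are illustrative assumptions:

```python
import time

class TransientError(Exception): ...   # e.g. S3 eventual consistency, timeouts
class PermanentError(Exception): ...   # e.g. unsupported format, invalid JSON

def with_retry(fn, attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff; fail fast otherwise."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...
        # PermanentError propagates immediately: retrying cannot fix bad input

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("S3 not yet consistent")
    return "ok"

with_retry(flaky)  # → "ok" after two retries
```

Classifying failures at raise time keeps the retry policy in one place instead of scattering timeout handling through every handler.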

Part IX

Asynchronous Job Infrastructure and Observability

⚠️

Microsoft GraphRAG and the Neo4j GraphRAG Python library are synchronous-only tools. There is no job queue, no per-file status tracking, and no way to observe progress or handle partial failures — acceptable for research but a serious operational problem for enterprise deployments where ingestion runs for minutes or hours.

AutoData's Production Job System

📬

Two-Tier Redis Queue

High-priority queue for user-initiated uploads; low-priority for background sync jobs. Blocking pop with configurable timeout. Up to 10 concurrent jobs by default.

🔄

Per-Job State Machine

Every job tracked in MongoDB with independent status for each backend. States: QUEUED → PROCESSING → AVAILABLE / PARTIAL_AVAILABLE / FAILED.

📡

UUID-Specific Readiness Probing

After ingestion, worker probes both Weaviate and Neo4j to verify chunks are actually queryable. Weaviate: 100% UUID success threshold. Neo4j: 75% threshold. Detects replication lag before marking available.
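The per-backend thresholds reduce to a ratio check over the ingested chunk UUIDs; `backend_ready` is a hypothetical helper illustrating the 100% and 75% thresholds named above:

```python
def backend_ready(found_uuids, expected_uuids, threshold):
    """A backend is ready once enough ingested chunk UUIDs are queryable."""
    if not expected_uuids:
        return True
    ratio = len(set(found_uuids) & set(expected_uuids)) / len(expected_uuids)
    return ratio >= threshold

expected = ["u1", "u2", "u3", "u4"]
backend_ready(expected, expected, threshold=1.0)    # Weaviate: all 4 found → ready
backend_ready(["u1", "u2", "u3"], expected, 0.75)   # Neo4j: 3 of 4 → ready
backend_ready(["u1", "u2"], expected, 0.75)         # 2 of 4 → not ready yet
```

Probing specific UUIDs, rather than pinging the service, is what catches replication lag: the service can be up while recently written chunks are still invisible to queries.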

📁

File Inventory API

Filter by status, backend status, date range, filename, and KB scope. Sort on multiple fields with pagination. Full visibility into ingestion state for enterprise administrators.

Part X

Real-Time Indexer

Beyond the primary ingestion pipeline, AutoData includes a real-time indexer that continuously monitors MongoDB collections for new and updated documents, indexing them into Weaviate and Neo4j without requiring explicit upload actions.

🎯

Configurable Collection Targets

Actions, ActionGraphs, EventResults, Chats — configurable collection monitoring.

🚀

Three Bootstrap Modes

current, hours-ago, or full-history — for different deployment scenarios.

💾

LRU Connection Cache

Weaviate store connections cached to avoid repeated initialization overhead.

🔀

Concurrent Processing

Semaphore-controlled parallelism prevents resource exhaustion.

📍

Cursor-Based State Tracking

Resumable operation after restarts — no documents are re-processed or skipped.

💪

Poison-Pill Resilience

A single malformed document cannot stall an entire collection's indexing pipeline.
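Cursor tracking and poison-pill resilience combine naturally in the indexing loop: a failed document is skipped, but the cursor still advances so restarts neither reprocess nor stall. A minimal sketch with stand-in callables, not AutoData's actual implementation:

```python
def index_collection(docs, process, checkpoint):
    """Advance a persisted cursor; one bad document never stalls the stream."""
    indexed, skipped = 0, 0
    for doc in docs:
        try:
            process(doc)
            indexed += 1
        except Exception:
            skipped += 1          # poison pill: record it and move on
        checkpoint(doc["_ts"])    # cursor advances either way; restarts resume here
    return indexed, skipped

state = {"cursor": None}
docs = [{"_ts": 1, "body": "ok"}, {"_ts": 2}, {"_ts": 3, "body": "ok"}]
index_collection(docs, lambda d: d["body"], lambda ts: state.update(cursor=ts))
# → (2, 1); the malformed middle document is skipped, state["cursor"] ends at 3
```

The key design choice is checkpointing after every document rather than per batch, which bounds rework after a crash to at most one document.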

🌟

This expands AutoData from a file ingestion platform to a continuously indexed organizational memory system — automatically indexing AI agent conversation histories, workflow execution results, and action definitions as they are created, making them immediately searchable through the same GraphRAG pipeline.

Part XI

AutoData Advantages Over Competitors

AutoData's advantages over each competitor are focused on the ingestion, transformation, and governance layers that competitors leave incomplete or unbuilt.

🏢

Microsoft GraphRAG — AutoData adds multimedia, Excel, and production ops

Breadth of enterprise ingestion (multimedia, XLSM/XLSB, 30+ code formats) + application-layer KB governance + production async job lifecycle — vs. .txt/.csv/.json only with no RBAC and no job queue.

☁️

AWS Bedrock — AutoData removes the LLM lock and ingestion ceiling

Ingestion flexibility beyond S3-only, multimedia processing, deeper spreadsheet handling (XLSM/XLSB), LLM provider choice across multiple vendors, and deterministic graph fallback — vs. Claude 3 Haiku locked, 50 MB doc limit, JPEG/PNG 3.75 MB limit.

🕸️

Neo4j GraphRAG — AutoData owns the full stack above the graph layer

Owns the complete ingestion stack above Neo4j — buyers get multimedia, spreadsheet coverage with documented safety bounds, RBAC, and async jobs without assembling it themselves. AutoData uses Neo4j as its backend and inherits its traversal quality.

🔧

LlamaIndex & LangChain — AutoData replaces assembly with one system

Multimedia GraphRAG; spreadsheet ingestion with ZIP-bomb detection (80× expansion cap, 50,000 member limit, sharedStrings.xml guard), cgroup-aware memory scaling, and POSIX cross-process locking — running entirely within your infrastructure at zero per-page cost; application-layer RBAC; and a production job lifecycle — all in one system. The default LangChain/LlamaIndex spreadsheet path via Unstructured cannot parse XLSM at all (Issue #4047). LlamaParse lists XLSM/XLSB but is a paid cloud API ($0.003/page) with no published safety controls, raising data sovereignty concerns for enterprise financial content.

🔵

Weaviate / Verba — AutoData adds the pipeline Verba has not built

Complete ingestion stack, Neo4j graph layer, and application-layer RBAC on top of Weaviate's database strengths — Verba's GraphRAG support request has been unresolved since 2024.

🏛️

Palantir Foundry & AIP — AutoData delivers self-serve access without the enterprise contract

Self-serve ingestion-to-GraphRAG at a fraction of the cost, accessible to mid-market buyers without a multi-year platform commitment or ~$1M+ minimum contract.

📊

Databricks Mosaic AI — AutoData works without a lakehouse investment

Purpose-built file ingestion, graph extraction, and heterogeneous enterprise content handling without requiring a lakehouse infrastructure investment. No native GraphRAG extraction is documented for Databricks.

🔍

GraphAware Hume — AutoData is a RAG system, not a visualization tool

Full GraphRAG pipeline with vector retrieval, document ingestion at scale, and multimedia processing — not just graph visualization. Hume is an investigation tool, not a RAG system.

🌐

Graphwise & Stardog — AutoData needs no SPARQL expertise or professional services

Self-serve ingestion across 30+ formats with no SPARQL/OWL expertise required — accessible to enterprise engineering teams today without professional services.

📄

Vectara — AutoData adds Excel, graph retrieval, and no file-size ceiling

Full spreadsheet coverage across all 6 variants, multimedia ingestion, and knowledge graph retrieval — vs. no Excel support, 10 MB per-file limit, and no graph-style retrieval.

🔎

Glean — AutoData provides custom corpus ingestion without silent truncation

Custom corpus ingestion with no size truncation, multimedia-to-graph processing, and explicit graph traversal — vs. 64 MB metadata-only limit, ~16.875 MB silent text truncation, and no permissioned multi-KB GraphRAG model.

Part XII

The Eight Differentiating Capabilities

1

Multimedia Ingestion Into the GraphRAG Pipeline

Audio and video files transcribed via Gemini File API, standalone images vectorized or summarized, document-embedded images processed — all flowing into both Weaviate vector storage and Neo4j graph extraction. Not documented as an end-to-end capability for any reviewed competitor. A first-mover capability in the enterprise GraphRAG market.

2

Spreadsheet Ingestion with Documented Safety Bounds Across All Six Major Formats

AutoData's spreadsheet engine enforces an 80× ZIP expansion ratio cap, a 50,000 ZIP member cap, and a specific guard on xl/sharedStrings.xml — the single most common cause of openpyxl memory exhaustion on large workbooks. Container RAM limits are read at runtime from the Linux cgroup filesystem (v1 and v2), making processing caps self-adjusting in containerized deployments. A cross-process POSIX lock serializes concurrent workers. None of these controls are documented for any competitor.

On format coverage: Unstructured does not parse XLSM at all (GitHub Issue #4047 — "not recognized"). AWS Bedrock documents only .xls and .xlsx. Microsoft GraphRAG documents no Excel format. LlamaParse lists XLSM and XLSB as "Also supported" but is a paid cloud API ($0.003/page) with no published safety controls — a data sovereignty concern for enterprise financial workbooks.

AutoData also produces five distinct chunk types per workbook (workbook summary, sheet summary, row data, notes, side-notes) and builds Neo4j graph nodes via rule-based logic at zero LLM cost. LlamaParse produces flat text output with no graph construction.

3

Deterministic Graph Construction at Zero LLM Cost

For spreadsheets, rule-based NEXT edges and sheet concept nodes are built without any LLM call. This eliminates graph extraction cost for a large class of enterprise files — a direct cost advantage vs. Microsoft GraphRAG's $0.34 per 30,000 words with GPT-4o-mini (with GPT-4o substantially more expensive per the same source).

4

Application-Layer Multi-Tenant KB with RBAC

Personal, kb, and all scoping modes; four-level role hierarchy; fuzzy KB name resolution with ranked disambiguation; concurrent multi-KB fan-out search with LLM synthesis. Not documented as a built-in capability for Microsoft GraphRAG, Neo4j GraphRAG, LlamaIndex, or LangChain.

5

Multi-LLM Provider Support

Azure OpenAI, standard OpenAI, Anthropic Claude, Google Gemini, Perplexity, DeepSeek, and Grok — switchable via a single configuration string, with new models continuously added. AWS Bedrock locks graph extraction to Claude 3 Haiku with no override per official documentation. Microsoft GraphRAG supports OpenAI API and Azure OpenAI but does not offer built-in Claude, Gemini, Perplexity, DeepSeek, or Grok connectivity.

6

Production Safety Hardening

ZIP-bomb detection, prompt-injection scrubbing, binary file detection, cgroup-aware RAM scaling, cross-process POSIX lock, ghost job prevention, and staging cleanup. Not found in the reviewed public documentation for open-source GraphRAG frameworks.

7

Async Job Queue with Per-File Status and Readiness Probes

Redis priority queuing, MongoDB job tracking with AVAILABLE/PARTIAL_AVAILABLE/FAILED state machine, UUID-specific readiness probing for both backends, and exponential backoff retry. Microsoft GraphRAG and Neo4j GraphRAG Python are synchronous-only tools with no documented job lifecycle management.
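The state machine and the backoff probe can be sketched in a few lines. The three state names come from the text above; the readiness semantics (both backends indexed vs. one vs. none) and the function signatures are assumptions about how such a system would behave:

```python
import time

def job_status(backend_ready):
    """Map per-backend readiness (e.g. {'weaviate': True, 'neo4j': False})
    to the job-level state machine: AVAILABLE when every backend has
    indexed the document, PARTIAL_AVAILABLE when only some have, and
    FAILED when none have."""
    ok = sum(1 for ready in backend_ready.values() if ready)
    if ok == len(backend_ready):
        return "AVAILABLE"
    return "PARTIAL_AVAILABLE" if ok else "FAILED"

def probe_until_ready(probe, attempts=5, base_delay=0.01):
    """Exponential-backoff readiness probe for one document UUID:
    retry with doubling delays until the probe succeeds or attempts
    are exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        time.sleep(base_delay * 2 ** attempt)
    return False
```

The PARTIAL_AVAILABLE state is the operationally important one: it tells a user that vector search works on a document even if graph indexing is still pending or failed.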

8

Unified Pipeline Across 30+ Format Categories

Documents, all six spreadsheet variants, JSON with deep structure chunking, code and configuration files with language-aware splitting, and multimedia — all in one operational pipeline with consistent job tracking, access control, and graph extraction. No reviewed competitor covers this range without stitching multiple tools together.

Part XIII

Summary Comparison

| Vendor | Primary Strength | Format Breadth | Multimedia GraphRAG | Spreadsheet Coverage | App-Layer KB RBAC | Async Jobs | Production Safety |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AutoData | End-to-end GraphRAG operating layer | ✅ Broadest | ✅ Audio/video/image | ✅ All 6 variants; safety bounds; graph output | ✅ Four-level; fan-out | ✅ Full state machine | ✅ ZIP-bomb; injection; cgroup |
| Microsoft GraphRAG | Corpus summarization | .txt/.csv/.json only | Not documented | Not documented | Not documented | Not documented | Not documented |
| AWS Bedrock GraphRAG | Managed AWS infra | S3-only; 50 MB docs | Not documented | .xls/.xlsx only | IAM-level only | Not documented | Managed |
| Neo4j GraphRAG lib | Graph traversal quality | Exp. PDF; all else BYO | Not documented | Not documented | Not built in | Not built in | Not documented |
| LlamaIndex / LangChain | Framework ecosystem | Broad via assembled loaders | Not documented | Flat text; no graph construction; no safety controls documented | Not built in | Not built in | Not documented |
| Weaviate / Verba | Vector database | No parsing layer | CLIP at DB layer only | Not documented | No app layer | Not documented | DB-managed |
| Vectara | Managed vector search | No Excel; 10 MB limit | Not documented | Not documented | Not documented | Not documented | Managed |
| Glean | Workplace search | 64 MB limit; strips embedded media | Meeting transcripts via connectors only | XLSX via M365 only | Source ACLs; custom datasource APIs | Not documented | Managed |
| Palantir | Operational ontology AI | Enterprise-managed | Via prof. services | Via prof. services | Object-level | Enterprise | Enterprise |
| Databricks | Data lakehouse | External parsers req. | Not documented | Not documented | Platform governance | Platform | Managed |
| Graphwise / Stardog | Semantic standards | Standards-focused | Not documented | Not documented | Semantic governance | Not documented | Not documented |
Part XIV

The Investment Thesis

The Market Opening

Enterprise GraphRAG programs fail most often not at the retrieval or reasoning layer, but at the ingestion layer. Audio and video content is ignored entirely. Files are too large or in unsupported formats. Spreadsheets are flattened to meaningless text. Access control is bolted on as an afterthought. Partial failures leave users with no visibility into what succeeded and what did not.

AutoData solves all of these problems in one platform, while competitors either do not address them or require expensive professional services engagements to work around them.

The Competitive Moat

The multimedia GraphRAG pipeline, the spreadsheet ingestion engine with documented safety bounds, and the production safety hardening represent a meaningful engineering moat. These capabilities took significant investment to build correctly: MIME-type handling in slim containers, adaptive vectorize-versus-summarize strategy selection, memory-safe spreadsheet processing with named thresholds (80× ZIP expansion cap, 50,000-member cap, sharedStrings.xml guard), cgroup-aware scaling, POSIX cross-process locking, and token-aware chunking are not features that can be added quickly to a research-grade library.
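Of the hardening features above, cgroup-aware scaling is easy to make concrete. This is a minimal sketch assuming cgroup v2 (the `/sys/fs/cgroup/memory.max` interface file); the 512 MiB per-worker budget and 8-worker cap are illustrative numbers, not AutoData's actual tuning:

```python
def cgroup_memory_limit(path="/sys/fs/cgroup/memory.max"):
    """Read the cgroup v2 memory limit for this container.
    Returns None when unbounded ('max') or when no cgroup file exists."""
    try:
        with open(path) as f:
            raw = f.read().strip()
    except OSError:
        return None
    return None if raw == "max" else int(raw)

def scale_workers(limit_bytes, per_worker_bytes=512 * 1024**2, hard_cap=8):
    """Size the worker pool to the container's memory budget rather
    than the host's, so a 2 GiB container does not spawn a pool sized
    for a 256 GiB host. Budget numbers here are assumptions."""
    if limit_bytes is None:
        return hard_cap
    return max(1, min(hard_cap, limit_bytes // per_worker_bytes))
```

Reading the cgroup file instead of total system RAM is the whole point: inside a memory-limited container, `os.sysconf`-style host queries wildly overstate the safe working set.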

The deterministic graph construction path for spreadsheets, which eliminates LLM costs for a large class of enterprise files, is a particularly defensible cost advantage.

The Addressable Market

AutoData's multimedia capability addresses any organization where knowledge is captured in calls, meetings, training content, and visual artifacts — which increasingly means every enterprise. The spreadsheet capability extends the wedge into finance, operations, procurement, compliance, planning, and back-office workflows. The unified pipeline across documents, code, configuration, and media addresses engineering and technical organizations that need to make their entire knowledge estate searchable.

AutoData is the operating layer that makes enterprise GraphRAG deployable over messy, real-world content — productizing the ingestion, transformation, access control, and reliability infrastructure that the market has left to buyers to build themselves.

Appendix

Primary Source Reference Index

Sources grouped by vendor and topic. Each chip is a direct link to the primary document. Where a capability is described as "not documented in reviewed materials," this reflects absence of documentation rather than a confirmed absence of the capability.

🏢 Microsoft GraphRAG (9 sources)

☁️ AWS Bedrock GraphRAG (6 sources: GraphRAG capabilities, announcements)

🕸️ Neo4j GraphRAG (3 sources)

🔧 LlamaIndex & LangChain (6 sources)

📋 Vectara (3 sources)

🔎 Glean (6 sources)

🔵 Weaviate & Verba (1 source: GraphRAG support)
📋

Evidence methodology: Most competitor claims are tied to the primary sources above. Some capability comparisons are based on the absence of documentation in reviewed public materials rather than explicit vendor denials, and are described as "not documented in reviewed materials" throughout the document. The competitive landscape in this market moves quickly; investors should verify specific competitor claims against current vendor documentation before use in external materials.