Welcome to SoMark
SoMark converts PDFs, PPTs, images, and many other document formats into machine-readable structured output with high accuracy, high speed, and strong cost efficiency, providing high-quality data for LLM training and RAG applications.
99% OCR Accuracy
Industry-leading recognition accuracy with coordinate traceback to pinpoint every element in the source document.
100 Pages in 5 Seconds
High-speed parsing with horizontally scalable cluster deployment for large-scale batch workloads.
Pay As You Go
Usage-based billing or one-time licensing. Private deployment starts from a single RTX 3090 GPU.
21 Component Types
Detects headings, tables, formulas, images, chemical structures, seals, QR codes, and 14 more element types.
Multiple Output Formats
Outputs Markdown and JSON, ready for LLM training pipelines and RAG applications.
Broad Document Coverage
Supports research papers, reports, whitepapers, contracts, scanned books, government files, and more.
Supported file formats
pdf png jpg jpeg bmp tiff jp2 dib ppm pgm pbm gif heic heif webp xpm tga dds xbm doc docx ppt pptx xlsx xlsm xls
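Before submitting a batch, it can help to filter out files whose extension is not in the list above. The following is a minimal sketch of such a pre-check; `SUPPORTED_EXTENSIONS`, `is_supported`, and `partition_by_support` are hypothetical helper names, not part of SoMark, and the check looks only at the file extension, not at the file contents.

```python
from pathlib import Path

# Extensions copied from the supported file formats list above.
SUPPORTED_EXTENSIONS = frozenset(
    "pdf png jpg jpeg bmp tiff jp2 dib ppm pgm pbm gif heic heif webp "
    "xpm tga dds xbm doc docx ppt pptx xlsx xlsm xls".split()
)


def is_supported(path: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the list."""
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS


def partition_by_support(paths):
    """Split paths into (supported, unsupported), preserving input order."""
    supported, unsupported = [], []
    for p in paths:
        (supported if is_supported(p) else unsupported).append(p)
    return supported, unsupported


if __name__ == "__main__":
    ok, skipped = partition_by_support(["report.PDF", "slides.pptx", "notes.txt"])
    print("supported:", ok)
    print("skipped:", skipped)
```

Only the exact extensions listed are matched, so a variant spelling such as `tif` is treated as unsupported by this sketch.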
Recognized document elements
SoMark can recognize these 21 document element types:

| Category | Elements |
|---|---|
| Text structure | Title (`title`), text block (`text`), header (`header`), footer (`footer`), footnote (`footnote`) |
| Figures and tables | Figure (`figure`), figure caption (`figure_caption`), table (`table`), table caption (`table_caption`) |
| Specialized content | Equation (`equation`), chemical structure (`cs`), chemical equation (`cs_equation`), code block (`code`) |
| Navigation and layout | Sidebar (`sider`), table of contents (`cate`), TOC entry (`cate_item`) |
| Education and structured items | Choice item (`choice`), fill-in-the-blank (`blank`), reference (`reference`) |
| Special elements | QR code (`qrcode`), stamp (`stamp`) |
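When post-processing parsed output, the element identifiers from the table can be grouped by category, for example to keep only figures and tables or to summarize what a document contains. The sketch below encodes the table as a lookup. This page does not describe how these identifiers appear in SoMark's JSON output, so the function takes a plain list of label strings, and all names in the code (`ELEMENT_CATEGORIES`, `CATEGORY_BY_LABEL`, `count_by_category`) are hypothetical.

```python
from collections import Counter

# Category -> element identifiers, transcribed from the table above.
ELEMENT_CATEGORIES = {
    "Text structure": ["title", "text", "header", "footer", "footnote"],
    "Figures and tables": ["figure", "figure_caption", "table", "table_caption"],
    "Specialized content": ["equation", "cs", "cs_equation", "code"],
    "Navigation and layout": ["sider", "cate", "cate_item"],
    "Education and structured items": ["choice", "blank", "reference"],
    "Special elements": ["qrcode", "stamp"],
}

# Reverse lookup: element identifier -> category name.
CATEGORY_BY_LABEL = {
    label: category
    for category, labels in ELEMENT_CATEGORIES.items()
    for label in labels
}


def count_by_category(labels):
    """Count labels per category; identifiers not in the table go to 'Unknown'."""
    counts = Counter(CATEGORY_BY_LABEL.get(label, "Unknown") for label in labels)
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical list of labels, as might be collected from one parsed page.
    page_labels = ["title", "text", "table", "table_caption", "qrcode"]
    print(count_by_category(page_labels))
```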

