C++ binding implementation using ONNX Runtime #1255

@moisnx

C++ Binding for Magika

Motivation

I needed fast, native file type detection for my terminal editor written in C++. While Magika has excellent Python and Rust implementations, there wasn't a C++ binding available. Rather than shell out to the CLI or embed Python, I built native C++ bindings using ONNX Runtime.

What I've Built

A complete C++ implementation that:

  • Matches Python accuracy - Uses the same ONNX model with identical feature extraction
  • High performance - ~5ms inference, 1000+ files/sec throughput
  • Clean API - PIMPL pattern, exception-safe, follows modern C++ practices
  • Full CLI - Feature parity with Rust CLI (recursive, JSON output, colors, etc.)
  • Cross-platform - Tested on Linux and macOS; Windows support is planned
  • Well documented - Comprehensive README, API docs, and examples
  • Production ready - Already integrated into my editor

Implementation Highlights

Architecture:

  • ONNX Runtime C++ API for inference
  • Feature extraction ported from Python (lstrip/rstrip, padding, block_size)
  • PIMPL pattern to hide ONNX dependencies from public headers
  • CMake build system with proper install targets
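
To make the feature-extraction bullet concrete, here is a minimal sketch of the strip-and-pad step. It is not the code from the linked branch: the function names, the padding token value (256), and the idea of passing sizes as arguments are all assumptions for illustration, and the real sizes would come from the model's configuration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed padding value: one past the largest byte, so it never collides with data.
constexpr std::int32_t kPaddingToken = 256;

// Same whitespace set that Python's bytes.strip() removes by default.
inline bool is_space(std::uint8_t b) {
    return b == ' ' || (b >= '\t' && b <= '\r');  // space, \t \n \v \f \r
}

// Head features: inspect the first block_size bytes, drop leading whitespace
// (lstrip), keep up to beg_size bytes, pad on the right.
std::vector<std::int32_t> head_features(const std::vector<std::uint8_t>& content,
                                        std::size_t beg_size, std::size_t block_size) {
    const std::size_t limit = std::min(block_size, content.size());
    std::size_t start = 0;
    while (start < limit && is_space(content[start])) ++start;

    std::vector<std::int32_t> out(beg_size, kPaddingToken);
    const std::size_t n = std::min(beg_size, limit - start);
    for (std::size_t i = 0; i < n; ++i) out[i] = content[start + i];
    return out;
}

// Tail features: inspect the last block_size bytes, drop trailing whitespace
// (rstrip), keep up to end_size bytes, pad on the left.
std::vector<std::int32_t> tail_features(const std::vector<std::uint8_t>& content,
                                        std::size_t end_size, std::size_t block_size) {
    const std::size_t first = content.size() - std::min(block_size, content.size());
    std::size_t stop = content.size();
    while (stop > first && is_space(content[stop - 1])) --stop;

    std::vector<std::int32_t> out(end_size, kPaddingToken);
    const std::size_t n = std::min(end_size, stop - first);
    assert(n <= end_size);  // guards the index arithmetic below
    for (std::size_t i = 0; i < n; ++i) out[end_size - n + i] = content[stop - n + i];
    return out;
}
```

An empty file yields vectors made entirely of the padding token, which is one way the empty-file edge case can be handled without a special branch.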

Key Files:

cpp/
├── lib/                    # Core library
│   ├── include/magika/     # Public headers
│   │   ├── magika.hpp
│   │   └── types.hpp
│   └── src/                # Implementation
│       └── magika.cpp
├── cli/                    # Command-line tool
│   └── main.cpp
├── examples/               # Usage examples
└── README.md               # Full documentation
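
The split between public headers and src/ is where the PIMPL pattern would show up. The sketch below compresses both halves into one file to stay self-contained. The `sketch::Detector` class and its members are hypothetical stand-ins, not the contents of magika.hpp.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// ---- What a public header could expose: no inference-runtime includes ----
namespace sketch {

class Detector {
public:
    explicit Detector(std::string model_dir);
    ~Detector();                               // defined where Impl is complete
    Detector(Detector&&) noexcept;
    Detector& operator=(Detector&&) noexcept;

    const std::string& model_dir() const;

private:
    struct Impl;                               // opaque to callers
    std::unique_ptr<Impl> impl_;
};

}  // namespace sketch

// ---- What the matching .cpp file could contain ----
namespace sketch {

struct Detector::Impl {
    std::string model_dir;
    // An inference session and its buffers would live here, so the runtime's
    // headers are needed only by this translation unit.
};

Detector::Detector(std::string model_dir)
    : impl_(std::make_unique<Impl>(Impl{std::move(model_dir)})) {}
Detector::~Detector() = default;
Detector::Detector(Detector&&) noexcept = default;
Detector& Detector::operator=(Detector&&) noexcept = default;

const std::string& Detector::model_dir() const {
    assert(impl_ && "Detector used after being moved from");
    return impl_->model_dir;
}

}  // namespace sketch
```

Declaring the destructor and move operations in the header and defaulting them in the source file is what lets `std::unique_ptr` hold an incomplete type.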

Testing:

  • Validated against Python implementation on tests_data/
  • Matches detection results for all test files
  • Handles edge cases (empty files, binary files, large files)

Example Usage

Library:

#include <iostream>
#include <magika/magika.hpp>

int main() {
    // Load the model from its directory, then classify a single file.
    magika::Magika detector("/path/to/models/standard_v3_3");
    auto result = detector.identify_path("test.py");

    std::cout << "Type: " << result.content_type << "\n"    // "python"
              << "MIME: " << result.mime_type << "\n"        // "text/x-python"
              << "Group: " << result.group << "\n"           // "code"
              << "Confidence: " << result.score << "\n";     // 0.998
}

CLI:

$ build/cli/magika -r ~/projects --json 
    142 python
     89 javascript
     56 cpp
     23 markdown

Integration Example

Already working in my terminal editor:

// Auto-detect file type on load
auto result = detector.identify_path(filename);
if (result.content_type == "python") {
    enable_python_highlighting();
}
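
As more languages are added, a chain of `if` checks like the one above could be replaced by a lookup table. This is a sketch only: the `Syntax` enum, `syntax_for`, and the 0.5 confidence cutoff are hypothetical and not part of the binding. The label strings are the ones that appear in the examples in this issue.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

enum class Syntax { PlainText, Python, Cpp, JavaScript, Markdown };

// Map a detected content type to a highlighter, falling back to plain text
// for unknown labels or low-confidence detections.
Syntax syntax_for(const std::string& content_type, double score,
                  double min_score = 0.5) {
    assert(score >= 0.0 && score <= 1.0);  // scores are treated as probabilities
    if (score < min_score) return Syntax::PlainText;

    static const std::unordered_map<std::string, Syntax> table = {
        {"python", Syntax::Python},
        {"cpp", Syntax::Cpp},
        {"javascript", Syntax::JavaScript},
        {"markdown", Syntax::Markdown},
    };
    const auto it = table.find(content_type);
    return it == table.end() ? Syntax::PlainText : it->second;
}
```

An editor could call `syntax_for(result.content_type, result.score)` once on load and switch on the returned enum.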

Questions for Maintainers

Before submitting a full PR, I'd appreciate guidance on:

  1. Interest Level: Would you accept a C++ binding as an official part of Magika?

  2. Dependency Management: How should ONNX Runtime be handled?

    • Current: User provides path via CMake (-DONNXRUNTIME_DIR=...)
    • Alternative: FetchContent to auto-download
    • Alternative: System package (apt-get, brew, vcpkg)
  3. Project Structure:

    • Should it be in cpp/ directory (like python/, js/)?
    • Or separate repo initially?
  4. CI/Testing:

    • Add GitHub Actions for C++ builds (Linux, macOS, Windows)?
    • Integration tests comparing against Python output?
    • What coverage is expected?
  5. API Design: Any changes needed to match project conventions?

    • Current: Magika class with identify_path(), identify_bytes()
    • Exception-based error handling vs. Result types?
  6. Documentation:

    • Is the current README sufficient?
    • Should API docs be generated (Doxygen)?
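
For question 5, the two error-handling styles do not have to be exclusive. The sketch below (C++17) shows a result-returning function with a thin throwing wrapper on top. Every name here is hypothetical, and the function body is a placeholder that runs no model.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

struct Detection { std::string content_type; double score; };
struct Error { std::string message; };
using Result = std::variant<Detection, Error>;

// Style A: failure travels through the return value.
Result identify_checked(const std::string& path) {
    if (path.empty()) return Error{"empty path"};
    return Detection{"unknown", 0.0};  // placeholder result, no inference here
}

// Style B: the same operation, with failures converted to exceptions.
Detection identify_or_throw(const std::string& path) {
    Result r = identify_checked(path);
    if (const auto* err = std::get_if<Error>(&r)) {
        throw std::runtime_error(err->message);
    }
    assert(std::holds_alternative<Detection>(r));
    return std::get<Detection>(std::move(r));
}
```

Keeping the result-returning form as the primitive lets callers that build with exceptions disabled use the library, while everyone else keeps the simpler throwing call.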

Why This Adds Value

  • Performance-critical applications: Game engines, embedded systems, high-throughput servers
  • Native integration: Editors (Vim, Emacs plugins), file managers, backup tools
  • Ecosystem gap: Go and JavaScript already have bindings, and C++ is a natural fit alongside Rust

Preview

Working code available at: https://github.com/moisnx/magika/tree/main/cpp

Next Steps

If you're interested, I can:

  1. Clean up any remaining issues based on feedback
  2. Add comprehensive CI workflows
  3. Submit a formal PR with detailed changelog
  4. Write integration guide for common C++ build systems

Looking forward to your thoughts! Happy to hop on a call to discuss if helpful.
