diff --git a/_docs/index.md b/_docs/index.md index 4bdfa1be..275c7947 100644 --- a/_docs/index.md +++ b/_docs/index.md @@ -38,6 +38,13 @@ KafScale keeps brokers stateless and moves processing into add-on services, so y - [Storage format](/storage-format/) - [S3 health](/s3-health/) +## Large File Support (LFS) + +- [LFS Proxy](/lfs-proxy/) — Claim-check proxy for large binary payloads via S3 + Kafka +- [LFS Helm Deployment](/lfs-helm/) — Kubernetes deployment and configuration +- [LFS Client SDKs](/lfs-sdks/) — Java, Python, JS, and browser SDKs +- [LFS Demos](/lfs-demos/) — Runnable demos from local IDoc processing to full Kubernetes pipelines + ## Processors - [Iceberg Processor](/processors/iceberg/) diff --git a/_docs/lfs-demos.md b/_docs/lfs-demos.md new file mode 100644 index 00000000..7ed2d4d7 --- /dev/null +++ b/_docs/lfs-demos.md @@ -0,0 +1,238 @@ +--- +layout: doc +title: LFS Demos +description: Runnable demos for the KafScale Large File Support system — from local IDoc processing to full Kubernetes pipelines. +permalink: /lfs-demos/ +nav_title: LFS Demos +nav_order: 4 +nav_group: LFS +--- + + + +# LFS Demos + +KafScale ships with a set of runnable demos that exercise the LFS pipeline end-to-end. They range from a lightweight local demo (MinIO only, no cluster) to full Kubernetes deployments with industry-specific content-explosion patterns. + +## Quick reference + +| Make target | What it runs | Needs cluster? | +|---|---|---| +| `make lfs-demo` | Baseline LFS proxy flow | Yes (kind) | +| `make lfs-demo-idoc` | SAP IDoc explode via LFS | No (MinIO only) | +| `make lfs-demo-medical` | Healthcare imaging (E60) | Yes (kind) | +| `make lfs-demo-video` | Video streaming (E61) | Yes (kind) | +| `make lfs-demo-industrial` | Industrial IoT mixed payloads (E62) | Yes (kind) | +| `make e72-browser-demo` | Browser SPA uploads (E72) | Yes (kind) | + +All demos are environment-driven — override any setting via env vars without touching code. 
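As a minimal sketch of the claim-check record that all of these demos exercise, the helper below builds the pointer envelope a Kafka consumer receives instead of the raw blob. Field names mirror the sample envelope shown in the IDoc demo; the helper itself is illustrative, not part of any KafScale SDK.

```python
import hashlib
import json


def make_envelope(bucket: str, key: str, payload: bytes, content_type: str) -> str:
    """Build the compact JSON pointer record published to Kafka in place of
    a large blob. Field names follow the IDoc demo's sample envelope; this
    is a hypothetical helper, not an SDK API."""
    return json.dumps({
        "kfs_lfs": 1,
        "bucket": bucket,
        "key": key,
        "content_type": content_type,
        "size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    })


envelope = make_envelope("kafscale", "lfs/demo/0/0-sample.bin",
                         b"example payload", "application/octet-stream")
print(envelope)
```

A consumer receiving such a record fetches `key` from `bucket`, re-hashes the bytes, and compares the digest against `sha256` before trusting the payload.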
+ +--- + +## Baseline LFS demo + +```bash +make lfs-demo +``` + +The baseline demo bootstraps a kind cluster, deploys MinIO and the LFS proxy, then runs the full claim-check flow: + +1. Build and load the `kafscale-lfs-proxy` image into kind +2. Deploy broker, etcd, MinIO, and LFS proxy into the `kafscale-demo` namespace +3. Create a demo topic and upload binary blobs via the LFS proxy HTTP API +4. Verify pointer envelopes arrive in Kafka and blobs exist in S3 + +| Variable | Default | Description | +|---|---|---| +| `LFS_DEMO_TOPIC` | `lfs-demo-topic` | Kafka topic for pointer records | +| `LFS_DEMO_BLOB_SIZE` | `524288` (512 KB) | Size of each test blob | +| `LFS_DEMO_BLOB_COUNT` | `5` | Number of blobs to upload | +| `LFS_DEMO_TIMEOUT_SEC` | `120` | Test timeout | + +--- + +## IDoc explode demo (local, no cluster) + +```bash +make lfs-demo-idoc +``` + +This is the fastest way to see the LFS pipeline in action. It only needs a local MinIO container — no Kubernetes cluster required. + +The demo walks through the complete data flow that the [LFS Proxy](/lfs-proxy/) performs in production: + +**Step 1 — Blob upload.** The demo uploads a realistic SAP ORDERS05 IDoc XML (a purchase order with 3 line items, 3 business partners, 3 dates, and 2 status records) to MinIO, simulating what the LFS proxy does when it receives a large payload. + +**Step 2 — Envelope creation.** A KafScale LFS envelope is generated — the compact JSON pointer record that Kafka consumers receive instead of the raw blob: + +```json +{ + "kfs_lfs": 1, + "bucket": "kafscale", + "key": "lfs/idoc-demo/idoc-inbound/0/0-idoc-sample.xml", + "content_type": "application/xml", + "size": 2706, + "sha256": "96985f1043a285..." 
+} +``` + +**Step 3 — Resolve and explode.** The `idoc-explode` processor reads the envelope, resolves the blob from S3 via `pkg/lfs.Resolver`, validates the SHA-256 checksum, then parses the XML and routes segments to topic-specific streams: + +``` +Segment routing: + E1EDP01, E1EDP19 -> idoc-items (order line items) + E1EDKA1 -> idoc-partners (business partners) + E1STATS -> idoc-status (processing status) + E1EDK03 -> idoc-dates (dates/deadlines) + EDI_DC40, E1EDK01 -> idoc-segments (all raw segments) + (root) -> idoc-headers (IDoc metadata) +``` + +**Step 4 — Topic streams.** Each output file maps to a Kafka topic. Routed records carry their child fields as a self-contained JSON object: + +```json +{"name":"E1EDP01","path":"IDOC/E1EDP01","fields":{ + "POSEX":"000010","MATNR":"MAT-HYD-4200", + "ARKTX":"Hydraulic Pump HP-4200","MENGE":"5", + "NETWR":"12500.00","WAERS":"EUR"}} +``` + +```json +{"name":"E1EDKA1","path":"IDOC/E1EDKA1","fields":{ + "PARVW":"AG","NAME1":"GlobalParts AG", + "STRAS":"Industriestr. 42","ORT01":"Stuttgart", + "LAND1":"DE"}} +``` + +The sample IDoc produces 94 records across 6 topics. + +--- + +## Industry demos + +The three industry demos build on the baseline flow and demonstrate the **content explosion pattern** — a single large upload that produces multiple derived topic streams for downstream analytics. + +### Medical imaging (E60) + +```bash +make lfs-demo-medical +``` + +Simulates a radiology department uploading DICOM imaging files (CT/MRI scans, whole-slide pathology images). A single scan upload explodes into: + +| Topic | Content | LFS blob? | +|---|---|---| +| `medical-images` | Original DICOM blob pointer | Yes | +| `medical-metadata` | Patient ID, modality, study date | No | +| `medical-audit` | Access timestamps, user actions | No | + +Demonstrates checksum integrity validation and audit trail logging relevant to healthcare compliance scenarios. 
+ +### Video streaming (E61) + +```bash +make lfs-demo-video +``` + +Simulates a media platform ingesting large video files. A single video upload explodes into: + +| Topic | Content | LFS blob? | +|---|---|---| +| `video-raw` | Original video blob pointer | Yes | +| `video-metadata` | Codec, duration, resolution, bitrate | No | +| `video-frames` | Keyframe timestamps and S3 references | No | +| `video-ai-tags` | Scene detection, object labels | No | + +Uses HTTP streaming upload for memory-efficient transfer of multi-gigabyte files. + +### Industrial IoT (E62) + +```bash +make lfs-demo-industrial +``` + +Simulates a factory floor with **mixed payload sizes** flowing through the same Kafka interface — small telemetry readings alongside large thermal/visual inspection images: + +| Topic | Content | LFS blob? | Typical size | +|---|---|---|---| +| `sensor-telemetry` | Real-time sensor readings | No | ~1 KB | +| `inspection-images` | Thermal/visual inspection photos | Yes | ~200 MB | +| `defect-events` | Anomaly detection alerts | No | ~2 KB | +| `quality-reports` | Aggregated quality metrics | No | ~10 KB | + +Demonstrates automatic routing based on the `LFS_BLOB` header — small payloads flow through Kafka directly, large payloads are offloaded to S3. + +--- + +## SDK demo applications + +### E70 — Java SDK + +Located in `examples/E70_java-lfs-sdk-demo/`. Demonstrates the Java LFS producer SDK uploading files via the HTTP API, consuming pointer records from Kafka, and resolving blobs from S3. + +```bash +cd examples/E70_java-lfs-sdk-demo +make run-all # builds SDK, starts port-forwards, runs demo +``` + +Key env vars: `LFS_HTTP_ENDPOINT`, `LFS_TOPIC`, `KAFKA_BOOTSTRAP`, `LFS_PAYLOAD_SIZE`. + +### E71 — Python SDK + +Located in `examples/E71_python-lfs-sdk-demo/`. 
Demonstrates the Python LFS SDK with three payload size presets for video upload testing: + +```bash +cd examples/E71_python-lfs-sdk-demo +make run-small # 1 MB +make run-midsize # 50 MB +make run-large # 200 MB +``` + +Requires `pip install -e lfs-client-sdk/python` for the SDK. + +### E72 — Browser SDK + +Located in `examples/E72_browser-lfs-sdk-demo/`. A single-page application that uploads files directly from the browser to the LFS proxy — no backend server required. + +```bash +make e72-browser-demo # local with port-forward +make e72-browser-demo-k8s # in-cluster deployment +``` + +Features drag-and-drop upload, real-time progress, SHA-256 verification, presigned URL download, and an inline video player for MP4 content. + +--- + +## Common environment + +All demos share these defaults: + +| Variable | Default | Description | +|---|---|---| +| `KAFSCALE_DEMO_NAMESPACE` | `kafscale-demo` | Kubernetes namespace | +| `MINIO_BUCKET` | `kafscale` | S3 bucket | +| `MINIO_REGION` | `us-east-1` | S3 region | +| `MINIO_ROOT_USER` | `minioadmin` | MinIO credentials | +| `MINIO_ROOT_PASSWORD` | `minioadmin` | MinIO credentials | +| `LFS_PROXY_IMAGE` | `ghcr.io/kafscale/kafscale-lfs-proxy:dev` | Proxy container image | + +## Related docs + +- [LFS Proxy](/lfs-proxy/) — Architecture and configuration +- [LFS Helm Deployment](/lfs-helm/) — Kubernetes deployment guide +- [LFS Client SDKs](/lfs-sdks/) — SDK API reference for Go, Java, Python, JS diff --git a/_docs/lfs-helm.md b/_docs/lfs-helm.md new file mode 100644 index 00000000..150bc50b --- /dev/null +++ b/_docs/lfs-helm.md @@ -0,0 +1,154 @@ +--- +layout: doc +title: LFS Helm Deployment +description: Deploy and configure the LFS proxy via the KafScale Helm chart. 
+permalink: /lfs-helm/ +nav_title: LFS Helm +nav_order: 2 +nav_group: LFS +--- + + + +# LFS Proxy Helm Deployment + +The LFS Proxy is deployed as part of the KafScale Helm chart and provides: + +- **Kafka Protocol Support**: Transparent claim-check pattern for large messages via Kafka protocol (port 9092) +- **HTTP API**: RESTful endpoint for browser and SDK uploads (port 8080) +- **S3 Storage**: Configurable S3-compatible object storage backend +- **CORS Support**: Configurable cross-origin resource sharing for browser access +- **Metrics**: Prometheus-compatible metrics endpoint (port 9095) + +## Prerequisites + +- Kubernetes 1.24+ +- Helm 3.x +- S3-compatible storage (AWS S3 or MinIO) +- KafScale operator installed + +## Installation + +```bash +# Install with LFS proxy enabled +helm install kafscale deploy/helm/kafscale \ + -f deploy/helm/kafscale/values-lfs-demo.yaml \ + --set lfsProxy.enabled=true +``` + +## Helm values + +### Core settings + +```yaml +lfsProxy: + enabled: true + replicas: 1 + image: + repository: ghcr.io/kafscale/kafscale-lfs-proxy + tag: latest + + # S3 backend + s3: + bucket: kafscale + region: us-east-1 + endpoint: "" # Custom endpoint for MinIO + pathStyle: false # Set true for MinIO + credentialsSecretRef: s3-credentials + + # HTTP API + http: + port: 8080 + corsOrigins: "*" + maxUploadSize: 0 # 0 = unlimited + + # Kafka backend + kafka: + brokers: kafscale-broker:9092 + + # Metrics + metrics: + enabled: true + port: 9095 + serviceMonitor: + enabled: false + prometheusRule: + enabled: false + + # Resources + resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: "1" + memory: 512Mi +``` + +### TLS configuration + +```yaml +lfsProxy: + tls: + enabled: false + certSecretRef: lfs-proxy-tls + kafka: + sasl: + enabled: false + mechanism: SCRAM-SHA-256 + credentialsSecretRef: kafka-sasl +``` + +### Ingress + +```yaml +lfsProxy: + ingress: + enabled: false + className: nginx + host: lfs.example.com + tls: + enabled: false + 
secretName: lfs-tls +``` + +## Monitoring + +When `metrics.serviceMonitor.enabled` is set to `true`, the chart creates a Prometheus `ServiceMonitor` that scrapes the LFS proxy metrics endpoint. + +Key metrics: + +| Metric | Type | Description | +|---|---|---| +| `lfs_uploads_total` | Counter | Total upload operations | +| `lfs_downloads_total` | Counter | Total download operations | +| `lfs_upload_bytes_total` | Counter | Total bytes uploaded | +| `lfs_upload_duration_seconds` | Histogram | Upload latency | +| `lfs_s3_operations_total` | Counter | S3 operations by type | +| `lfs_s3_errors_total` | Counter | S3 operation failures | + +## Local development with Docker Compose + +For local development without Kubernetes: + +```bash +cd deploy/docker-compose +docker compose up -d +``` + +This starts MinIO, a KafScale broker, and the LFS proxy with pre-configured defaults. See `deploy/docker-compose/README.md` for details. diff --git a/_docs/lfs-proxy.md b/_docs/lfs-proxy.md new file mode 100644 index 00000000..500366dd --- /dev/null +++ b/_docs/lfs-proxy.md @@ -0,0 +1,123 @@ +--- +layout: doc +title: LFS Proxy +description: Large File Support proxy for streaming binary payloads via S3 and Kafka. +permalink: /lfs-proxy/ +nav_title: LFS Proxy +nav_order: 1 +nav_group: LFS +--- + + + +# LFS Proxy + +The **LFS (Large File Support) Proxy** enables KafScale to handle large binary payloads — medical images, video files, industrial sensor dumps, SAP IDocs — that exceed typical Kafka message size limits. + +Instead of pushing multi-megabyte blobs through Kafka, the proxy implements the **claim-check pattern**: large payloads are uploaded to S3-compatible storage, and a compact JSON envelope (pointer record) is published to Kafka. Consumers use the envelope to fetch the original object on demand. + +## How it works + +``` +Producer ──▶ LFS Proxy ──▶ S3 (blob) + │ + ▼ + Kafka (pointer envelope) + │ + ▼ +Consumer ◀── LFS SDK ──▶ S3 (fetch blob) +``` + +1. 
**Write path (Kafka protocol)**: The proxy intercepts Produce requests. Records tagged with an `LFS_BLOB` header are rewritten: the payload is uploaded to S3 and the Kafka record is replaced with a JSON envelope containing the S3 key, checksum, and content type. + +2. **Write path (HTTP API)**: Clients can also upload files via the REST API (`POST /v1/topics/{topic}/records`). The proxy uploads the file to S3 and publishes the envelope to Kafka in one operation. + +3. **Read path**: Consumer SDKs (Go, Java, Python, JS) detect LFS envelopes and transparently fetch the original object from S3. + +## Key features + +- **Transparent Kafka proxy** — existing producers work without code changes by adding an `LFS_BLOB` header +- **HTTP upload API** — RESTful endpoint for browser and SDK uploads with OpenAPI spec +- **Checksum verification** — SHA-256, CRC-32, or MD5 integrity checks on upload and download +- **TLS and SASL** — full TLS support for HTTP endpoints and SASL/SCRAM for Kafka backend +- **Prometheus metrics** — upload/download counters, latencies, S3 operation histograms +- **CORS support** — configurable cross-origin headers for browser-based uploads +- **Helm chart** — production-ready Kubernetes deployment via the KafScale Helm chart + +## Data flow + +### Object key format + +S3 objects are stored under a deterministic key: + +``` +lfs/{namespace}/{topic}/{partition}/{offset}-{uuid}.bin +``` + +### Envelope format + +```json +{ + "lfs_version": 1, + "s3_bucket": "kafscale", + "s3_key": "lfs/default/demo-topic/0/42-abc123.bin", + "content_type": "application/octet-stream", + "content_length": 10485760, + "checksum_algo": "sha256", + "checksum": "e3b0c44298fc1c14..." 
+} +``` + +## Configuration + +The LFS proxy is configured via environment variables or CLI flags: + +| Variable | Default | Description | +|---|---|---| +| `LFS_S3_BUCKET` | `kafscale` | S3 bucket for blob storage | +| `LFS_S3_REGION` | `us-east-1` | S3 region | +| `LFS_S3_ENDPOINT` | — | Custom S3 endpoint (for MinIO) | +| `LFS_S3_PATH_STYLE` | `false` | Use path-style S3 addressing | +| `LFS_KAFKA_BROKERS` | `localhost:9092` | Kafka bootstrap servers | +| `LFS_HTTP_ADDR` | `:8080` | HTTP API listen address | +| `LFS_METRICS_ADDR` | `:9095` | Prometheus metrics listen address | +| `LFS_CHECKSUM_ALGO` | `sha256` | Checksum algorithm (`sha256`, `crc32`, `md5`) | +| `LFS_MAX_UPLOAD_SIZE` | `0` (unlimited) | Maximum upload size in bytes | +| `LFS_CORS_ORIGINS` | `*` | Allowed CORS origins | + +## Quick start + +```bash +# Start MinIO + broker + LFS proxy locally +make lfs-demo + +# Upload a file via HTTP +curl -X POST http://localhost:8080/v1/topics/demo-topic/records \ + -F "file=@large-file.bin" + +# Consume the envelope +kafka-console-consumer --topic demo-topic --from-beginning +``` + +## Related docs + +- [LFS Demos](/lfs-demos/) — Runnable demos from local IDoc to full Kubernetes pipelines +- [LFS Helm deployment](/lfs-helm/) — Full Helm chart configuration reference +- [LFS Client SDKs](/lfs-sdks/) — Java, Python, JS, and browser SDKs +- [Iceberg Processor](/processors/iceberg/) — LFS-aware Iceberg sink +- [Architecture](/architecture/) — Overall KafScale architecture diff --git a/_docs/lfs-sdks.md b/_docs/lfs-sdks.md new file mode 100644 index 00000000..d7f64d3b --- /dev/null +++ b/_docs/lfs-sdks.md @@ -0,0 +1,166 @@ +--- +layout: doc +title: LFS Client SDKs +description: Multi-language SDKs for producing and consuming large files via KafScale LFS. +permalink: /lfs-sdks/ +nav_title: LFS SDKs +nav_order: 3 +nav_group: LFS +--- + + + +# LFS Client SDKs + +KafScale provides LFS client SDKs in four languages. 
Each SDK handles envelope encoding/decoding, HTTP upload to the LFS proxy, and transparent S3 object resolution.
+
+## Go (built-in)
+
+The Go SDK lives in `pkg/lfs/` and is used internally by the LFS proxy, console, and processors.
+
+```go
+import "github.com/KafScale/platform/pkg/lfs"
+
+// Produce a large file
+producer := lfs.NewProducer(lfs.ProducerConfig{
+    ProxyAddr: "http://localhost:8080",
+    Topic:     "demo-topic",
+})
+err := producer.Upload(ctx, "large-file.bin", fileReader)
+
+// Consume and resolve
+consumer := lfs.NewConsumer(lfs.ConsumerConfig{
+    S3Bucket:   "kafscale",
+    S3Endpoint: "http://localhost:9000",
+})
+reader, err := consumer.Resolve(ctx, envelope)
+```
+
+## Java
+
+Maven-based SDK with retry/backoff and configurable HTTP timeouts.
+
+```xml
+<dependency>
+  <groupId>org.kafscale</groupId>
+  <artifactId>lfs-sdk</artifactId>
+  <version>0.1.0</version>
+</dependency>
+```
+
+```java
+import org.kafscale.lfs.LfsProducer;
+
+LfsProducer producer = new LfsProducer.Builder()
+    .proxyUrl("http://localhost:8080")
+    .topic("demo-topic")
+    .retryAttempts(3)
+    .connectTimeoutMs(5000)
+    .build();
+
+producer.upload("video.mp4", inputStream, "video/mp4");
+```
+
+Build from source:
+
+```bash
+cd lfs-client-sdk/java
+mvn clean package
+```
+
+## Python
+
+Pip-installable SDK with retry/backoff and envelope codec.
+
+```bash
+pip install lfs-client-sdk/python/
+```
+
+```python
+from lfs_sdk import LfsProducer
+
+producer = LfsProducer(
+    proxy_url="http://localhost:8080",
+    topic="demo-topic",
+    max_retries=3,
+)
+producer.upload("scan.dcm", open("scan.dcm", "rb"), content_type="application/dicom")
+```
+
+## JavaScript (Node.js)
+
+TypeScript SDK with streaming upload support.
+
+```bash
+cd lfs-client-sdk/js
+npm install && npm run build
+```
+
+```typescript
+import { LfsProducer } from '@kafscale/lfs-sdk';
+
+const producer = new LfsProducer({
+  proxyUrl: 'http://localhost:8080',
+  topic: 'demo-topic',
+});
+await producer.upload('payload.bin', readStream);
+```
+
+## Browser SDK
+
+Lightweight SDK for single-page applications.
Uploads files directly from the browser to the LFS proxy HTTP API. + +```typescript +import { LfsProducer } from '@kafscale/lfs-browser-sdk'; + +const producer = new LfsProducer({ + proxyUrl: 'http://lfs.example.com', + topic: 'uploads', +}); + +// Upload from a file input +const file = document.getElementById('fileInput').files[0]; +await producer.upload(file.name, file); +``` + +See the E72 browser demo (`examples/E72_browser-lfs-sdk-demo/`) for a complete single-page application. + +## Demo applications + +| Demo | Language | Description | +|---|---|---| +| E60 | — | Medical imaging LFS design | +| E61 | — | Video streaming LFS design | +| E62 | — | Industrial IoT LFS design | +| E70 | Java | Java SDK producer with video upload | +| E71 | Python | Python SDK video upload | +| E72 | Browser | Browser-native SPA file manager | + +Run any demo: + +```bash +make lfs-demo # Core LFS stack +make lfs-demo-medical # E60 medical demo +make lfs-demo-video # E61 video demo +``` + +## Related docs + +- [LFS Demos](/lfs-demos/) — Runnable demos for each SDK and industry scenario +- [LFS Proxy](/lfs-proxy/) — Proxy architecture and configuration +- [LFS Helm deployment](/lfs-helm/) — Kubernetes deployment guide