```bash
# Recommended for 8GB VPS
./scripts/start.sh default

# Add visualization
./scripts/start.sh viz

# Add monitoring
./scripts/start.sh monitoring

# Use MinIO instead of Azure
./scripts/start.sh minio

# Full cluster (requires 16GB+ RAM)
./scripts/start.sh full
```
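Internally, profile names like these typically map onto Docker Compose `--profile` flags. A hypothetical sketch of that dispatch — the real `start.sh` may differ:

```shell
#!/bin/sh
# Hypothetical sketch of how start.sh could translate a profile name
# into docker compose flags; the actual script may differ.
profile_flags() {
  case "$1" in
    default)    echo "" ;;                       # core services only
    viz)        echo "--profile viz" ;;          # + Superset
    monitoring) echo "--profile monitoring" ;;   # + Prometheus/Grafana
    minio)      echo "--profile minio" ;;        # MinIO instead of Azure
    full)       echo "--profile viz --profile monitoring" ;;  # everything
    *)          echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

# Example: assemble the final command line for a given profile
docker_cmd() {
  echo "docker compose $(profile_flags "$1") up -d"
}
```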
### Memory Allocation Table (Default Profile)

| Service | Memory Limit | Memory Reserved | CPU Limit |
|---|---|---|---|
| PostgreSQL | 512MB | 256MB | 0.5 |
| Redis | 256MB | 128MB | 0.25 |
| Hadoop NameNode | 512MB | 256MB | 0.5 |
| Hadoop DataNode | 512MB | 256MB | 0.5 |
| Spark Master | 512MB | 256MB | 0.5 |
| Spark Worker | 1536MB | 1024MB | 1.0 |
| Hive Metastore | 512MB | 256MB | 0.5 |
| Iceberg REST | 256MB | 128MB | 0.25 |
| Airflow (each) | 512MB | 256MB | 0.5 |
| **Total** | ~5.5GB | ~3.5GB | - |
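As a sanity check on the quoted total, the per-service limits do sum to roughly 5.5GB if "Airflow (each)" is counted twice (e.g. webserver plus scheduler — an assumption, since the table doesn't say how many Airflow containers run):

```python
# Sanity-check the table's total memory limit. The assumption of exactly
# two Airflow containers (webserver + scheduler) is illustrative.
limits_mb = {
    "postgres": 512, "redis": 256,
    "hadoop-namenode": 512, "hadoop-datanode": 512,
    "spark-master": 512, "spark-worker": 1536,
    "hive-metastore": 512, "iceberg-rest": 256,
    "airflow-webserver": 512, "airflow-scheduler": 512,
}
total_gb = sum(limits_mb.values()) / 1024
print(f"total limit: {total_gb:.1f}GB")  # ~5.5GB
```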
### Monitoring Memory Usage

```bash
# Real-time memory monitoring
docker stats --no-stream

# Check which services are running
docker compose ps

# View memory allocation
docker compose config | grep -A 5 "resources:"
```
## Azure Data Lake Storage Setup
### 1. Create Azure Storage Account

```bash
# Using Azure CLI
az storage account create \
  --name insighterastorage \
  --resource-group your-resource-group \
  --location southeastasia \
  --sku Standard_LRS \
  --kind StorageV2 \
  --hierarchical-namespace true  # Required for ADLS Gen2
```
### 2. Create Containers

```bash
# Create required containers
az storage container create --name warehouse --account-name insighterastorage
az storage container create --name raw-data --account-name insighterastorage
az storage container create --name spark-logs --account-name insighterastorage
```
### 3. Get Access Keys

```bash
# Get storage account key
az storage account keys list \
  --account-name insighterastorage \
  --query '[0].value' -o tsv
```
### 4. Configure `.env`

```bash
# Azure Data Lake Storage Gen2
AZURE_STORAGE_ACCOUNT=insighterastorage
AZURE_STORAGE_KEY=your_storage_key_here
AZURE_STORAGE_CONTAINER=warehouse
AZURE_CATALOG_WAREHOUSE=abfss://warehouse@insighterastorage.dfs.core.windows.net/iceberg/
```
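The `AZURE_CATALOG_WAREHOUSE` value follows the standard ABFS URI scheme, `abfss://<container>@<account>.dfs.core.windows.net/<path>`. A small helper (illustrative, not part of the repo) that assembles it from the other `.env` values:

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build an ADLS Gen2 (abfss://) URI from container, account, and path."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

# Reproduces the AZURE_CATALOG_WAREHOUSE value above
uri = abfss_uri("warehouse", "insighterastorage", "iceberg/")
print(uri)  # abfss://warehouse@insighterastorage.dfs.core.windows.net/iceberg/
```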
### 5. Using Azure in Spark

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AzureIcebergExample") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "rest") \
    .config("spark.sql.catalog.iceberg.uri", "http://iceberg-rest:8181") \
    .getOrCreate()

# Write to Azure via Iceberg
df.writeTo("iceberg.warehouse.my_table").create()

# Read from Azure via Iceberg
df = spark.table("iceberg.warehouse.my_table")

# Direct Azure access
df = spark.read.parquet("abfss://raw-data@insighterastorage.dfs.core.windows.net/")
```
```bash
# Connect to Spark SQL
docker exec -it insightera-spark-master spark-sql
```

```sql
-- Create namespace and table
CREATE NAMESPACE IF NOT EXISTS iceberg.warehouse;

CREATE TABLE iceberg.warehouse.events (
    id STRING,
    event_time TIMESTAMP,
    event_type STRING,
    user_id STRING,
    data STRING
)
USING iceberg
PARTITIONED BY (days(event_time), event_type);

-- Insert data
INSERT INTO iceberg.warehouse.events VALUES
    ('1', current_timestamp(), 'click', 'user_001', '{"page": "/home"}'),
    ('2', current_timestamp(), 'view', 'user_002', '{"page": "/product"}');

-- Query with time travel
SELECT * FROM iceberg.warehouse.events VERSION AS OF 1;

-- Schema evolution
ALTER TABLE iceberg.warehouse.events ADD COLUMN source STRING;
```
### PySpark with Azure Data Lake

```python
from pyspark.sql import SparkSession

# Session is pre-configured with Azure credentials
spark = SparkSession.builder \
    .appName("AzureDataLakeExample") \
    .getOrCreate()

# Read from Azure Data Lake directly
df_raw = spark.read.json(
    "abfss://raw-data@{account}.dfs.core.windows.net/events/"
)

# Write to Iceberg table (stored in Azure)
df_raw.writeTo("iceberg.warehouse.events_processed").create()

# Read from Iceberg with time travel
df_v1 = spark.read \
    .option("snapshot-id", 1234567890) \
    .table("iceberg.warehouse.events_processed")

# Incremental read
df_changes = spark.read \
    .option("start-snapshot-id", 123) \
    .option("end-snapshot-id", 456) \
    .table("iceberg.warehouse.events_processed")
```
```bash
# Start HiveServer2 first
docker compose --profile hive up -d hive-server

# Connect to HiveServer2
docker exec -it insightera-hive-server beeline \
  -u "jdbc:hive2://localhost:10000"
```

```sql
-- Query Iceberg tables via Hive
SHOW DATABASES;
USE iceberg;
SHOW TABLES;
SELECT * FROM warehouse.events LIMIT 10;
```
Pre-configured alerts in `prometheus/alerts/alerts.yml`:

- High CPU Usage (>80% for 5min)
- High Memory Usage (>85%)
- Disk Space Warning (<20% free)
- Container Down
- HDFS Capacity Warning (<10% free)
- Spark Worker Unavailable
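For reference, the CPU alert above could be expressed as a Prometheus alerting rule along these lines — an illustrative sketch, not necessarily the exact contents of `alerts.yml`, and the `node_cpu_seconds_total` metric assumes node-exporter is being scraped:

```yaml
groups:
  - name: resource-alerts
    rules:
      - alert: HighCPUUsage
        # CPU busy > 80% averaged over 5 minutes
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage (>80% for 5min)"
```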
```bash
# Scale Spark workers (requires --profile scale for worker-2)
docker compose --profile scale up -d

# Adjust memory in .env
SPARK_WORKER_MEMORY=2g  # If you have more RAM

# Apply changes
docker compose up -d
```
## Deployment to Remote Server
### Using Deploy Script

```bash
# Deploy to your VPS
./scripts/deploy.sh 157.10.252.183 ~/.ssh/your_key your_user
```

What it does:

1. Syncs project files via rsync
2. Installs Docker if needed
3. Configures environment
4. Builds and starts services
### Manual Deployment

```bash
# SSH to server
ssh -i ~/.ssh/your_key user@your-server

# Clone or copy files
git clone https://github.com/your-repo/insightera-cluster.git
cd insightera-cluster

# Configure environment
cp .env.example .env
vim .env  # Add Azure credentials

# Start cluster
chmod +x scripts/*.sh
./scripts/start.sh default
./scripts/healthcheck.sh
```
## Security Checklist

Before production deployment:

- [ ] Change all default passwords in `.env`
- [ ] Generate a new Fernet key for Airflow
- [ ] Generate a new secret key for Superset
- [ ] Configure an Azure Service Principal (instead of a storage key)
- [ ] Enable SSL/TLS for external access
- [ ] Set up firewall rules
- [ ] Configure proper RBAC in Azure
- [ ] Enable audit logging
```bash
# Generate Fernet key for Airflow
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

# Generate random secret for Superset
openssl rand -hex 32
```
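If the `cryptography` package isn't installed on the host, the same key format can be produced with the standard library alone — a Fernet key is just 32 random bytes, URL-safe base64-encoded (44 characters):

```python
import base64
import os

def generate_fernet_key() -> bytes:
    # Equivalent output format to cryptography's Fernet.generate_key():
    # 32 random bytes, URL-safe base64-encoded.
    return base64.urlsafe_b64encode(os.urandom(32))

key = generate_fernet_key()
print(len(key))  # 44
```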
```bash
# Check logs for specific service
docker compose logs -f postgres
docker compose logs -f spark-master

# Check container status
docker compose ps

# Restart with clean state
docker compose down
docker compose up -d
```
### Azure Connection Issues

```bash
# Verify Azure credentials
docker exec insightera-spark-master env | grep AZURE

# Test Azure connectivity from Spark
docker exec -it insightera-spark-master spark-shell
```

```scala
// In the Spark shell:
spark.read.text("abfss://warehouse@youraccount.dfs.core.windows.net/test.txt")
```

Common errors:

- `No FileSystem for scheme: abfss` → missing Azure libraries
- `403 Forbidden` → wrong storage key or SAS token
- `Account not found` → wrong account name
```bash
# Run full health check
./scripts/healthcheck.sh

# Check individual services
curl -s http://localhost:9870            # Hadoop NameNode
curl -s http://localhost:8080            # Spark Master
curl -s http://localhost:8082/health     # Airflow
curl -s http://localhost:8181/v1/config  # Iceberg REST

# Check container health status
docker inspect --format='{{.State.Health.Status}}' insightera-postgres
```
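`docker inspect` emits a JSON array with the health status at `[0].State.Health.Status` (the `Health` key is absent when a container defines no healthcheck). A small parser for scripting checks across containers — a sketch, not part of the repo's scripts:

```python
import json

def container_health(inspect_output: str) -> str:
    """Extract health status from `docker inspect <name>` JSON output.

    Falls back to the run state (running/exited) when the container
    has no healthcheck configured.
    """
    data = json.loads(inspect_output)
    state = data[0].get("State", {})
    health = state.get("Health")
    return health["Status"] if health else state.get("Status", "unknown")

# Example with a trimmed-down inspect payload
sample = '[{"State": {"Status": "running", "Health": {"Status": "healthy"}}}]'
print(container_health(sample))  # healthy
```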
### Performance Tuning for 8GB RAM

```properties
# Reduce shuffle partitions (in spark-defaults.conf or the SparkSession)
spark.sql.shuffle.partitions=20   # instead of the default 200

# Enable memory-efficient settings
spark.memory.fraction=0.4
spark.memory.storageFraction=0.3

# Use disk-based shuffle
spark.shuffle.spill=true
spark.shuffle.spill.compress=true
```
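The same settings can also be applied per-session rather than via `spark-defaults.conf`. A minimal sketch (the helper name is illustrative; values are taken from the list above):

```python
# Low-memory Spark settings, applied at SparkSession build time.
LOW_MEM_CONF = {
    "spark.sql.shuffle.partitions": "20",
    "spark.memory.fraction": "0.4",
    "spark.memory.storageFraction": "0.3",
    "spark.shuffle.spill.compress": "true",
}

def with_low_mem_conf(builder):
    """Apply the low-memory settings to a SparkSession builder."""
    for key, value in LOW_MEM_CONF.items():
        builder = builder.config(key, value)
    return builder

# Usage (requires pyspark):
# spark = with_low_mem_conf(SparkSession.builder.appName("tuned")).getOrCreate()
```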
## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please read our contributing guidelines before submitting PRs.

*Built with ❤️ for Big Data enthusiasts*

*Optimized for 8GB RAM VPS with Azure Data Lake Storage integration*
## About

Production-ready Big Data analytics platform optimized for 8GB RAM VPS with Azure Data Lake Storage Gen2. Features Hadoop HDFS 3.3.6, Spark 3.5.0, Iceberg 1.4.2, Hive 4.0.0, Airflow 2.8.0, Superset 3.1.0, Prometheus, and Grafana.