```bash
# Recommended for 8GB VPS
./scripts/start.sh default

# Add visualization
./scripts/start.sh viz

# Add monitoring
./scripts/start.sh monitoring

# Use MinIO instead of Azure
./scripts/start.sh minio

# Full cluster (requires 16GB+ RAM)
./scripts/start.sh full
```
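Internally, profile names like these typically map onto Docker Compose `--profile` flags. A hypothetical sketch of that dispatch — the real `start.sh` may differ:

```shell
#!/bin/sh
# Hypothetical sketch of how start.sh could translate a profile name
# into docker compose flags; the actual script may differ.
profile_flags() {
  case "$1" in
    default)    echo "" ;;                       # core services only
    viz)        echo "--profile viz" ;;          # + Superset
    monitoring) echo "--profile monitoring" ;;   # + Prometheus/Grafana
    minio)      echo "--profile minio" ;;        # MinIO instead of Azure
    full)       echo "--profile viz --profile monitoring" ;;  # everything
    *)          echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

# Example: assemble the final command line for a given profile
docker_cmd() {
  echo "docker compose $(profile_flags "$1") up -d"
}
```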
### Memory Allocation Table (Default Profile)

| Service | Memory Limit | Memory Reserved | CPU Limit |
|---|---|---|---|
| PostgreSQL | 512MB | 256MB | 0.5 |
| Redis | 256MB | 128MB | 0.25 |
| Hadoop NameNode | 512MB | 256MB | 0.5 |
| Hadoop DataNode | 512MB | 256MB | 0.5 |
| Spark Master | 512MB | 256MB | 0.5 |
| Spark Worker | 1536MB | 1024MB | 1.0 |
| Hive Metastore | 512MB | 256MB | 0.5 |
| Iceberg REST | 256MB | 128MB | 0.25 |
| Airflow (each) | 512MB | 256MB | 0.5 |
| **Total** | ~5.5GB | ~3.5GB | - |
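As a sanity check on the quoted total, the per-service limits do sum to roughly 5.5GB if "Airflow (each)" is counted twice (e.g. webserver plus scheduler — an assumption, since the table doesn't say how many Airflow containers run):

```python
# Sanity-check the table's total memory limit. The assumption of exactly
# two Airflow containers (webserver + scheduler) is illustrative.
limits_mb = {
    "postgres": 512, "redis": 256,
    "hadoop-namenode": 512, "hadoop-datanode": 512,
    "spark-master": 512, "spark-worker": 1536,
    "hive-metastore": 512, "iceberg-rest": 256,
    "airflow-webserver": 512, "airflow-scheduler": 512,
}
total_gb = sum(limits_mb.values()) / 1024
print(f"total limit: {total_gb:.1f}GB")  # ~5.5GB
```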
### Monitoring Memory Usage

```bash
# Real-time memory monitoring
docker stats --no-stream

# Check which services are running
docker compose ps

# View memory allocation
docker compose config | grep -A 5 "resources:"
```
## Azure Data Lake Storage Setup
### 1. Create Azure Storage Account

```bash
# Using Azure CLI
az storage account create \
  --name insighterastorage \
  --resource-group your-resource-group \
  --location southeastasia \
  --sku Standard_LRS \
  --kind StorageV2 \
  --hierarchical-namespace true  # Required for ADLS Gen2
```
### 2. Create Containers

```bash
# Create required containers
az storage container create --name warehouse --account-name insighterastorage
az storage container create --name raw-data --account-name insighterastorage
az storage container create --name spark-logs --account-name insighterastorage
```
### 3. Get Access Keys

```bash
# Get storage account key
az storage account keys list \
  --account-name insighterastorage \
  --query '[0].value' -o tsv
```
### 4. Configure `.env`

```bash
# Azure Data Lake Storage Gen2
AZURE_STORAGE_ACCOUNT=insighterastorage
AZURE_STORAGE_KEY=your_storage_key_here
AZURE_STORAGE_CONTAINER=warehouse
AZURE_CATALOG_WAREHOUSE=abfss://warehouse@insighterastorage.dfs.core.windows.net/iceberg/
```
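The `AZURE_CATALOG_WAREHOUSE` value follows the standard ABFS URI scheme, `abfss://<container>@<account>.dfs.core.windows.net/<path>`. A small helper (illustrative, not part of the repo) that assembles it from the other `.env` values:

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build an ADLS Gen2 (abfss://) URI from container, account, and path."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

# Reproduces the AZURE_CATALOG_WAREHOUSE value above
uri = abfss_uri("warehouse", "insighterastorage", "iceberg/")
print(uri)  # abfss://warehouse@insighterastorage.dfs.core.windows.net/iceberg/
```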
### 5. Using Azure in Spark

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AzureIcebergExample") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "rest") \
    .config("spark.sql.catalog.iceberg.uri", "http://iceberg-rest:8181") \
    .getOrCreate()

# Write to Azure via Iceberg
df.writeTo("iceberg.warehouse.my_table").create()

# Read from Azure via Iceberg
df = spark.table("iceberg.warehouse.my_table")

# Direct Azure access
df = spark.read.parquet("abfss://raw-data@insighterastorage.dfs.core.windows.net/")
```
```bash
# Connect to Spark SQL
docker exec -it insightera-spark-master spark-sql
```

```sql
-- Create namespace and table
CREATE NAMESPACE IF NOT EXISTS iceberg.warehouse;

CREATE TABLE iceberg.warehouse.events (
    id STRING,
    event_time TIMESTAMP,
    event_type STRING,
    user_id STRING,
    data STRING
)
USING iceberg
PARTITIONED BY (days(event_time), event_type);

-- Insert data
INSERT INTO iceberg.warehouse.events VALUES
    ('1', current_timestamp(), 'click', 'user_001', '{"page": "/home"}'),
    ('2', current_timestamp(), 'view', 'user_002', '{"page": "/product"}');

-- Query with time travel
SELECT * FROM iceberg.warehouse.events VERSION AS OF 1;

-- Schema evolution
ALTER TABLE iceberg.warehouse.events ADD COLUMN source STRING;
```
### PySpark with Azure Data Lake

```python
from pyspark.sql import SparkSession

# Session is pre-configured with Azure credentials
spark = SparkSession.builder \
    .appName("AzureDataLakeExample") \
    .getOrCreate()

# Read from Azure Data Lake directly
df_raw = spark.read.json(
    "abfss://raw-data@{account}.dfs.core.windows.net/events/"
)

# Write to Iceberg table (stored in Azure)
df_raw.writeTo("iceberg.warehouse.events_processed").create()

# Read from Iceberg with time travel
df_v1 = spark.read \
    .option("snapshot-id", 1234567890) \
    .table("iceberg.warehouse.events_processed")

# Incremental read
df_changes = spark.read \
    .option("start-snapshot-id", 123) \
    .option("end-snapshot-id", 456) \
    .table("iceberg.warehouse.events_processed")
```
```bash
# Start HiveServer2 first
docker compose --profile hive up -d hive-server

# Connect to HiveServer2
docker exec -it insightera-hive-server beeline \
  -u "jdbc:hive2://localhost:10000"
```

```sql
-- Query Iceberg tables via Hive
SHOW DATABASES;
USE iceberg;
SHOW TABLES;
SELECT * FROM warehouse.events LIMIT 10;
```
Pre-configured alerts in `prometheus/alerts/alerts.yml`:

- High CPU Usage (>80% for 5min)
- High Memory Usage (>85%)
- Disk Space Warning (<20% free)
- Container Down
- HDFS Capacity Warning (<10% free)
- Spark Worker Unavailable
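For reference, the CPU alert above could be expressed as a Prometheus alerting rule along these lines — an illustrative sketch, not necessarily the exact contents of `alerts.yml`, and the `node_cpu_seconds_total` metric assumes node-exporter is being scraped:

```yaml
groups:
  - name: resource-alerts
    rules:
      - alert: HighCPUUsage
        # CPU busy > 80% averaged over 5 minutes
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage (>80% for 5min)"
```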
```bash
# Scale Spark workers (requires --profile scale for worker-2)
docker compose --profile scale up -d

# Adjust memory in .env
SPARK_WORKER_MEMORY=2g  # If you have more RAM

# Apply changes
docker compose up -d
```
## Deployment to Remote Server
### Using Deploy Script

```bash
# Deploy to your VPS
./scripts/deploy.sh 157.10.252.183 ~/.ssh/your_key your_user
```

What it does:

1. Syncs project files via rsync
2. Installs Docker if needed
3. Configures environment
4. Builds and starts services
### Manual Deployment

```bash
# SSH to server
ssh -i ~/.ssh/your_key user@your-server

# Clone or copy files
git clone https://github.com/your-repo/insightera-cluster.git
cd insightera-cluster

# Configure environment
cp .env.example .env
vim .env  # Add Azure credentials

# Start cluster
chmod +x scripts/*.sh
./scripts/start.sh default
./scripts/healthcheck.sh
```
## Security Checklist

Before production deployment:

- [ ] Change all default passwords in `.env`
- [ ] Generate a new Fernet key for Airflow
- [ ] Generate a new secret key for Superset
- [ ] Configure an Azure Service Principal (instead of a storage key)
- [ ] Enable SSL/TLS for external access
- [ ] Set up firewall rules
- [ ] Configure proper RBAC in Azure
- [ ] Enable audit logging
```bash
# Generate Fernet key for Airflow
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

# Generate random secret for Superset
openssl rand -hex 32
```
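If the `cryptography` package isn't installed on the host, the same key format can be produced with the standard library alone — a Fernet key is just 32 random bytes, URL-safe base64-encoded (44 characters):

```python
import base64
import os

def generate_fernet_key() -> bytes:
    # Equivalent output format to cryptography's Fernet.generate_key():
    # 32 random bytes, URL-safe base64-encoded.
    return base64.urlsafe_b64encode(os.urandom(32))

key = generate_fernet_key()
print(len(key))  # 44
```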
```bash
# Check logs for specific service
docker compose logs -f postgres
docker compose logs -f spark-master

# Check container status
docker compose ps

# Restart with clean state
docker compose down
docker compose up -d
```
### Azure Connection Issues

```bash
# Verify Azure credentials
docker exec insightera-spark-master env | grep AZURE

# Test Azure connectivity from Spark
docker exec -it insightera-spark-master spark-shell
```

```scala
// In the Spark shell:
spark.read.text("abfss://warehouse@youraccount.dfs.core.windows.net/test.txt")
```

Common errors:

- `No FileSystem for scheme: abfss` → missing Azure libraries
- `403 Forbidden` → wrong storage key or SAS token
- `Account not found` → wrong account name
```bash
# Run full health check
./scripts/healthcheck.sh

# Check individual services
curl -s http://localhost:9870            # Hadoop NameNode
curl -s http://localhost:8080            # Spark Master
curl -s http://localhost:8082/health     # Airflow
curl -s http://localhost:8181/v1/config  # Iceberg REST

# Check container health status
docker inspect --format='{{.State.Health.Status}}' insightera-postgres
```
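`docker inspect` emits a JSON array with the health status at `[0].State.Health.Status` (the `Health` key is absent when a container defines no healthcheck). A small parser for scripting checks across containers — a sketch, not part of the repo's scripts:

```python
import json

def container_health(inspect_output: str) -> str:
    """Extract health status from `docker inspect <name>` JSON output.

    Falls back to the run state (running/exited) when the container
    has no healthcheck configured.
    """
    data = json.loads(inspect_output)
    state = data[0].get("State", {})
    health = state.get("Health")
    return health["Status"] if health else state.get("Status", "unknown")

# Example with a trimmed-down inspect payload
sample = '[{"State": {"Status": "running", "Health": {"Status": "healthy"}}}]'
print(container_health(sample))  # healthy
```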
### Performance Tuning for 8GB RAM

```properties
# Reduce shuffle partitions (in spark-defaults.conf or the SparkSession)
spark.sql.shuffle.partitions=20   # instead of the default 200

# Enable memory-efficient settings
spark.memory.fraction=0.4
spark.memory.storageFraction=0.3

# Use disk-based shuffle
spark.shuffle.spill=true
spark.shuffle.spill.compress=true
```
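The same settings can also be applied per-session rather than via `spark-defaults.conf`. A minimal sketch (the helper name is illustrative; values are taken from the list above):

```python
# Low-memory Spark settings, applied at SparkSession build time.
LOW_MEM_CONF = {
    "spark.sql.shuffle.partitions": "20",
    "spark.memory.fraction": "0.4",
    "spark.memory.storageFraction": "0.3",
    "spark.shuffle.spill.compress": "true",
}

def with_low_mem_conf(builder):
    """Apply the low-memory settings to a SparkSession builder."""
    for key, value in LOW_MEM_CONF.items():
        builder = builder.config(key, value)
    return builder

# Usage (requires pyspark):
# spark = with_low_mem_conf(SparkSession.builder.appName("tuned")).getOrCreate()
```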
## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please read our contributing guidelines before submitting PRs.

*Built with ❤️ for Big Data enthusiasts*

*Optimized for 8GB RAM VPS with Azure Data Lake Storage integration*
## About

Production-ready Big Data analytics platform optimized for 8GB RAM VPS with Azure Data Lake Storage Gen2. Features Hadoop HDFS 3.3.6, Spark 3.5.0, Iceberg 1.4.2, Hive 4.0.0, Airflow 2.8.0, Superset 3.1.0, Prometheus, and Grafana.