Feedback: Automated .airflowignore generation & Resource utilization concerns in Data Ingestion #116
Hello team,
I am currently implementing the Google Cortex Framework at my organization and want to raise two concerns about the current setup, specifically around Cloud Composer resource utilization and scalability.
1. DAG Parsing & Automation (.airflowignore)
Given the deeply nested nature of the DAGs directory and the extensive helper functionality, I am curious why an .airflowignore file is not baked into the automation deployment.
We are observing heavy DAG Processor utilization, likely because it is parsing non-DAG helper files. We manually added an .airflowignore to mitigate this, but manual maintenance is error-prone. Baking this into the automation would ensure the scheduler always ignores the correct non-DAG paths as the framework evolves.
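For illustration, a minimal .airflowignore could look like the following. Airflow treats each line as a pattern (regex by default) matched against paths under the DAGs folder; the directory names below are hypothetical placeholders, not the framework's actual layout:

```
# Hypothetical non-DAG paths -- substitute the framework's real helper dirs.
helpers
common
.*_util\.py
```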
2. Ingestion Architecture & Resource Contention
I also have concerns about the design pattern in which Airflow workers call source APIs, serialize the results to CSV, and then move the files to GCS.
This pattern treats Airflow as a data processing engine rather than an orchestrator. This introduces significant serialization and network I/O overhead directly onto the worker nodes.
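To make the distinction concrete, here is a minimal sketch of the orchestrator-style alternative: the Airflow task only triggers an external ingestion job and polls for completion, while the heavy API calls and CSV serialization happen in the external service. The `trigger_job` / `poll_status` callables are hypothetical stand-ins for whatever client (e.g. a Cloud Run job trigger) would be used in practice:

```python
import time
from typing import Callable


def orchestrate_ingestion(trigger_job: Callable[[], str],
                          poll_status: Callable[[str], str],
                          poll_interval: float = 0.01) -> str:
    """Orchestrator-style task body: start an external ingestion job
    (hypothetical trigger) and wait for a terminal state. The worker
    only polls; it does no data movement itself."""
    job_id = trigger_job()
    state = poll_status(job_id)
    while state == "RUNNING":
        time.sleep(poll_interval)
        state = poll_status(job_id)
    return state
```

Under this pattern the worker footprint per task is a few milliseconds of polling rather than sustained CPU and network I/O, which is what makes horizontal scaling of ingestion independent of Airflow sizing.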
For context, we are running a medium environment with the following resources:
- Scheduler: 1x (0.5 vCPU, 2 GB memory)
- DAG Processor: 1x (1 vCPU, 4 GB memory)
- Worker: Autoscaling 1-3 (1 vCPU, 4 GB memory)
With 1 vCPU per worker, the default worker concurrency (9 tasks per worker) causes immediate CPU contention whenever multiple ingestion tasks run in parallel. The current architecture forces us to scale workers vertically (increasing cost) to handle data movement that would be better suited to an external service (e.g., Cloud Run, Dataflow) or an ELT pattern (BigQuery Data Transfer).
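The contention is easy to quantify. Assuming the worker concurrency of 9 stated above, each CPU-bound task on a saturated 1-vCPU worker gets roughly a ninth of a core:

```python
def cpu_per_task(vcpus_per_worker: float, worker_concurrency: int) -> float:
    """Approximate CPU share per task when all concurrency slots on a
    worker are occupied by CPU-bound work (a simplifying assumption)."""
    return vcpus_per_worker / worker_concurrency


# Sizing from our environment: 1 vCPU per worker, 9 concurrent task slots.
share = cpu_per_task(1.0, 9)  # about 0.11 vCPU per task
```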
We are currently at 70 DAGs, and I am concerned this pattern will not scale cost-effectively as we expand.
Is there a recommended workaround or a plan to offload this heavy lifting from the Airflow workers in future releases?
Thanks,
Andrew