Feedback: Automated .airflowignore generation & Resource utilization concerns in Data Ingestion #116
Hello team,
I am currently implementing the Google Cortex Framework at my organization and want to raise two concerns about the current setup, specifically around Cloud Composer resource utilization and scalability.
1. DAG Parsing & Automation (.airflowignore)
Given the deeply nested nature of the DAGs directory and the extensive helper functionality, I am curious why an .airflowignore file is not baked into the automation deployment.
We are observing heavy DAG Processor utilization, likely because it is parsing non-DAG helper files. We manually added an .airflowignore to mitigate this, but manual maintenance is error-prone. Baking this into the automation would ensure the scheduler always ignores the correct non-DAG paths as the framework evolves.
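For illustration, a minimal .airflowignore could look like the following. Airflow treats each line as a pattern (regex by default) matched against paths under the DAGs folder; the directory names below are hypothetical placeholders, not the framework's actual layout:

```
# Hypothetical non-DAG paths -- substitute the framework's real helper dirs.
helpers
common
.*_util\.py
```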
2. Ingestion Architecture & Resource Contention
I also have concerns about the design pattern in which Airflow workers call source APIs, serialize the results to CSV, and then move the files to GCS.
This pattern treats Airflow as a data processing engine rather than an orchestrator. This introduces significant serialization and network I/O overhead directly onto the worker nodes.
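To make the distinction concrete, here is a minimal sketch of the orchestrator-style alternative: the Airflow task only triggers an external ingestion job and polls for completion, while the heavy API calls and CSV serialization happen in the external service. The `trigger_job` / `poll_status` callables are hypothetical stand-ins for whatever client (e.g. a Cloud Run job trigger) would be used in practice:

```python
import time
from typing import Callable


def orchestrate_ingestion(trigger_job: Callable[[], str],
                          poll_status: Callable[[str], str],
                          poll_interval: float = 0.01) -> str:
    """Orchestrator-style task body: start an external ingestion job
    (hypothetical trigger) and wait for a terminal state. The worker
    only polls; it does no data movement itself."""
    job_id = trigger_job()
    state = poll_status(job_id)
    while state == "RUNNING":
        time.sleep(poll_interval)
        state = poll_status(job_id)
    return state
```

Under this pattern the worker footprint per task is a few milliseconds of polling rather than sustained CPU and network I/O, which is what makes horizontal scaling of ingestion independent of Airflow sizing.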
For context, we are running a medium environment with the following resources:
- Scheduler: 1x (0.5 vCPU, 2 GB memory)
- DAG Processor: 1x (1 vCPU, 4 GB memory)
- Worker: Autoscaling 1-3 (1 vCPU, 4 GB memory)
With 1 vCPU per worker, the default worker concurrency (9 tasks per worker) causes immediate CPU contention whenever multiple ingestion tasks run in parallel. The current architecture forces us to scale workers vertically (increasing cost) to handle data movement that would be better suited to an external service (e.g., Cloud Run, Dataflow) or an ELT pattern (BigQuery Data Transfer).
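The contention is easy to quantify. Assuming the worker concurrency of 9 stated above, each CPU-bound task on a saturated 1-vCPU worker gets roughly a ninth of a core:

```python
def cpu_per_task(vcpus_per_worker: float, worker_concurrency: int) -> float:
    """Approximate CPU share per task when all concurrency slots on a
    worker are occupied by CPU-bound work (a simplifying assumption)."""
    return vcpus_per_worker / worker_concurrency


# Sizing from our environment: 1 vCPU per worker, 9 concurrent task slots.
share = cpu_per_task(1.0, 9)  # about 0.11 vCPU per task
```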
We are currently at 70 DAGs, and I am concerned this pattern will not scale cost-effectively as we expand.
Is there a recommended workaround or a plan to offload this heavy lifting from the Airflow workers in future releases?
Thanks,
Andrew