Feedback: Automated .airflowignore generation & Resource utilization concerns in Data Ingestion #116

@mckenzie-andrew

Hello team,

I am currently implementing the Google Cortex Framework at my organization and want to raise two concerns about the current setup, both related to Cloud Composer resource utilization and scalability.

1. DAG Parsing & Automation (.airflowignore)

Given the deeply nested nature of the DAGs directory and the extensive helper functionality, I am curious why an .airflowignore file is not baked into the automation deployment.

We are observing heavy utilization on the DAG Processor, most likely because it is parsing non-DAG helper files. We added an .airflowignore by hand to mitigate this, but manual maintenance is error-prone. Generating the file as part of the automation would ensure the DAG Processor always skips the correct non-DAG paths as the framework evolves.
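To illustrate the kind of automation I have in mind, here is a minimal sketch, not the framework's actual deployment code. It assumes that a file counts as a DAG file only if its source contains `DAG(` or `@dag` (a heuristic of mine), and that .airflowignore uses the regexp syntax. The names `DAG_MARKER`, `build_airflowignore`, and `write_airflowignore` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a file defines a DAG if it instantiates DAG(...) or uses @dag.
# This is a heuristic for illustration, not how Cortex identifies DAG files.
DAG_MARKER = re.compile(r"\bDAG\s*\(|@dag\b")


def build_airflowignore(dags_root: Path) -> list[str]:
    """Return one escaped regexp pattern per Python file that defines no DAG."""
    patterns = []
    for path in sorted(dags_root.rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="ignore")
        if DAG_MARKER.search(source):
            continue  # looks like a DAG file, so it must stay parseable
        rel = path.relative_to(dags_root).as_posix()
        # Escape the path so "." and similar characters match literally.
        patterns.append(re.escape(rel))
    return patterns


def write_airflowignore(dags_root: Path) -> Path:
    """Write the generated patterns to <dags_root>/.airflowignore."""
    target = dags_root / ".airflowignore"
    lines = build_airflowignore(dags_root)
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

Running `write_airflowignore(Path("dags"))` as a deployment step would regenerate the file on every release, so newly added helper modules are picked up without anyone editing it by hand. If the environment is configured for glob syntax instead, the escaping step would need to change.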

2. Ingestion Architecture & Resource Contention

I also have concerns about the design pattern in which Airflow workers call source APIs, serialize the results to CSV, and then move the files to GCS.

This pattern treats Airflow as a data processing engine rather than an orchestrator. This introduces significant serialization and network I/O overhead directly onto the worker nodes.

For context, we are running a medium environment with the following resources:

  • Scheduler: 1x (0.5 vCPU, 2 GB memory)
  • DAG Processor: 1x (1 vCPU, 4 GB memory)
  • Worker: Autoscaling 1-3 (1 vCPU, 4 GB memory)

With 1 vCPU per worker, the default concurrency (set to 9 tasks) causes immediate CPU contention when multiple ingestion tasks run in parallel. The current architecture forces us to scale workers vertically (increasing cost) to handle data movement that would be better suited to an external service (e.g., Cloud Run, Dataflow) or an ELT pattern (e.g., BigQuery Data Transfer Service).
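To put rough numbers on the contention, here is a back-of-the-envelope sketch using the figures above. The 0.25 vCPU budget per ingestion task is an assumption for illustration, not a measured value, and both function names are hypothetical.

```python
def cpu_share_per_task(worker_vcpu: float, concurrent_tasks: int) -> float:
    """vCPU available to each task when every slot on a worker is busy."""
    return worker_vcpu / concurrent_tasks


def max_concurrency(worker_vcpu: float, vcpu_per_task: float) -> int:
    """Largest per-worker concurrency that keeps each task at its CPU budget."""
    return max(1, int(worker_vcpu // vcpu_per_task))


# Figures from this issue: 1 vCPU per worker, concurrency of 9.
share = cpu_share_per_task(worker_vcpu=1.0, concurrent_tasks=9)
print(f"vCPU per task at full concurrency: {share:.2f}")

# Assumed budget of 0.25 vCPU per ingestion task (illustrative only).
print("Suggested per-worker concurrency:", max_concurrency(1.0, 0.25))
```

At full concurrency each task gets roughly a ninth of a vCPU. Lowering the worker concurrency (for example through an Airflow configuration override such as `[celery] worker_concurrency`) would reduce contention, but only by trading it for longer queue times, which is why I would rather see the data movement offloaded.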

We are currently at 70 DAGs, and I am concerned this pattern will not scale cost-effectively as we expand.

Is there a recommended workaround or a plan to offload this heavy lifting from the Airflow workers in future releases?

Thanks,
Andrew
