# aef-data-model

**Repository Path**: mirrors_GoogleCloudPlatform/aef-data-model

## Basic Information

- **Project Name**: aef-data-model
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-03
- **Last Updated**: 2026-03-01

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Analytics Engineering Framework - Data Model

***Note:*** For a comprehensive installation guide covering all the AEF repositories together, please look [here](https://github.com/googlecloudplatform/aef-orchestration-framework/blob/main/AEF_DEPLOYMENT.md).

[Analytics engineers](https://www.getdbt.com/what-is-analytics-engineering) lay the foundation for others to organize, transform, and document data using software engineering principles. They provide easy-to-use data platforms that empower data practitioners to independently build data pipelines in a standardized and scalable way, and to answer their own data-driven questions.

This repository is designed for managing your data models (BigQuery, BigLake, etc.) and deploying them from various sources (SQL files, Dataform repositories, etc.). Additionally, it facilitates defining and deploying Dataplex metadata (lakes, zones, assets, annotations, policy tags, etc.).

While usable independently, this tool is optimized as a component within a comprehensive Analytics Engineering Framework comprising:

1. [Orchestration Framework](https://github.com/googlecloudplatform/aef-orchestration-framework): Maintained by Analytics Engineers to provide seamless, extensible orchestration and execution infrastructure.
1. **(This repository) Data Model**: Directly used by end data practitioners to manage data models, schemas, and Dataplex metadata.
1. [Data Orchestration](https://github.com/googlecloudplatform/aef-data-orchestration): Directly used by end data practitioners to define and deploy data pipelines using levels, threads, and steps.
1. [Data Transformation](https://github.com/googlecloudplatform/aef-data-transformation): Directly used by end data practitioners to define, store, and deploy data transformations.

![AEF_repositories_data_model.png](AEF_repositories_data_model.png)

### Concepts

This reference Data Model management repository is your central hub for streamlined data model definition. It deploys Dataform repositories and integrates with BigQuery metadata and [Dataplex](https://cloud.google.com/dataplex) to enable data governance, discoverability, and access control. Here's what it offers:

- Creates or references Dataform repositories from given [third-party Git repositories](https://cloud.google.com/dataform/docs/connect-repository).
- Creates or references BigQuery datasets based on given parameters or on the variables defined in the [Dataform settings files](https://cloud.google.com/dataform/docs/configure-dataform) (such as `dataform.json`) in your Dataform repositories.
- Creates or references GCS buckets containing DDLs.
- Creates or references GCS data buckets.
- Creates Dataplex lakes and zones, and registers data assets (GCS buckets or BigQuery datasets) accordingly, based on given parameters.
- Defines tag templates, policy taxonomies, governance rules, and annotations to be applied to data assets based on given [parameter files](https://cloud.google.com/cortex/docs/datamesh#metadata_resources).

![data_model&metadata_trategy.png](data_model&metadata_trategy.png)

### Referenced Dataform Repositories

If your data model includes datasets referenced in Dataform repositories, you can choose to create those datasets as part of this data model deployment by setting the Terraform variable ***create_dataform_datasets*** to ```true```. A related variable, ***create_dataform_repositories***, governs the Dataform repositories themselves.
It controls whether the Dataform repositories and the scripts found in them will be created alongside other Terraform resources. If set to ```false```, Dataform repositories should be created as an additional step in your CI/CD pipeline.

If you decide to use this repository to create the datasets referenced in your Dataform repositories, define those datasets in the ***dataform.json*** file using the following format:

```json
{
  ...
  "vars": {
    "dataset_id_": "",
    "dataset_projectid_": "",
    "dataset_location_": "",
    ...
  }
}
```

## Usage

### 1. Dataplex:

- Familiarize yourself with [these](https://cloud.google.com/cortex/docs/datamesh-concepts) concepts.
- Define metadata in the following `.yaml` files:
    - [Asset Annotations](https://cloud.google.com/cortex/docs/datamesh#asset_annotations)
    - Dataplex [Lakes](https://cloud.google.com/cortex/docs/datamesh#Dataplex-Lakes), Zones, and Assets
    - [Policy Taxonomies and Tags](https://cloud.google.com/cortex/docs/datamesh#policy_taxonomies_and_tags)
    - [Data Catalog Tag Templates](https://cloud.google.com/cortex/docs/datamesh#catalog_tag_templates)

```
└── metadata
    ├── annotations
    │   └── annotations.yaml
    ├── lakes
    │   └── lakes.yaml
    ├── policy_taxonomies
    │   └── policy_taxonomies.yaml
    └── tag_templates
        └── tag_templates.yaml
```

### 2. Tracking your DDLs

#### Option 1 - Dataplex Auto-Discovery

- Dataplex has been configured to auto-create BigQuery tables for data files present in the [data_buckets](terraform/variables.tf#L100).
- Some examples are present in the [demo-data sub-folder here](https://github.com/GoogleCloudPlatform/aef-data-model/blob/542ccd0c4639c88246fe2a28fd58ad7be1365948/sample-data/terraform/demo.tfvars#L15).
- The discovery job runs once per hour, but you can override this by changing the trigger to a custom cron job scheduled a few minutes from the current time.
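The dataset-variable convention shown in the `dataform.json` snippet above can be read mechanically: variables sharing a prefix describe one dataset. Below is a minimal Python sketch of that idea — the sample variable names and values are illustrative assumptions, not part of this repository's Terraform:

```python
import json

# Sample Dataform settings content following the dataset_*_ variable
# convention shown above. All names and values here are hypothetical.
SAMPLE_DATAFORM_JSON = """
{
  "defaultSchema": "dataform",
  "vars": {
    "landing_dataset_id_": "landing",
    "landing_dataset_projectid_": "my-project",
    "landing_dataset_location_": "us-central1"
  }
}
"""

SUFFIXES = ("dataset_id_", "dataset_projectid_", "dataset_location_")

def extract_datasets(settings_text: str) -> list:
    """Group dataset_*_ vars by their shared prefix into one spec per dataset."""
    vars_ = json.loads(settings_text).get("vars", {})
    datasets = {}
    for key, value in vars_.items():
        for suffix in SUFFIXES:
            if key.endswith(suffix):
                prefix = key[: -len(suffix)]
                datasets.setdefault(prefix, {})[suffix.rstrip("_")] = value
                break
    return list(datasets.values())

print(extract_datasets(SAMPLE_DATAFORM_JSON))
# → [{'dataset_id': 'landing', 'dataset_projectid': 'my-project', 'dataset_location': 'us-central1'}]
```

A helper like this makes it easy to sanity-check a repository's `dataform.json` before enabling `create_dataform_datasets`.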
#### Option 2 - Create a custom external table using SQL

- Depending on your configuration of the [run_ddls_in_buckets](terraform/variables.tf#L119) and [ddl_buckets](terraform/variables.tf#L125) parameters, `.sql` files in the referenced DDL buckets will run (at terraform plan/apply time).
- Following the [sample-data example](https://github.com/GoogleCloudPlatform/aef-data-model/blob/542ccd0c4639c88246fe2a28fd58ad7be1365948/sample-data/terraform/demo.tfvars#L43), you could place your DDL files in your repo at a path like `aef-data-model/sample-data/gcs-files/*.sql`, keeping an explicit record in your repository of the DDLs in your environment.
- Or just place your DDLs in your [referenced DDL buckets](terraform/variables.tf#L125); DDLs there will run if possible (at terraform plan/apply time).

#### Option 3 - Use Dataform to track your DDLs

- Depending on your configuration of the [compile_dataform_repositories](terraform/variables.tf#L47) and [execute_dataform_repositories](terraform/variables.tf#L53) parameters, `.sqlx` files with the ``ddl`` Dataform execution tag will run (at terraform plan/apply time).
- [Reference your Dataform repositories](https://github.com/GoogleCloudPlatform/aef-data-model/blob/542ccd0c4639c88246fe2a28fd58ad7be1365948/terraform/prod.tfvars#L11).
- Add your DDLs to your Dataform repositories, as in the example [here](https://github.com/oscarpulido55/aef-sample-dataform-repo/blob/main/definitions/sources/raw_locations.sqlx).

### 3. Terraform

While this repository can be used to keep track of your data model and metadata, the provided Terraform code can either control the deployment itself or simply be referenced, so you can deploy it as another step in your CI/CD pipeline.

Define your Terraform variables. It is recommended to create a `.tfvars` file.
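Option 2 above boils down to rendering and running plain `.sql` DDL files. The sketch below approximates that step locally in Python — the `${...}` placeholder convention, file names, and values are assumptions for illustration; the actual execution against BigQuery is handled by the Terraform in this repository:

```python
import pathlib
import string
import tempfile

def render_ddls(ddl_dir, params):
    """Collect .sql files in name order and fill $-style placeholders."""
    rendered = []
    for path in sorted(pathlib.Path(ddl_dir).glob("*.sql")):
        rendered.append(string.Template(path.read_text()).safe_substitute(params))
    return rendered

# A throwaway directory stands in for the DDL bucket contents;
# the DDL text and parameter values are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    ddl = ("CREATE EXTERNAL TABLE `${project}.${dataset}.raw_locations`\n"
           "OPTIONS (uris = ['gs://${data_bucket}/raw_locations/*.csv'],\n"
           "         format = 'CSV');")
    (pathlib.Path(tmp) / "01_raw_locations.sql").write_text(ddl)
    statements = render_ddls(tmp, {"project": "my-project",
                                   "dataset": "landing",
                                   "data_bucket": "my-data-bucket"})

print(statements[0].splitlines()[0])
# → CREATE EXTERNAL TABLE `my-project.landing.raw_locations`
```

Numbered file names (`01_...`, `02_...`) give you deterministic execution order, which matters when later DDLs depend on earlier ones.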
| name | description | type | required | default |
|---|---|---|---|---|
| [include_metadata_in_tfe_deployment](terraform/variables.tf#L16) | Controls whether metadata is deployed alongside Terraform resources. If false, metadata can be deployed as a next step in a CI/CD pipeline. | bool | false | - |
| [overwrite_metadata](terraform/variables.tf#L22) | Whether to overwrite existing Dataplex (Cortex Datamesh) metadata. | string | true | false |
| [create_dataform_datasets](terraform/variables.tf#L29) | Controls whether the datasets found in the dataform.json files in the repositories will be created alongside Terraform resources. If false, datasets should be created otherwise. | bool | false | - |
| [create_ddl_buckets_datasets](terraform/variables.tf#L35) | Controls whether the datasets referenced in the GCS DDL buckets will be created alongside Terraform resources. If false, datasets should be created otherwise. | bool | false | - |
| [create_dataform_repositories](terraform/variables.tf#L41) | Controls whether the Dataform scripts found in the repositories will be created alongside Terraform resources. If false, Dataform repositories should be created as an additional step in the CI/CD pipeline. | bool | false | - |
| [compile_dataform_repositories](terraform/variables.tf#L47) | Controls whether the Dataform scripts found in the repositories will be compiled alongside Terraform resources. If false, Dataform repositories should be compiled as an additional step in the CI/CD pipeline. | bool | false | - |
| [execute_dataform_repositories](terraform/variables.tf#L53) | Controls whether the Dataform scripts found in the repositories will be executed alongside Terraform resources. If false, Dataform repositories should be executed as an additional step in the CI/CD pipeline. | bool | false | - |
| [domain](terraform/variables.tf#L59) | Your organization or domain name: the organization name for centralized data management, or the domain name if there is one repository for each data domain in a data mesh environment. | string | true | - |
| [project](terraform/variables.tf#L65) | Project where the Dataform repositories, the Dataplex metadata, and other resources will be created. | string | true | - |
| [region](terraform/variables.tf#L71) | Region where the datasets from the dataform.json files, the Dataform repositories, the Dataplex metadata, and other resources will be created. | string | true | - |
| [dataform_repositories](terraform/variables.tf#L77) | Dataform repository remote settings required to attach the repository to a remote repository. | map(object({...})) | false | {} |
| [dataform_repositories_git_token](terraform/variables.tf#L87) | Git token to access the Dataform repositories. It will be stored as a secret in Secret Manager and used to connect and read the dataform.json to create the datasets. | string (sensitive) | true | - |
| [create_data_buckets](terraform/variables.tf#L94) | Controls whether the referenced data buckets will be created. If false, the referenced buckets should already exist. | bool | false | - |
| [data_buckets](terraform/variables.tf#L100) | Data buckets. | map(object({...})) | false | {} |
| [create_ddl_buckets](terraform/variables.tf#L113) | Controls whether the referenced buckets containing DDLs will be created. If false, the referenced buckets should already exist. | bool | false | - |
| [run_ddls_in_buckets](terraform/variables.tf#L119) | Controls whether the .sql files in the referenced DDL buckets should be run. | bool | false | - |
| [ddl_buckets](terraform/variables.tf#L125) | Buckets containing .sql DDL scripts to be executed on Terraform deploy. It can be of flavors: bigquery, TODO | map(object({...})) | false | {} |

#### Example

```hcl
project                            = ""
region                             = ""
domain                             = ""
include_metadata_in_tfe_deployment = true
create_dataform_repositories       = true
compile_dataform_repositories      = true
execute_dataform_repositories      = true
create_dataform_datasets           = true
dataform_repositories_git_token    = ""
dataform_repositories = {
  sample-repo-1 = {
    remote_repo_url = ""
    secret_name     = ""
  },
  ...
}
create_data_buckets = false
data_buckets = {
  data-bucket-1 = {
    name          = ""
    region        = ""
    project       = ""
    dataplex_lake = ""
    dataplex_zone = ""
  },
  ...
}
create_ddl_buckets  = false
run_ddls_in_buckets = true
ddl_buckets = {
  ddl-bucket-1 = {
    bucket_name          = ""
    bucket_region        = ""
    bucket_project       = ""
    ddl_flavor           = "bigquery"
    ddl_project_id       = ""
    ddl_dataset_id       = ""
    ddl_data_bucket_name = ""
    ddl_connection_name  = ""
  },
  ...
}
```

#### Run

Terraform plan / apply using the variables you defined.

```bash
terraform plan -var-file="prod.tfvars"
```
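Before running `terraform plan`, a quick sanity check that the variables marked required in the table above are assigned in your `.tfvars` can save a failed run. A naive Python sketch — the required set here is an assumption drawn from the table and the example, and the regex is deliberately not a full HCL parser:

```python
import re

# Required variables per the table above (an assumption; adjust to taste).
REQUIRED = {"project", "region", "domain", "dataform_repositories_git_token"}

def missing_required_vars(tfvars_text):
    """Return required variable names with no `name =` assignment in the text.

    Naive by design: the regex also matches keys nested inside blocks,
    so treat this as a pre-flight sanity check, not an HCL parser.
    """
    assigned = set(re.findall(r"^\s*(\w+)\s*=", tfvars_text, flags=re.MULTILINE))
    return REQUIRED - assigned

# Hypothetical, incomplete tfvars content for illustration.
sample = 'project = "my-project"\nregion  = "us-central1"\n'
print(sorted(missing_required_vars(sample)))
# → ['dataform_repositories_git_token', 'domain']
```

An empty result means every required variable is at least assigned; Terraform itself remains the authority on whether the values are valid.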