Module Structure

terraform-aws-arc-glue


Overview

SourceFuse AWS Reference Architecture (ARC) Terraform module for managing AWS Glue resources, providing a comprehensive, production-ready solution for deploying data catalog, ETL jobs, crawlers, workflows, and related components with enterprise-grade security and operational best practices.

Module Features

This module provides a complete AWS Glue infrastructure including:

  • Data Catalog Management: Automated database and table creation with cross-account access policies
  • ETL Job Orchestration: Support for Spark, Python Shell, and Ray jobs with configurable worker types
  • Data Discovery: Multi-source crawlers (S3, JDBC, MongoDB, Delta Lake) with scheduling
  • Workflow Automation: Complex workflow orchestration with triggers and dependencies
  • Security & Compliance: KMS encryption, IAM role management, VPC integration, and secret management
  • Enterprise Integration: JDBC connections for RDS/Redshift, MongoDB, and other data sources
  • Monitoring & Logging: CloudWatch integration, job bookmarks, and execution tracking

Usage

To see a full example, check out the complete example or simple example files in the example folder.

module "glue" {
  source = "sourcefuse/arc-glue/aws"

  namespace   = "mycompany"
  environment = "prod"
  name        = "data-lake"
  region      = "us-east-1"

  glue_config = {
    database = {
      create = true
      name   = "enterprise_database"
    }
  }

  # Crawlers and jobs are separate inputs (glue_crawlers, glue_jobs), not keys of glue_config
  glue_crawlers = {
    "s3-data-lake" = {
      database_name = "enterprise_database"
      role_arn      = aws_iam_role.glue.arn
      targets = {
        s3_targets = [{
          path = "s3://my-data-bucket/raw/"
        }]
      }
      schedule = "cron(0 2 * * ? *)" # Daily at 2 AM UTC
    }
  }

  glue_jobs = {
    "transform-job" = {
      role_arn     = aws_iam_role.glue.arn
      glue_version = "4.0"
      command = {
        name            = "glueetl"
        script_location = "s3://my-scripts/transform.py"
      }
      worker_type       = "G.2X"
      number_of_workers = 10
    }
  }

  tags = {
    Project     = "Data Lake"
    CostCenter  = "Analytics"
    Compliance  = "HIPAA"
  }
}

Requirements

Name Version
terraform >= 1.5.0
aws >= 5.0, < 7.0

Providers

Name Version
aws 6.40.0

Modules

No modules.

Resources

Name Type
aws_glue_catalog_database.main resource
aws_glue_classifier.csv resource
aws_glue_classifier.grok resource
aws_glue_classifier.json resource
aws_glue_classifier.xml resource
aws_glue_connection.main resource
aws_glue_crawler.main resource
aws_glue_job.main resource
aws_glue_security_configuration.main resource
aws_glue_trigger.main resource
aws_glue_workflow.main resource
aws_iam_role.glue resource
aws_iam_role_policy_attachment.glue_basic resource
aws_iam_role_policy_attachment.glue_custom resource
aws_iam_role_policy_attachment.glue_s3 resource
aws_secretsmanager_secret.main resource
aws_secretsmanager_secret_version.main resource
aws_caller_identity.current data source
aws_iam_policy_document.assume_role data source

Inputs

Name Description Type Default Required
environment Environment identifier (e.g., dev, staging, prod) string n/a yes
glue_config AWS Glue configuration
object({
create = optional(bool, true)

# Glue Catalog Database
database = optional(object({
create = optional(bool, true)
name = optional(string, "default_database")
description = optional(string, "Default Glue database")
create_table_default_permission = optional(object({
create = optional(bool, false)
permissions = optional(list(object({
principal = map(string)
permissions = list(string)
})), [])
}), {})
}), {})

# Glue Workflows
workflows = optional(map(object({
description = optional(string, "")
max_concurrent_runs = optional(number, null)
})), {})

# Glue Triggers
triggers = optional(map(object({
description = optional(string, "")
workflow_name = optional(string, null)
type = string # SCHEDULED, CONDITIONAL, EVENT_DATA, ON_DEMAND
schedule = optional(string, null)
predicate = optional(object({
logical = optional(string, "AND")
conditions = list(object({
job_name = optional(string, null)
crawler_name = optional(string, null)
state = optional(string, null)
crawl_state = optional(string, null)
}))
}), null)
actions = list(object({
job_name = optional(string, null)
arguments = optional(map(string), null)
timeout = optional(number, null)
crawler_name = optional(string, null)
}))
event_batching_condition = optional(object({
batch_window = optional(number, null)
batch_size = optional(number, null)
}), null)
})), {})

# Glue Classifiers
classifiers = optional(map(object({
grok_classifier = optional(object({
name = string
classification = string
grok_pattern = string
custom_patterns = optional(map(string), {})
}), null)
json_classifier = optional(object({
name = string
json_path = string
}), null)
xml_classifier = optional(object({
name = string
classification = string
row_tag = string
}), null)
csv_classifier = optional(object({
name = string
delimiter = optional(string, ",")
quote_char = optional(string, "\"")
contains_header = optional(string, "UNKNOWN") # PRESENT, ABSENT, UNKNOWN
header = optional(list(string), [])
disable_value_trimming = optional(bool, false)
allow_single_quotes = optional(bool, false)
}), null)
})), {})

# Glue Dev Endpoints
dev_endpoints = optional(map(object({
description = optional(string, "")
role_arn = optional(string, null)
public_key = string
number_of_nodes = optional(number, 5)
worker_type = optional(string, "G.1X") # Standard, G.1X, G.2X
glue_version = optional(string, "2.0")
number_of_workers = optional(number, 2)
extra_python_libs_s3_path = optional(string, null)
extra_jars_s3_path = optional(string, null)
security_configuration = optional(string, null)
})), {})

# Glue Security Configurations
security_configurations = optional(map(object({
encryption_configuration = object({
s3_encryption = optional(object({
s3_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, SSE-S3, DISABLED
kms_key_arn = optional(string, null)
}), {})
cloudwatch_encryption = optional(object({
cloudwatch_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, DISABLED
kms_key_arn = optional(string, null)
}), {})
job_bookmarks_encryption = optional(object({
job_bookmarks_encryption_mode = optional(string, "CSE-KMS") # CSE-KMS, DISABLED
kms_key_arn = optional(string, null)
}), {})
})
})), {})

# Data Catalog Encryption Settings
catalog_encryption_settings = optional(object({
create = optional(bool, false)
connection_password_encryption = optional(bool, true)
at_rest_encryption = optional(object({
catalog_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, SSE-KMS-DIRECT-QUERY, DISABLED
kms_key_arn = optional(string, null)
}), {})
s3_encryption = optional(object({
s3_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, SSE-S3, DISABLED
kms_key_arn = optional(string, null)
}), {})
}), {})

# Data Catalog Resource Policy
catalog_resource_policy = optional(object({
create = optional(bool, false)
policy = optional(string, "")
description = optional(string, "Glue Data Catalog Resource Policy")
}), {})

# Glue Partition Index
partition_indexes = optional(map(object({
database_name = string
table_name = string
index_name = string
keys = list(string)
})), {})
})
{} no
glue_connections Glue connections to create. Kept separate from glue_config to avoid for_each unknown value issues when connection_properties contain apply-time values.
map(object({
connection_type = string
description = optional(string, "")
connection_properties = map(string)
physical_connection_requirements = optional(object({
availability_zone = optional(string, null)
subnet_id = optional(string, null)
security_group_id_list = optional(list(string), [])
}), null)
}))
{} no
glue_crawlers Glue crawlers. Kept separate from glue_config to avoid for_each unknown-value issues when targets contain apply-time values.
map(object({
database_name = string
description = optional(string, "")
role_arn = optional(string, null)
schedule = optional(string, null)
classifiers = optional(list(string), [])
configuration = optional(string, null)
table_prefix = optional(string, "")
schema_change_policy = optional(object({
delete_behavior = optional(string, "LOG")
update_behavior = optional(string, "UPDATE_IN_DATABASE")
}), {})
recrawl_policy = optional(object({
recrawl_behavior = optional(string, "CRAWL_NEW_FOLDERS_ONLY")
}), {})
lineage_configuration = optional(object({
crawler_lineage_settings = optional(string, "ENABLE")
}), {})
targets = object({
s3_targets = optional(list(object({
path = string
exclusions = optional(list(string), [])
connection_name = optional(string, null)
sample_size = optional(number, null)
event_queue_arn = optional(string, null)
dlq_event_queue_arn = optional(string, null)
})), [])
jdbc_targets = optional(list(object({
connection_name = string
path = optional(string, null)
exclusions = optional(list(string), [])
})), [])
mongo_db_targets = optional(list(object({
connection_name = string
path = optional(string, null)
scan_all = optional(bool, null)
})), [])
delta_targets = optional(list(object({
connection_name = optional(string, null)
delta_tables = optional(list(string), [])
write_manifest = optional(bool, null)
})), [])
catalog_targets = optional(list(object({
database_name = string
tables = list(string)
event_queue_arn = optional(string, null)
dlq_event_queue_arn = optional(string, null)
})), [])
})
}))
{} no
glue_jobs Glue jobs. Kept separate from glue_config to avoid for_each unknown-value issues when script_location contains apply-time values.
map(object({
description = optional(string, "")
role_arn = optional(string, null)
command = object({
name = string
script_location = string
python_version = optional(string, "3")
runtime = optional(string, null)
})
default_arguments = optional(map(string), {})
non_overridable_arguments = optional(map(string), {})
execution_property = optional(object({
max_concurrent_runs = optional(number, 1)
}), {})
max_retries = optional(number, 0)
timeout = optional(number, null)
max_capacity = optional(number, null)
number_of_workers = optional(number, null)
worker_type = optional(string, null)
glue_version = optional(string, "4.0")
execution_class = optional(string, null)
}))
{} no
iam_config IAM configuration for Glue resources
object({
create_role = optional(bool, true)
role_name = optional(string, null)
role_description = optional(string, "AWS Glue IAM Role")
role_policies = optional(map(string), {}) # Map of policy names to policy ARNs
create_custom_policy = optional(bool, false)
custom_policy_name = optional(string, null)
custom_policy_document = optional(string, null)
permissions_boundary = optional(string, null)
trusted_role_arns = optional(list(string), [])
})
{} no
kms_key_arn Optional KMS key ARN to use for Glue encryption. If not provided, module will create its own KMS key but won't use it for security configurations to avoid circular dependencies. string null no
name Name prefix for AWS Glue resources string n/a yes
namespace Namespace (organization) identifier for resources string n/a yes
region AWS region where resources will be created string "us-east-1" no
secrets_config Secrets Manager configuration for storing credentials
object({
secrets = optional(map(object({
name = optional(string, null)
description = optional(string, "")
secret_string = optional(string, null)
})), {})
})
{} no
tags Default tags to apply to all resources map(string) {} no

Outputs

Name Description
aws_account_id The AWS account ID where resources are created
glue_connection_names Map of connection key to name
glue_crawler_arns Map of crawler name to ARN
glue_crawler_names Map of crawler key to name
glue_database_arn The ARN of the Glue catalog database
glue_database_id The ID of the Glue catalog database
glue_database_name The name of the Glue catalog database
glue_iam_role_arn The ARN of the Glue IAM role
glue_iam_role_id The ID of the Glue IAM role
glue_iam_role_name The name of the Glue IAM role
glue_job_arns Map of job key to ARN
glue_job_names Map of job key to name
glue_secret_arns Map of secret key to ARN
glue_security_configurations Map of security configuration key to name
glue_workflows Map of workflow key to workflow object
module_version The version of this module
resource_prefix The resource prefix used for naming
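
The outputs above can be consumed from the calling configuration. The sketch below assumes the module is instantiated as module "glue" with a job keyed "transform-job", as in the Usage section; the output names on the left-hand side are illustrative only.

```hcl
# Expose the IAM role created by the module so other stacks can reference it
output "glue_role_arn" {
  value = module.glue.glue_iam_role_arn
}

# glue_job_arns is a map of job key to ARN, so look up a single job by its key
output "transform_job_arn" {
  value = module.glue.glue_job_arns["transform-job"]
}
```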

Examples

Simple Example

Basic configuration with database and S3 crawler:

module "glue_simple" {
  source = "git::https://github.com/sourcefuse/terraform-aws-arc-glue.git"

  namespace   = "mycompany"
  environment = "dev"
  name        = "simple-glue"
  region      = "us-east-1"

  glue_config = {
    database = {
      create = true
      name   = "simple_database"
    }
  }

  glue_crawlers = {
    "s3-crawler" = {
      database_name = "simple_database"
      role_arn      = aws_iam_role.glue.arn
      targets = {
        s3_targets = [{
          path = "s3://my-data-bucket/"
        }]
      }
    }
  }
}

Complete Example

Enterprise configuration with jobs, workflows, and connections:

module "glue_complete" {
  source = "git::https://github.com/sourcefuse/terraform-aws-arc-glue.git"

  namespace   = "mycompany"
  environment = "prod"
  name        = "enterprise-glue"
  region      = "us-east-1"

  iam_config = {
    create_role = true
    role_policies = {
      "AmazonS3FullAccess" = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
      "AmazonAthenaFullAccess" = "arn:aws:iam::aws:policy/AmazonAthenaFullAccess"
    }
  }

  glue_config = {
    database = {
      create = true
      name   = "enterprise_database"
    }

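    # security_configurations is a map keyed by name; each entry takes an
    # encryption_configuration object (unset fields use the defaults in Inputs)
    security_configurations = {
      "enterprise_security_config" = {
        encryption_configuration = {}
      }
    }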
  }

  glue_jobs = {
    "etl-job" = {
      role_arn     = aws_iam_role.glue.arn
      glue_version = "4.0"
      command = {
        name            = "glueetl"
        script_location = "s3://my-scripts/etl.py"
      }
      worker_type       = "G.2X"
      number_of_workers = 20
      max_capacity      = null
    }
  }

  glue_crawlers = {
    "s3-raw-data" = {
      database_name = "enterprise_database"
      role_arn      = aws_iam_role.glue.arn
      targets = {
        s3_targets = [{
          path = "s3://raw-data-bucket/"
        }]
      }
      schedule = "cron(0 1 * * ? *)"
    }

    "jdbc-source" = {
      database_name = "enterprise_database"
      role_arn      = aws_iam_role.glue.arn
      targets = {
        jdbc_targets = [{
          connection_name = "rds-connection"
          path            = "testdb/%"
        }]
      }
    }
  }

  glue_connections = {
    "rds-connection" = {
      connection_type = "JDBC"
      connection_properties = {
        JDBC_CONNECTION_URL = "jdbc:postgresql://${aws_rds_cluster.main.endpoint}:5432/testdb"
        USERNAME            = "admin"
        PASSWORD            = aws_secretsmanager_secret_version.rds_password.secret_string
      }
      physical_connection_requirements = {
        security_group_id_list = [aws_security_group.glue.id]
      }
    }
  }
}

Module Configuration Details

IAM Configuration

The module can create and manage IAM roles with appropriate permissions:

iam_config = {
  create_role = true
  role_name   = "custom-glue-role"
  role_description = "Custom Glue execution role"

  # Add custom policies
  role_policies = {
    "CustomS3Access" = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
  }

  # Add permissions boundary for security
  permissions_boundary = "arn:aws:iam::123456789012:policy/GluePermissionsBoundary"

  # Enable cross-account access
  trusted_role_arns = [
    "arn:aws:iam::123456789012:root"
  ]
}
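
The Inputs table also lists create_custom_policy, custom_policy_name, and custom_policy_document. A minimal sketch of how they might be combined, assuming the module accepts the policy as a JSON string; the bucket name, policy name, and data source label are hypothetical:

```hcl
# Hypothetical policy granting read access to one raw-data bucket
data "aws_iam_policy_document" "glue_extra" {
  statement {
    sid     = "ReadRawData"
    actions = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::my-data-bucket",
      "arn:aws:s3:::my-data-bucket/*",
    ]
  }
}

# Inside the module block:
iam_config = {
  create_role            = true
  create_custom_policy   = true
  custom_policy_name     = "glue-raw-data-read"
  custom_policy_document = data.aws_iam_policy_document.glue_extra.json
}
```

Scoping the policy to specific buckets keeps the role closer to least privilege than attaching AmazonS3FullAccess.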

VPC Configuration

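For private Glue jobs and connections. Note that the Inputs table above does not list vpc_config; subnet and security group placement for connections is set through physical_connection_requirements in glue_connections.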

vpc_config = {
  vpc_id             = "vpc-12345678"
  security_group_name = "glue-security-group"

  # Optional subnet configuration
  subnet_ids = ["subnet-12345", "subnet-67890"]

  # Security group rules
  ingress_rules = [{
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["10.0.0.0/8"]
  }]
}
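
Going by the Inputs table, a connection can be placed in a private subnet through physical_connection_requirements. A minimal sketch, in which the subnet, security group, variables, and database host are hypothetical and defined elsewhere in your configuration:

```hcl
glue_connections = {
  "private-rds" = {
    connection_type = "JDBC"
    connection_properties = {
      # Placeholder host; replace with your database endpoint
      JDBC_CONNECTION_URL = "jdbc:postgresql://db.example.internal:5432/testdb"
      USERNAME            = var.db_username # hypothetical variable
      PASSWORD            = var.db_password # hypothetical variable
    }

    # Run connection traffic from a private subnet behind a dedicated security group
    physical_connection_requirements = {
      availability_zone      = aws_subnet.private.availability_zone
      subnet_id              = aws_subnet.private.id
      security_group_id_list = [aws_security_group.glue.id]
    }
  }
}
```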

Glue Job Types

The module supports multiple Glue job types:

Spark ETL Jobs:

spark_job = {
  role_arn     = aws_iam_role.glue.arn
  glue_version = "4.0"
  command = {
    name            = "glueetl"
    script_location = "s3://scripts/transform.py"
  }
  worker_type       = "G.2X" # Spark: G.1X, G.2X, G.4X, G.8X (Z.2X is for Ray jobs)
  number_of_workers = 10
}

Python Shell Jobs:

python_job = {
  role_arn     = aws_iam_role.glue.arn
  glue_version = "1.0"
  command = {
    name            = "pythonshell"
    script_location = "s3://scripts/process.py"
  }
  max_capacity = 0.0625  # DPU for Python shell
}

Workflow Orchestration

Complex workflow orchestration with triggers:

glue_config = {
  workflows = {
    "data-pipeline" = {
      description = "End-to-end data processing pipeline"
      max_concurrent_runs = 1
    }
  }

  triggers = {
    "start-crawler" = {
      workflow_name = "data-pipeline"
      type          = "SCHEDULED"
      schedule      = "cron(0 2 * * ? *)"
      actions       = [{ crawler_name = "s3-raw-data" }]
    }

    "start-etl" = {
      workflow_name = "data-pipeline"
      type          = "CONDITIONAL"
      # "s3-raw-data" is a crawler, so the condition uses crawler_name and crawl_state
      predicate = {
        conditions = [{
          crawler_name = "s3-raw-data"
          crawl_state  = "SUCCEEDED"
        }]
      }
      actions = [{ job_name = "etl-job" }]
    }
  }
}

Security Considerations

Encryption

The module supports comprehensive encryption:

# Use custom KMS key
kms_key_arn = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"

# Configure security configuration
glue_config = {
  # Map keyed by configuration name; settings go under encryption_configuration
  security_configurations = {
    "enterprise_security_config" = {
      encryption_configuration = {
        s3_encryption = {
          s3_encryption_mode = "SSE-KMS"
          kms_key_arn        = "arn:aws:kms:us-east-1:123456789012:key/12345678"
        }
        cloudwatch_encryption = {
          cloudwatch_encryption_mode = "SSE-KMS"
          kms_key_arn                = "arn:aws:kms:us-east-1:123456789012:key/12345678"
        }
      }
    }
  }
}
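
The Inputs table also describes catalog_encryption_settings for encrypting the Data Catalog itself. A minimal sketch based on that schema; the key ARN is the same placeholder used above, and the module's exact behavior for these settings is an assumption:

```hcl
glue_config = {
  catalog_encryption_settings = {
    create = true

    # Encrypt connection passwords stored in the catalog
    connection_password_encryption = true

    # Encrypt catalog metadata at rest with a customer-managed key
    at_rest_encryption = {
      catalog_encryption_mode = "SSE-KMS"
      kms_key_arn             = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
    }
  }
}
```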

Network Security

  • VPC Integration: Deploy Glue resources in private VPC subnets
  • Security Groups: Control inbound/outbound traffic
  • Endpoints: Use VPC endpoints for private connectivity
  • IAM Policies: Implement least-privilege access

Troubleshooting

Common Issues

1. Crawler Timeout

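glue_crawlers = {
  "my-crawler" = {
    database_name = "simple_database"
    # The glue_crawlers input has no timeouts attribute; for large datasets,
    # shorten the crawl by sampling files and excluding paths instead
    targets = {
      s3_targets = [{
        path        = "s3://my-data-bucket/"
        sample_size = 10
        exclusions  = ["**/_temporary/**"]
      }]
    }
  }
}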

2. Job Failures

  • Check CloudWatch Logs: /aws-glue/jobs/output
  • Verify IAM permissions: S3, CloudWatch, Glue
  • Validate script locations in S3
  • Review security group rules for network access

3. Connection Issues

  • Verify VPC endpoints for JDBC connections
  • Check security group ingress/egress rules
  • Validate credentials in Secrets Manager
  • Test connectivity from Glue to data source

Cost Optimization

Worker Type Selection

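  • Standard: Legacy worker type for older Glue versions
  • G.1X: 1 DPU per worker (4 vCPUs, 16 GB memory); a cost-effective default for most transforms, joins, and queries
  • G.2X: 2 DPUs per worker (8 vCPUs, 32 GB memory); for larger or more memory-intensive workloads
  • Z.2X: 2 M-DPUs per worker (8 vCPUs, 64 GB memory); used for Ray jobs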

Job Configuration

glue_jobs = {
  "cost-optimized-job" = {
    # Use execution class for cost savings
    execution_class = "FLEX" # STANDARD or FLEX

    # Configure timeouts to prevent runaway costs
    timeout = 60 # minutes

    command = {
      name            = "glueetl"
      script_location = "s3://scripts/incremental.py"
    }

    # Enable job bookmarks for incremental processing
    default_arguments = {
      "--job-bookmark-option" = "job-bookmark-enable"
    }

    # Reduce workers for testing
    number_of_workers = 2
  }
}

Best Practices

  1. Resource Naming: Use consistent, descriptive resource names
  2. Tagging Strategy: Implement comprehensive tagging for cost management
  3. Incremental Processing: Use job bookmarks for efficiency
  4. Testing: Test jobs in development environment before production
  5. Monitoring: Set up CloudWatch alarms and metrics
  6. Security: Regularly audit IAM permissions and security groups
  7. Version Control: Store Glue scripts in version control
  8. Documentation: Document job logic and data transformations

Development

Prerequisites

Configurations

  • Configure pre-commit hooks
    pre-commit install
    

Versioning

When contributing, indicate in your commit message whether the change is major, minor, or patch.

For example:

git commit -m "your commit message #major"
This tag determines how the version is bumped. If you omit it, the change is treated as a patch and the version is bumped accordingly.

Tests

  • Tests are available in test directory
  • Configure the dependencies
    cd test/
    go mod init github.com/sourcefuse/terraform-aws-refarch-<module_name>
    go get github.com/gruntwork-io/terratest/modules/terraform
    
  • Now execute the test
    go test -timeout 30m
    

Authors

This project is authored by:

  • SourceFuse ARC Team