terraform-aws-arc-glue ¶

Module: sourcefuse/arc-glue/aws

Registry: https://registry.terraform.io/modules/sourcefuse/arc-glue/aws

Category: Analytics / ETL

Source: https://github.com/sourcefuse/terraform-aws-arc-glue

Tip

🤖 New: Use this module with AI assistants via the ARC IaC MCP Server — search, scaffold, and security-scan ARC modules from natural language. Quick setup ↓

Overview¶

Manages AWS Glue resources — data catalog databases, ETL jobs, crawlers, workflows, connections, and triggers — for data lake and ETL pipelines.

Architecture¶

Architecture Diagram

What It Does¶

Glue Data Catalog databases and tables
ETL jobs (Spark, Python Shell, Ray) with configurable workers
Crawlers for automatic schema discovery from S3, JDBC, and more
Workflows and triggers for orchestration
Glue connections for JDBC and network sources
IAM roles with least-privilege policies
CloudWatch metrics and job bookmarks

For more information about this repository and its usage, please see Terraform AWS GLUE Usage Guide.

Quickstart¶

module "glue" {
  source = "sourcefuse/arc-glue/aws"

  namespace   = "mycompany"
  environment = "prod"
  name        = "data-lake"
  region      = "us-east-1"

  glue_config = {
    database = {
      create = true
      name   = "enterprise_database"
    }
  }

  # Crawlers and jobs are top-level inputs (glue_crawlers, glue_jobs),
  # not keys of glue_config.
  glue_crawlers = {
    "s3-data-lake" = {
      database_name = "enterprise_database"
      role_arn      = aws_iam_role.glue.arn # role defined elsewhere in your configuration
      targets = {
        s3_targets = [{
          path = "s3://my-data-bucket/raw/"
        }]
      }
      schedule = "cron(0 2 * * ? *)" # Daily at 02:00 UTC
    }
  }

  glue_jobs = {
    "transform-job" = {
      role_arn     = aws_iam_role.glue.arn
      glue_version = "4.0"
      command = {
        name            = "glueetl"
        script_location = "s3://my-scripts/transform.py"
      }
      worker_type       = "G.2X"
      number_of_workers = 10
    }
  }

  tags = {
    Project    = "Data Lake"
    CostCenter = "Analytics"
    Compliance = "HIPAA"
  }
}

Required Inputs¶

Name	Type	Description
`namespace`	`string`	Namespace prefix
`environment`	`string`	Deployment environment
`name`	`string`	Component name

`region` (`string`, AWS region) is optional and defaults to `"us-east-1"`, so it is not a required input.

Key Outputs¶

Name	Description
`glue_database_name`	Glue catalog database name
`glue_job_names`	Map of job key to Glue job name
`glue_crawler_names`	Map of crawler key to crawler name
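
As a sketch of how these outputs might be consumed, the snippet below re-exports two of them from a root module. It assumes the module block is labelled `glue` as in the Quickstart and that a job was declared under the key `transform-job`. The output labels `catalog_database` and `transform_job_name` are illustrative; `glue_database_name` and `glue_job_names` are taken from the Outputs reference on this page.

```hcl
# Re-export the catalog database name for other stacks (e.g. Athena).
output "catalog_database" {
  description = "Name of the Glue catalog database created by the module"
  value       = module.glue.glue_database_name
}

# glue_job_names is a map keyed by the job key used in glue_jobs.
output "transform_job_name" {
  description = "Deployed name of the transform job"
  value       = module.glue.glue_job_names["transform-job"]
}
```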

Full Variable & Output Reference¶

The complete inputs/outputs reference is auto-generated below.

Requirements¶

Name	Version
terraform	>= 1.5.0
aws	>= 5.0, < 7.0

Providers¶

Name	Version
aws	6.40.0

Modules¶

No modules.

Resources¶

Name	Type
aws_glue_catalog_database.main	resource
aws_glue_classifier.csv	resource
aws_glue_classifier.grok	resource
aws_glue_classifier.json	resource
aws_glue_classifier.xml	resource
aws_glue_connection.main	resource
aws_glue_crawler.main	resource
aws_glue_job.main	resource
aws_glue_security_configuration.main	resource
aws_glue_trigger.main	resource
aws_glue_workflow.main	resource
aws_iam_role.glue	resource
aws_iam_role_policy_attachment.glue_basic	resource
aws_iam_role_policy_attachment.glue_custom	resource
aws_iam_role_policy_attachment.glue_s3	resource
aws_secretsmanager_secret.main	resource
aws_secretsmanager_secret_version.main	resource
aws_caller_identity.current	data source
aws_iam_policy_document.assume_role	data source

Inputs¶

Name	Description	Type	Default	Required
environment	Environment identifier (e.g., dev, staging, prod)	`string`	n/a	yes
glue_config	AWS Glue configuration	object({ create = optional(bool, true) # Glue Catalog Database database = optional(object({ create = optional(bool, true) name = optional(string, "default_database") description = optional(string, "Default Glue database") create_table_default_permission = optional(object({ create = optional(bool, false) permissions = optional(list(object({ principal = map(string) permissions = list(string) })), []) }), {}) }), {}) # Glue Workflows workflows = optional(map(object({ description = optional(string, "") max_concurrent_runs = optional(number, null) })), {}) # Glue Triggers triggers = optional(map(object({ description = optional(string, "") workflow_name = optional(string, null) type = string # SCHEDULED, CONDITIONAL, EVENT_DATA, ON_DEMAND schedule = optional(string, null) predicate = optional(object({ logical = optional(string, "AND") conditions = list(object({ job_name = optional(string, null) crawler_name = optional(string, null) state = optional(string, null) crawl_state = optional(string, null) })) }), null) actions = list(object({ job_name = optional(string, null) arguments = optional(map(string), null) timeout = optional(number, null) crawler_name = optional(string, null) })) event_batching_condition = optional(object({ batch_window = optional(number, null) batch_size = optional(number, null) }), null) })), {}) # Glue Classifiers classifiers = optional(map(object({ grok_classifier = optional(object({ name = string classification = string grok_pattern = string custom_patterns = optional(map(string), {}) }), null) json_classifier = optional(object({ name = string json_path = string }), null) xml_classifier = optional(object({ name = string classification = string row_tag = string }), null) csv_classifier = optional(object({ name = string delimiter = optional(string, ",") quote_char = optional(string, "\"") contains_header = optional(string, "UNKNOWN") # PRESENT, ABSENT, UNKNOWN header = optional(list(string), []) disable_value_trimming = optional(bool, false) allow_single_quotes = optional(bool, false) }), null) })), {}) # Glue Dev Endpoints dev_endpoints = optional(map(object({ description = optional(string, "") role_arn = optional(string, null) public_key = string number_of_nodes = optional(number, 5) worker_type = optional(string, "G.1X") # Standard, G.1X, G.2X glue_version = optional(string, "2.0") number_of_workers = optional(number, 2) extra_python_libs_s3_path = optional(string, null) extra_jars_s3_path = optional(string, null) security_configuration = optional(string, null) })), {}) # Glue Security Configurations security_configurations = optional(map(object({ encryption_configuration = object({ s3_encryption = optional(object({ s3_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, SSE-S3, DISABLED kms_key_arn = optional(string, null) }), {}) cloudwatch_encryption = optional(object({ cloudwatch_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, DISABLED kms_key_arn = optional(string, null) }), {}) job_bookmarks_encryption = optional(object({ job_bookmarks_encryption_mode = optional(string, "CSE-KMS") # CSE-KMS, DISABLED kms_key_arn = optional(string, null) }), {}) }) })), {}) # Data Catalog Encryption Settings catalog_encryption_settings = optional(object({ create = optional(bool, false) connection_password_encryption = optional(bool, true) at_rest_encryption = optional(object({ catalog_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, SSE-KMS-DIRECT-QUERY, DISABLED kms_key_arn = optional(string, null) }), {}) s3_encryption = optional(object({ s3_encryption_mode = optional(string, "SSE-KMS") # SSE-KMS, SSE-S3, DISABLED kms_key_arn = optional(string, null) }), {}) }), {}) # Data Catalog Resource Policy catalog_resource_policy = optional(object({ create = optional(bool, false) policy = optional(string, "") description = optional(string, "Glue Data Catalog Resource Policy") }), {}) # Glue Partition Index partition_indexes = optional(map(object({ database_name = string table_name = string index_name = string keys = list(string) })), {}) })	`{}`	no
glue_connections	Glue connections to create. Kept separate from glue_config to avoid for_each unknown value issues when connection_properties contain apply-time values.	map(object({ connection_type = string description = optional(string, "") connection_properties = map(string) physical_connection_requirements = optional(object({ availability_zone = optional(string, null) subnet_id = optional(string, null) security_group_id_list = optional(list(string), []) }), null) }))	`{}`	no
glue_crawlers	Glue crawlers. Kept separate from glue_config to avoid for_each unknown-value issues when targets contain apply-time values.	map(object({ database_name = string description = optional(string, "") role_arn = optional(string, null) schedule = optional(string, null) classifiers = optional(list(string), []) configuration = optional(string, null) table_prefix = optional(string, "") schema_change_policy = optional(object({ delete_behavior = optional(string, "LOG") update_behavior = optional(string, "UPDATE_IN_DATABASE") }), {}) recrawl_policy = optional(object({ recrawl_behavior = optional(string, "CRAWL_NEW_FOLDERS_ONLY") }), {}) lineage_configuration = optional(object({ crawler_lineage_settings = optional(string, "ENABLE") }), {}) targets = object({ s3_targets = optional(list(object({ path = string exclusions = optional(list(string), []) connection_name = optional(string, null) sample_size = optional(number, null) event_queue_arn = optional(string, null) dlq_event_queue_arn = optional(string, null) })), []) jdbc_targets = optional(list(object({ connection_name = string path = optional(string, null) exclusions = optional(list(string), []) })), []) mongo_db_targets = optional(list(object({ connection_name = string path = optional(string, null) scan_all = optional(bool, null) })), []) delta_targets = optional(list(object({ connection_name = optional(string, null) delta_tables = optional(list(string), []) write_manifest = optional(bool, null) })), []) catalog_targets = optional(list(object({ database_name = string tables = list(string) event_queue_arn = optional(string, null) dlq_event_queue_arn = optional(string, null) })), []) }) }))	`{}`	no
glue_jobs	Glue jobs. Kept separate from glue_config to avoid for_each unknown-value issues when script_location contains apply-time values.	map(object({ description = optional(string, "") role_arn = optional(string, null) command = object({ name = string script_location = string python_version = optional(string, "3") runtime = optional(string, null) }) default_arguments = optional(map(string), {}) non_overridable_arguments = optional(map(string), {}) execution_property = optional(object({ max_concurrent_runs = optional(number, 1) }), {}) max_retries = optional(number, 0) timeout = optional(number, null) max_capacity = optional(number, null) number_of_workers = optional(number, null) worker_type = optional(string, null) glue_version = optional(string, "4.0") execution_class = optional(string, null) }))	`{}`	no
iam_config	IAM configuration for Glue resources	object({ create_role = optional(bool, true) role_name = optional(string, null) role_description = optional(string, "AWS Glue IAM Role") role_policies = optional(map(string), {}) # Map of policy names to policy ARNs create_custom_policy = optional(bool, false) custom_policy_name = optional(string, null) custom_policy_document = optional(string, null) permissions_boundary = optional(string, null) trusted_role_arns = optional(list(string), []) })	`{}`	no
kms_key_arn	Optional KMS key ARN to use for Glue encryption. If not provided, module will create its own KMS key but won't use it for security configurations to avoid circular dependencies.	`string`	`null`	no
name	Name prefix for AWS Glue resources	`string`	n/a	yes
namespace	Namespace (organization) identifier for resources	`string`	n/a	yes
region	AWS region where resources will be created	`string`	`"us-east-1"`	no
secrets_config	Secrets Manager configuration for storing credentials	object({ secrets = optional(map(object({ name = optional(string, null) description = optional(string, "") secret_string = optional(string, null) })), {}) })	`{}`	no
tags	Default tags to apply to all resources	`map(string)`	`{}`	no

Outputs¶

Name	Description
aws_account_id	The AWS account ID where resources are created
glue_connection_names	Map of connection key to name
glue_crawler_arns	Map of crawler name to ARN
glue_crawler_names	Map of crawler key to name
glue_database_arn	The ARN of the Glue catalog database
glue_database_id	The ID of the Glue catalog database
glue_database_name	The name of the Glue catalog database
glue_iam_role_arn	The ARN of the Glue IAM role
glue_iam_role_id	The ID of the Glue IAM role
glue_iam_role_name	The name of the Glue IAM role
glue_job_arns	Map of job key to ARN
glue_job_names	Map of job key to name
glue_secret_arns	Map of secret key to ARN
glue_security_configurations	Map of security configuration key to name
glue_workflows	Map of workflow key to workflow object
module_version	The version of this module
resource_prefix	The resource prefix used for naming

Examples¶

Simple Example¶

Basic configuration with database and S3 crawler:

module "glue_simple" {
  source = "git::https://github.com/sourcefuse/terraform-aws-arc-glue.git"

  namespace   = "mycompany"
  environment = "dev"
  name        = "simple-glue"
  region      = "us-east-1"

  glue_config = {
    database = {
      create = true
      name   = "simple_database"
    }
  }

  glue_crawlers = {
    "s3-crawler" = {
      database_name = "simple_database"
      role_arn      = aws_iam_role.glue.arn
      targets = {
        s3_targets = [{
          path = "s3://my-data-bucket/"
        }]
      }
    }
  }
}
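
The examples on this page reference `aws_iam_role.glue.arn` without defining it. The sketch below shows one way such a role might be declared in the calling configuration. The resource labels and the role name are illustrative, and attaching the AWS managed `AWSGlueServiceRole` policy is an assumption about what the crawler needs; S3 access to the crawled bucket would still have to be granted separately. Because `role_arn` is optional in the Inputs table, this role is only needed when you pass one explicitly.

```hcl
# Trust policy that lets the Glue service assume the role.
data "aws_iam_policy_document" "glue_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "glue" {
  name               = "simple-glue-crawler-role" # illustrative name
  assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}

# AWS managed baseline policy for Glue crawlers and jobs.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
```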

Complete Example¶

Enterprise configuration with jobs, workflows, and connections:

module "glue_complete" {
  source = "git::https://github.com/sourcefuse/terraform-aws-arc-glue.git"

  namespace   = "mycompany"
  environment = "prod"
  name        = "enterprise-glue"
  region      = "us-east-1"

  iam_config = {
    create_role = true
    role_policies = {
      "AmazonS3FullAccess" = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
      "AmazonAthenaFullAccess" = "arn:aws:iam::aws:policy/AmazonAthenaFullAccess"
    }
  }

  glue_config = {
    database = {
      create = true
      name   = "enterprise_database"
    }

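    # security_configurations is a map; the ARN below is a placeholder
    security_configurations = {
      "enterprise_security_config" = {
        encryption_configuration = {
          s3_encryption = {
            s3_encryption_mode = "SSE-KMS"
            kms_key_arn        = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
          }
          cloudwatch_encryption = {
            cloudwatch_encryption_mode = "SSE-KMS"
            kms_key_arn                = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
          }
          job_bookmarks_encryption = {
            job_bookmarks_encryption_mode = "CSE-KMS"
            kms_key_arn                   = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
          }
        }
      }
    }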
  }

  glue_jobs = {
    "etl-job" = {
      role_arn     = aws_iam_role.glue.arn
      glue_version = "4.0"
      command = {
        name            = "glueetl"
        script_location = "s3://my-scripts/etl.py"
      }
      worker_type       = "G.2X"
      number_of_workers = 20
      max_capacity      = null
    }
  }

  glue_crawlers = {
    "s3-raw-data" = {
      database_name = "enterprise_database"
      role_arn      = aws_iam_role.glue.arn
      targets = {
        s3_targets = [{
          path = "s3://raw-data-bucket/"
        }]
      }
      schedule = "cron(0 1 * * ? *)"
    }

    "jdbc-source" = {
      database_name = "enterprise_database"
      role_arn      = aws_iam_role.glue.arn
      targets = {
        jdbc_targets = [{
          connection_name = "rds-connection"
          path            = "testdb/%"
        }]
      }
    }
  }

  glue_connections = {
    "rds-connection" = {
      connection_type = "JDBC"
      connection_properties = {
        JDBC_CONNECTION_URL = "jdbc:postgresql://${aws_rds_cluster.main.endpoint}:5432/testdb"
        USERNAME            = "admin"
        PASSWORD            = aws_secretsmanager_secret_version.rds_password.secret_string
      }
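      # Network placement lives under physical_connection_requirements;
      # VPC-attached connections usually also need subnet_id and availability_zone
      physical_connection_requirements = {
        security_group_id_list = [aws_security_group.glue.id]
      }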
    }
  }
}

Module Configuration Details¶

IAM Configuration¶

The module creates a Glue IAM role by default (`create_role` defaults to `true`) and attaches the policy ARNs listed in `role_policies`:

iam_config = {
  create_role = true
  role_name   = "custom-glue-role"
  role_description = "Custom Glue execution role"

  # Add custom policies
  role_policies = {
    "CustomS3Access" = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
  }

  # Add permissions boundary for security
  permissions_boundary = "arn:aws:iam::123456789012:policy/GluePermissionsBoundary"

  # Enable cross-account access
  trusted_role_arns = [
    "arn:aws:iam::123456789012:root"
  ]
}
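
The Inputs table also lists `create_custom_policy`, `custom_policy_name`, and `custom_policy_document` under `iam_config`. A minimal sketch of using them for bucket-scoped access follows, assuming `custom_policy_document` takes a JSON policy string as its `string` type suggests. The bucket name, policy name, and data source label are illustrative.

```hcl
# Bucket-scoped S3 access instead of a broad managed policy.
data "aws_iam_policy_document" "glue_s3" {
  statement {
    sid       = "ListBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-data-bucket"]
  }

  statement {
    sid       = "ReadWriteObjects"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-data-bucket/*"]
  }
}

module "glue" {
  source = "sourcefuse/arc-glue/aws"

  namespace   = "mycompany"
  environment = "prod"
  name        = "data-lake"

  iam_config = {
    create_role            = true
    create_custom_policy   = true
    custom_policy_name     = "glue-data-bucket-access" # illustrative name
    custom_policy_document = data.aws_iam_policy_document.glue_s3.json
  }
}
```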

VPC Configuration¶

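For private Glue jobs and connections. Note that `vpc_config` does not appear in the Inputs table above, so confirm that your module version accepts it; connection-level network settings are passed through `physical_connection_requirements` in `glue_connections`: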

vpc_config = {
  vpc_id             = "vpc-12345678"
  security_group_name = "glue-security-group"

  # Optional subnet configuration
  subnet_ids = ["subnet-12345", "subnet-67890"]

  # Security group rules
  ingress_rules = [{
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["10.0.0.0/8"]
  }]
}
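
The `glue_connections` input in the Inputs table carries network placement in `physical_connection_requirements`. The sketch below declares a `NETWORK` connection that places Glue workloads in a private subnet. The connection key, availability zone, subnet ID, and security group ID are placeholders, and passing an empty `connection_properties` map for a `NETWORK` connection is an assumption.

```hcl
module "glue" {
  source = "sourcefuse/arc-glue/aws"

  namespace   = "mycompany"
  environment = "prod"
  name        = "data-lake"

  glue_connections = {
    "private-network" = {
      connection_type       = "NETWORK"
      connection_properties = {} # required by the type; NETWORK needs no JDBC properties

      physical_connection_requirements = {
        availability_zone      = "us-east-1a"    # placeholder
        subnet_id              = "subnet-12345"  # placeholder
        security_group_id_list = ["sg-12345678"] # placeholder
      }
    }
  }
}
```

Glue expects the security group on a connection to allow inbound traffic from itself (a self-referencing rule), so workers can communicate with each other.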

Glue Job Types¶

The module supports multiple Glue job types:

Spark ETL Jobs:

glue_jobs = {
  "spark-job" = {
    role_arn     = aws_iam_role.glue.arn
    glue_version = "4.0"
    command = {
      name            = "glueetl"
      script_location = "s3://scripts/transform.py"
    }
    worker_type       = "G.2X" # Spark workers: G.1X, G.2X, G.4X, G.8X
    number_of_workers = 10
  }
}

Python Shell Jobs:

glue_jobs = {
  "python-job" = {
    role_arn     = aws_iam_role.glue.arn
    glue_version = "1.0"
    command = {
      name            = "pythonshell"
      script_location = "s3://scripts/process.py"
    }
    max_capacity = 0.0625 # DPUs; Python shell accepts 0.0625 or 1
  }
}
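
Ray jobs, listed under What It Does, are not shown above. A sketch of what one might look like with the `glue_jobs` schema from the Inputs table, following the shape AWS documents for Ray jobs (`glueray` command, `Z.2X` workers). The job key and script path are illustrative, and the `runtime` value should be checked against the Ray runtimes AWS Glue currently offers.

```hcl
glue_jobs = {
  "ray-job" = {
    role_arn     = aws_iam_role.glue.arn
    glue_version = "4.0"
    command = {
      name            = "glueray"
      python_version  = "3.9"
      runtime         = "Ray2.4" # assumption: verify the available Ray runtime
      script_location = "s3://scripts/ray_task.py"
    }
    worker_type       = "Z.2X" # the worker type used by Ray jobs
    number_of_workers = 5
  }
}
```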

Workflow Orchestration¶

Complex workflow orchestration with triggers:

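glue_config = {
  workflows = {
    "data-pipeline" = {
      description         = "End-to-end data processing pipeline"
      max_concurrent_runs = 1
    }
  }

  triggers = {
    "start-crawler" = {
      workflow_name = "data-pipeline"
      type          = "SCHEDULED"
      schedule      = "cron(0 2 * * ? *)"
      # Each action is an object naming a crawler or a job
      actions = [{ crawler_name = "s3-raw-data" }]
    }

    "start-etl" = {
      workflow_name = "data-pipeline"
      type          = "CONDITIONAL"
      # predicate is an object attribute: it takes "=" and a list of conditions.
      # s3-raw-data is a crawler, so the condition uses crawler_name/crawl_state.
      predicate = {
        conditions = [{
          crawler_name = "s3-raw-data"
          crawl_state  = "SUCCEEDED"
        }]
      }
      actions = [{ job_name = "etl-job" }]
    }
  }
}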
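
The trigger schema in the Inputs table also allows per-action `arguments` and `timeout`. A sketch of an on-demand trigger that passes a job argument follows. The trigger key and the `--backfill_date` argument are hypothetical, and whether `job_name` must be the map key from `glue_jobs` or the deployed job name depends on how the module resolves names, which this page does not state.

```hcl
glue_config = {
  triggers = {
    "manual-backfill" = {
      description = "Run the ETL job by hand for a specific date"
      type        = "ON_DEMAND"

      actions = [{
        job_name = "etl-job"
        arguments = {
          "--backfill_date" = "2024-01-01" # hypothetical script argument
        }
        timeout = 120 # minutes
      }]
    }
  }
}
```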

Security Considerations¶

Encryption¶

The module supports comprehensive encryption:

# Use custom KMS key
kms_key_arn = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"

# Configure security configuration
glue_config = {
  # security_configurations is a map; the key below is illustrative
  security_configurations = {
    "kms-encrypted" = {
      encryption_configuration = {
        s3_encryption = {
          s3_encryption_mode = "SSE-KMS"
          kms_key_arn        = "arn:aws:kms:us-east-1:123456789012:key/12345678"
        }
        cloudwatch_encryption = {
          cloudwatch_encryption_mode = "SSE-KMS"
          kms_key_arn                = "arn:aws:kms:us-east-1:123456789012:key/12345678"
        }
      }
    }
  }
}
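
Job bookmark state can be encrypted as well. According to the `security_configurations` type in the Inputs table, `job_bookmarks_encryption` accepts `CSE-KMS` or `DISABLED`. A sketch with an illustrative map key and a placeholder key ARN; the other two modes are set explicitly here only so that the example does not rely on the schema defaults.

```hcl
glue_config = {
  security_configurations = {
    "bookmarks-encrypted" = { # illustrative key
      encryption_configuration = {
        job_bookmarks_encryption = {
          job_bookmarks_encryption_mode = "CSE-KMS"
          kms_key_arn                   = "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
        }
        s3_encryption = {
          s3_encryption_mode = "SSE-S3" # S3-managed keys, no KMS key needed
        }
        cloudwatch_encryption = {
          cloudwatch_encryption_mode = "DISABLED"
        }
      }
    }
  }
}
```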

Network Security¶

VPC Integration: Deploy Glue resources in private VPC subnets
Security Groups: Control inbound/outbound traffic
Endpoints: Use VPC endpoints for private connectivity
IAM Policies: Implement least-privilege access
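
The Resources list for this module contains no VPC endpoint resources, so endpoints would be declared in the calling configuration. A sketch, assuming `us-east-1` and placeholder VPC, subnet, security group, and route table IDs:

```hcl
# Interface endpoint so workloads in private subnets can reach the Glue API.
resource "aws_vpc_endpoint" "glue" {
  vpc_id              = "vpc-12345678"
  service_name        = "com.amazonaws.us-east-1.glue"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = ["subnet-12345", "subnet-67890"]
  security_group_ids  = ["sg-12345678"]
  private_dns_enabled = true
}

# Gateway endpoint so scripts and data in S3 are reachable without a NAT.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = "vpc-12345678"
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = ["rtb-12345678"]
}
```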

Troubleshooting¶

Common Issues¶

1. Crawler Timeout

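glue_crawlers = {
  "my-crawler" = {
    database_name = "enterprise_database"
    # The glue_crawlers type (see Inputs) has no timeouts attribute.
    # For large datasets, reduce the work done per crawl instead.
    targets = {
      s3_targets = [{
        path        = "s3://my-data-bucket/raw/"
        sample_size = 10                   # files sampled per leaf folder
        exclusions  = ["**/_temporary/**"] # skip scratch paths
      }]
    }
  }
}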

2. Job Failures

Check CloudWatch Logs: /aws-glue/jobs/output
Verify IAM permissions: S3, CloudWatch, Glue
Validate script locations in S3
Review security group rules for network access

3. Connection Issues

Verify VPC endpoints for JDBC connections
Check security group ingress/egress rules
Validate credentials in Secrets Manager
Test connectivity from Glue to data source

Cost Optimization¶

Worker Type Selection¶

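Standard: Legacy worker type (4 vCPU, 16 GB memory per worker); prefer the G-series on current Glue versions
G.1X: 1 DPU per worker (4 vCPU, 16 GB memory); the usual starting point for transforms, joins, and queries
G.2X: 2 DPU per worker (8 vCPU, 32 GB memory); for memory-intensive workloads, at twice the per-worker DPU cost of G.1X
Z.2X: Ray jobs only (8 vCPU, 64 GB memory per worker); not a Spark worker type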

Job Configuration¶

glue_jobs = {
  "cost-optimized-job" = {
    # Flex runs on spare capacity at a lower rate; start times can vary
    execution_class = "FLEX" # STANDARD or FLEX
    worker_type     = "G.1X"

    # Configure timeouts to prevent runaway costs
    timeout = 60 # minutes

    command = {
      name            = "glueetl"
      script_location = "s3://scripts/incremental.py"
    }

    # Enable job bookmarks for incremental processing
    default_arguments = {
      "--job-bookmark-option" = "job-bookmark-enable"
    }

    # Reduce workers for testing
    number_of_workers = 2
  }
}

Best Practices¶

Resource Naming: Use consistent, descriptive resource names
Tagging Strategy: Implement comprehensive tagging for cost management
Incremental Processing: Use job bookmarks for efficiency
Testing: Test jobs in development environment before production
Monitoring: Set up CloudWatch alarms and metrics
Security: Regularly audit IAM permissions and security groups
Version Control: Store Glue scripts in version control
Documentation: Document job logic and data transformations
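
For the monitoring item, a CloudWatch alarm on failed Spark tasks might look like the sketch below. It assumes the module block is labelled `glue`, that a job exists under the key `transform-job`, and that job metrics are enabled for that job (for example through the `--enable-metrics` job argument). The alarm name and threshold are illustrative, and no notification target is wired in.

```hcl
resource "aws_cloudwatch_metric_alarm" "glue_failed_tasks" {
  alarm_name          = "glue-transform-job-failed-tasks" # illustrative name
  alarm_description   = "Fires when the transform job reports failed Spark tasks"
  namespace           = "Glue"
  metric_name         = "glue.driver.aggregate.numFailedTasks"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # no runs means no data, not a failure

  dimensions = {
    JobName  = module.glue.glue_job_names["transform-job"]
    JobRunId = "ALL"
    Type     = "count"
  }
}
```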

Development¶

Prerequisites¶

Configurations¶

Configure pre-commit hooks:

pre-commit install

Versioning¶

When contributing, state in your commit message whether the change is a major, minor, or patch change.

For example:

git commit -m "your commit message #major"

This marker determines which part of the version is bumped. If the commit message does not include one, the change is treated as a patch and the version is bumped accordingly.

Tests¶

Tests are available in the test directory.

Configure the dependencies

cd test/
go mod init github.com/sourcefuse/terraform-aws-refarch-<module_name>
go get github.com/gruntwork-io/terratest/modules/terraform

Now execute the test:

go test -timeout 30m

AI Assistant Integration (ARC IaC MCP)¶

The ARC IaC MCP Server is a hosted Model Context Protocol service that lets AI assistants browse, search, scaffold, compare, and security-scan any of the SourceFuse ARC Terraform modules — directly from natural language.

What you can do with it:

Discover — search and filter modules by keyword or AWS resource type.
Understand — get inputs, outputs, and resources for any module without leaving your editor.
Scaffold — generate production-ready, multi-file Terraform with cross-module wiring already done.
Secure — scan generated or existing HCL for misconfigurations before it hits a PR.
Compare — diff modules side-by-side to make informed architectural decisions.

Setup (one minute)¶

The MCP endpoint is https://arc-iac-mcp.sourcef.us/mcp. Pick your client:

Claude Code CLI:

claude mcp add arc-iac --transport http https://arc-iac-mcp.sourcef.us/mcp

Claude Desktop — edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "arc-iac": {
      "url": "https://arc-iac-mcp.sourcef.us/mcp"
    }
  }
}

Cursor / Windsurf / Kiro — add the same block to .cursor/mcp.json (or the equivalent for your client).

Example prompts to try¶

"List all ARC modules sorted by downloads"
"What inputs does arc-ecs require?"
"Scaffold a production-ready arc-db Aurora setup with Secrets Manager"
"Compare arc-eks and arc-ecs for running 10 microservices"
"Scan this Terraform before I raise a PR: <paste HCL>"

See the ARC IaC MCP repo for the full tool reference, troubleshooting tips, and local-development instructions.

Contributing¶

See CONTRIBUTING.md for commit conventions and development setup.

Authors¶

This project is authored by:

SourceFuse ARC Team
