
terraform-aws-arc-glue¶
Overview¶
SourceFuse AWS Reference Architecture (ARC) Terraform module for managing AWS Glue resources, providing a comprehensive, production-ready solution for deploying data catalog, ETL jobs, crawlers, workflows, and related components with enterprise-grade security and operational best practices.
Module Features¶
This module provides a complete AWS Glue infrastructure including:
- Data Catalog Management: Automated database and table creation with cross-account access policies
- ETL Job Orchestration: Support for Spark, Python Shell, and Ray jobs with configurable worker types
- Data Discovery: Multi-source crawlers (S3, JDBC, MongoDB, Delta Lake) with scheduling
- Workflow Automation: Complex workflow orchestration with triggers and dependencies
- Security & Compliance: KMS encryption, IAM role management, VPC integration, and secret management
- Enterprise Integration: JDBC connections for RDS/Redshift, MongoDB, and other data sources
- Monitoring & Logging: CloudWatch integration, job bookmarks, and execution tracking
Usage¶
To see a full example, check out the complete example or simple example files in the example folder.
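As a rough orientation, a minimal module call might look like the sketch below. It sets only the top-level inputs listed in the Inputs table. The registry address is an assumption derived from the module name, and the values are placeholders. The object-typed inputs (`glue_config`, `glue_jobs`, and so on) are left at their defaults because their schemas are not shown here.

```hcl
module "glue" {
  # Assumed registry address; confirm it and pin a version before use.
  source = "sourcefuse/arc-glue/aws"

  # Required inputs
  namespace   = "arc"       # placeholder organization identifier
  environment = "dev"       # e.g., dev, staging, prod
  name        = "analytics" # prefix for Glue resource names

  # Optional inputs
  region = "us-east-1"
  tags = {
    Project = "analytics"
  }
}

# Expose one of the documented module outputs.
output "glue_iam_role_arn" {
  value = module.glue.glue_iam_role_arn
}
```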
Requirements¶
| Name | Version |
|---|---|
| terraform | >= 1.5.0 |
| aws | >= 5.0, < 7.0 |
Providers¶
| Name | Version |
|---|---|
| aws | 6.40.0 |
Modules¶
No modules.
Resources¶
| Name | Type |
|---|---|
| aws_glue_catalog_database.main | resource |
| aws_glue_classifier.csv | resource |
| aws_glue_classifier.grok | resource |
| aws_glue_classifier.json | resource |
| aws_glue_classifier.xml | resource |
| aws_glue_connection.main | resource |
| aws_glue_crawler.main | resource |
| aws_glue_job.main | resource |
| aws_glue_security_configuration.main | resource |
| aws_glue_trigger.main | resource |
| aws_glue_workflow.main | resource |
| aws_iam_role.glue | resource |
| aws_iam_role_policy_attachment.glue_basic | resource |
| aws_iam_role_policy_attachment.glue_custom | resource |
| aws_iam_role_policy_attachment.glue_s3 | resource |
| aws_secretsmanager_secret.main | resource |
| aws_secretsmanager_secret_version.main | resource |
| aws_caller_identity.current | data source |
| aws_iam_policy_document.assume_role | data source |
Inputs¶
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| environment | Environment identifier (e.g., dev, staging, prod) | string | n/a | yes |
| glue_config | AWS Glue configuration | object({ | {} | no |
| glue_connections | Glue connections to create. Kept separate from glue_config to avoid for_each unknown-value issues when connection_properties contain apply-time values. | map(object({ | {} | no |
| glue_crawlers | Glue crawlers. Kept separate from glue_config to avoid for_each unknown-value issues when targets contain apply-time values. | map(object({ | {} | no |
| glue_jobs | Glue jobs. Kept separate from glue_config to avoid for_each unknown-value issues when script_location contains apply-time values. | map(object({ | {} | no |
| iam_config | IAM configuration for Glue resources | object({ | {} | no |
| kms_key_arn | Optional KMS key ARN to use for Glue encryption. If not provided, the module will create its own KMS key but won't use it for security configurations, to avoid circular dependencies. | string | null | no |
| name | Name prefix for AWS Glue resources | string | n/a | yes |
| namespace | Namespace (organization) identifier for resources | string | n/a | yes |
| region | AWS region where resources will be created | string | "us-east-1" | no |
| secrets_config | Secrets Manager configuration for storing credentials | object({ | {} | no |
| tags | Default tags to apply to all resources | map(string) | {} | no |
Outputs¶
| Name | Description |
|---|---|
| aws_account_id | The AWS account ID where resources are created |
| glue_connection_names | Map of connection key to name |
| glue_crawler_arns | Map of crawler name to ARN |
| glue_crawler_names | Map of crawler key to name |
| glue_database_arn | The ARN of the Glue catalog database |
| glue_database_id | The ID of the Glue catalog database |
| glue_database_name | The name of the Glue catalog database |
| glue_iam_role_arn | The ARN of the Glue IAM role |
| glue_iam_role_id | The ID of the Glue IAM role |
| glue_iam_role_name | The name of the Glue IAM role |
| glue_job_arns | Map of job key to ARN |
| glue_job_names | Map of job key to name |
| glue_secret_arns | Map of secret key to ARN |
| glue_security_configurations | Map of security configuration key to name |
| glue_workflows | Map of workflow key to workflow object |
| module_version | The version of this module |
| resource_prefix | The resource prefix used for naming |
Examples¶
Simple Example¶
Basic configuration with database and S3 crawler:
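A minimal sketch of what a database plus an S3 crawler amounts to, written against the underlying AWS provider resources rather than the module's own inputs. The resource names, bucket path, schedule, and the `glue_role_arn` variable are hypothetical.

```hcl
variable "glue_role_arn" {
  type        = string
  description = "ARN of an IAM role that Glue can assume (hypothetical input)"
}

# Catalog database that will hold the tables the crawler discovers.
resource "aws_glue_catalog_database" "example" {
  name = "example_raw"
}

# Crawler that scans an S3 prefix nightly and writes table definitions
# into the database above.
resource "aws_glue_crawler" "example" {
  name          = "example-raw-crawler"
  database_name = aws_glue_catalog_database.example.name
  role          = var.glue_role_arn
  schedule      = "cron(0 2 * * ? *)"

  s3_target {
    path = "s3://example-bucket/raw/" # placeholder bucket and prefix
  }
}
```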
Complete Example¶
Enterprise configuration with jobs, workflows, and connections:
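The connection part of such a setup can be sketched with the underlying AWS provider resources as follows. This is an illustration under assumptions, not the module's own input syntax: the JDBC URL, the secret name, and the variables are all hypothetical.

```hcl
variable "db_username" {
  type = string
}

variable "db_password" {
  type      = string
  sensitive = true
}

# Credentials stored in Secrets Manager; Glue reads the "username" and
# "password" keys from the secret.
resource "aws_secretsmanager_secret" "db" {
  name = "example/glue/db-credentials"
}

resource "aws_secretsmanager_secret_version" "db" {
  secret_id = aws_secretsmanager_secret.db.id
  secret_string = jsonencode({
    username = var.db_username
    password = var.db_password
  })
}

# JDBC connection that references the secret instead of inline credentials.
resource "aws_glue_connection" "jdbc" {
  name            = "example-jdbc"
  connection_type = "JDBC"

  connection_properties = {
    JDBC_CONNECTION_URL = "jdbc:postgresql://db.example.internal:5432/app" # placeholder
    SECRET_ID           = aws_secretsmanager_secret.db.name
  }
}
```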
Module Configuration Details¶
IAM Configuration¶
The module can create and manage IAM roles with appropriate permissions:
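The role setup corresponds roughly to the following provider resources, which mirror the `aws_iam_policy_document.assume_role`, `aws_iam_role.glue`, and `aws_iam_role_policy_attachment.glue_basic` entries in the Resources table. The role name is a placeholder, and the module's actual policy set may differ.

```hcl
# Trust policy that lets the Glue service assume the role.
data "aws_iam_policy_document" "assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "glue" {
  name               = "example-glue-role" # placeholder name
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}

# AWS managed policy with the baseline permissions Glue needs.
resource "aws_iam_role_policy_attachment" "glue_basic" {
  role       = aws_iam_role.glue.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
```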
VPC Configuration¶
For private Glue jobs and connections:
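One way to place Glue workloads inside a VPC is a `NETWORK` connection, sketched here with the AWS provider resource directly. The connection name and the three variables are hypothetical, and the module's own VPC inputs may be shaped differently.

```hcl
variable "availability_zone" {
  type = string
}

variable "subnet_id" {
  type = string
}

variable "security_group_ids" {
  type = list(string)
}

# Network connection: jobs and crawlers that attach it run their
# elastic network interfaces in the given private subnet.
resource "aws_glue_connection" "network" {
  name            = "example-network"
  connection_type = "NETWORK"

  physical_connection_requirements {
    availability_zone      = var.availability_zone
    subnet_id              = var.subnet_id
    security_group_id_list = var.security_group_ids
  }
}
```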
Glue Job Types¶
The module supports multiple Glue job types:
Spark ETL Jobs:
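A Spark ETL job, sketched with the underlying `aws_glue_job` resource. The job name, script path, Glue version, and worker sizing are illustrative assumptions rather than module defaults.

```hcl
variable "glue_role_arn" {
  type        = string
  description = "ARN of the IAM role the job runs as (hypothetical input)"
}

resource "aws_glue_job" "spark" {
  name              = "example-spark-etl"
  role_arn          = var.glue_role_arn
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 2

  command {
    name            = "glueetl" # Spark ETL job type
    script_location = "s3://example-bucket/scripts/etl.py" # placeholder
    python_version  = "3"
  }
}
```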
Python Shell Jobs:
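A Python Shell job, again sketched with `aws_glue_job`. Python Shell jobs are sized with `max_capacity` instead of worker types. The job name and script path are placeholders.

```hcl
variable "glue_role_arn" {
  type        = string
  description = "ARN of the IAM role the job runs as (hypothetical input)"
}

resource "aws_glue_job" "shell" {
  name         = "example-python-shell"
  role_arn     = var.glue_role_arn
  max_capacity = 0.0625 # smallest capacity available to Python Shell jobs

  command {
    name            = "pythonshell"
    script_location = "s3://example-bucket/scripts/task.py" # placeholder
    python_version  = "3.9"
  }
}
```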
Workflow Orchestration¶
Complex workflow orchestration with triggers:
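A two-step workflow can be sketched with the provider resources as follows: a scheduled trigger starts the first job, and a conditional trigger starts the second once the first succeeds. The workflow name, schedule, and the two job-name variables are hypothetical.

```hcl
variable "extract_job_name" {
  type = string
}

variable "load_job_name" {
  type = string
}

resource "aws_glue_workflow" "nightly" {
  name = "example-nightly"
}

# Entry point: start the extract job on a schedule.
resource "aws_glue_trigger" "start" {
  name          = "example-nightly-start"
  type          = "SCHEDULED"
  schedule      = "cron(0 3 * * ? *)"
  workflow_name = aws_glue_workflow.nightly.name

  actions {
    job_name = var.extract_job_name
  }
}

# Dependency: run the load job only after the extract job succeeds.
resource "aws_glue_trigger" "after_extract" {
  name          = "example-nightly-after-extract"
  type          = "CONDITIONAL"
  workflow_name = aws_glue_workflow.nightly.name

  predicate {
    conditions {
      job_name = var.extract_job_name
      state    = "SUCCEEDED"
    }
  }

  actions {
    job_name = var.load_job_name
  }
}
```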
Security Considerations¶
Encryption¶
The module supports comprehensive encryption:
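A Glue security configuration covering CloudWatch logs, job bookmarks, and S3 output might look like this sketch, written with the `aws_glue_security_configuration` resource directly. The configuration name is a placeholder, and the variable mirrors the module's `kms_key_arn` input only by name.

```hcl
variable "kms_key_arn" {
  type        = string
  description = "ARN of the KMS key used for Glue encryption"
}

resource "aws_glue_security_configuration" "example" {
  name = "example-security-config"

  encryption_configuration {
    cloudwatch_encryption {
      cloudwatch_encryption_mode = "SSE-KMS"
      kms_key_arn                = var.kms_key_arn
    }

    job_bookmarks_encryption {
      job_bookmarks_encryption_mode = "CSE-KMS"
      kms_key_arn                   = var.kms_key_arn
    }

    s3_encryption {
      s3_encryption_mode = "SSE-KMS"
      kms_key_arn        = var.kms_key_arn
    }
  }
}
```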
Network Security¶
- VPC Integration: Deploy Glue resources in private VPC subnets
- Security Groups: Control inbound/outbound traffic
- Endpoints: Use VPC endpoints for private connectivity
- IAM Policies: Implement least-privilege access
Troubleshooting¶
Common Issues¶
1. Crawler Timeout
2. Job Failures
- Check CloudWatch Logs: /aws-glue/jobs/output
- Verify IAM permissions: S3, CloudWatch, Glue
- Validate script locations in S3
- Review security group rules for network access
3. Connection Issues
- Verify VPC endpoints for JDBC connections
- Check security group ingress/egress rules
- Validate credentials in Secrets Manager
- Test connectivity from Glue to data source
Cost Optimization¶
Worker Type Selection¶
- Standard: Cost-effective for simple transformations
- G.1X: Memory-intensive workloads
- G.2X: Balanced compute/memory
- Z.2X: Memory-optimized workers for Ray jobs
Job Configuration¶
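Cost-related job settings can be sketched with `aws_glue_job` as follows: a timeout to cap runaway runs, a small retry budget, the Flex execution class for jobs that are not time-sensitive, and job bookmarks so that reruns skip data already processed. All names and values here are illustrative assumptions, not module defaults.

```hcl
variable "glue_role_arn" {
  type        = string
  description = "ARN of the IAM role the job runs as (hypothetical input)"
}

resource "aws_glue_job" "tuned" {
  name              = "example-cost-tuned"
  role_arn          = var.glue_role_arn
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 2

  timeout         = 60     # minutes; stops runaway runs
  max_retries     = 1
  execution_class = "FLEX" # spare-capacity pricing for non-urgent jobs

  default_arguments = {
    "--job-bookmark-option" = "job-bookmark-enable" # incremental processing
  }

  command {
    name            = "glueetl"
    script_location = "s3://example-bucket/scripts/etl.py" # placeholder
    python_version  = "3"
  }
}
```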
Best Practices¶
- Resource Naming: Use consistent, descriptive resource names
- Tagging Strategy: Implement comprehensive tagging for cost management
- Incremental Processing: Use job bookmarks for efficiency
- Testing: Test jobs in development environment before production
- Monitoring: Set up CloudWatch alarms and metrics
- Security: Regularly audit IAM permissions and security groups
- Version Control: Store Glue scripts in version control
- Documentation: Document job logic and data transformations
Development¶
Prerequisites¶
Configurations¶
- Configure pre-commit hooks
Versioning¶
When contributing or making a git commit, please specify in your commit message whether the change is major, minor, or patch.
For Example
Tests¶
- Tests are available in the test directory
- Configure the dependencies
- Now execute the test
Authors¶
This project is authored by:
- SourceFuse ARC Team