
Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR


You can use the Amazon EMR Steps API to submit Apache Hive, Apache Spark, and other types of applications to an EMR cluster. You can invoke the Steps API using Apache Airflow, AWS Step Functions, the AWS Command Line Interface (AWS CLI), all the AWS SDKs, and the AWS Management Console. Jobs submitted with the Steps API use the Amazon Elastic Compute Cloud (Amazon EC2) instance profile to access AWS resources such as Amazon Simple Storage Service (Amazon S3) buckets, AWS Glue tables, and Amazon DynamoDB tables from the cluster.

Previously, if a step needed access to a specific S3 bucket and another step needed access to a specific DynamoDB table, the AWS Identity and Access Management (IAM) policy attached to the instance profile had to allow access to both the S3 bucket and the DynamoDB table. This meant that the IAM policies you assigned to the instance profile had to contain a union of all the permissions for every step that ran on an EMR cluster.

We're happy to introduce runtime roles for EMR steps. A runtime role is an IAM role that you associate with an EMR step, and jobs use this role to access AWS resources. With runtime roles for EMR steps, you can now specify different IAM roles for the Spark and the Hive jobs, thereby scoping down access at a job level. This allows you to simplify access controls on a single EMR cluster that is shared between multiple tenants, wherein each tenant can be easily isolated using IAM roles.

The ability to specify an IAM role with a job is also available on Amazon EMR on EKS and Amazon EMR Serverless. You can also use AWS Lake Formation to apply table- and column-level permissions for Apache Hive and Apache Spark jobs that are submitted with EMR steps. For more information, refer to Configure runtime roles for Amazon EMR steps.

In this post, we dive deeper into runtime roles for EMR steps, helping you understand how the various pieces work together, and how each step is isolated on an EMR cluster.

Solution overview

In this post, we walk through the following:

  1. Create an EMR cluster enabled to use the new role-based access control with EMR steps.
  2. Create two IAM roles with different permissions in terms of the Amazon S3 data and Lake Formation tables they can access.
  3. Allow the IAM principal submitting the EMR steps to use these two IAM roles.
  4. See how EMR steps running with the same code and trying to access the same data have different permissions based on the runtime role specified at submission time.
  5. See how to monitor and control actions using source identity propagation.

Set up the EMR cluster security configuration

Amazon EMR security configurations simplify applying consistent security, authorization, and authentication options across your clusters. You can create a security configuration on the Amazon EMR console or via the AWS CLI or AWS SDK. When you attach a security configuration to a cluster, Amazon EMR applies the settings in the security configuration to the cluster. You can attach a security configuration to multiple clusters at creation time, but can't apply them to a running cluster.

To enable runtime roles for EMR steps, we have to create a security configuration as shown in the following code and enable the runtime roles property (configured via EnableApplicationScopedIAMRole). In addition to the runtime roles, we're enabling propagation of the source identity (configured via PropagateSourceIdentity) and support for Lake Formation (configured via LakeFormationConfiguration). The source identity is a mechanism to monitor and control actions taken with assumed roles. Enabling Propagate source identity allows you to audit actions performed using the runtime role. Lake Formation is an AWS service to securely manage a data lake, which includes defining and enforcing central access control policies for your data lake.

Create a file called step-runtime-roles-sec-cfg.json with the following content:

{
    "AuthorizationConfiguration": {
        "IAMConfiguration": {
            "EnableApplicationScopedIAMRole": true,
            "ApplicationScopedIAMRoleConfiguration": 
                {
                    "PropagateSourceIdentity": true
                }
        },
        "LakeFormationConfiguration": {
            "AuthorizedSessionTagValue": "Amazon EMR"
        }
    }
}

Create the Amazon EMR security configuration:

aws emr create-security-configuration \
--name 'iamconfig-with-iam-lf' \
--security-configuration file://step-runtime-roles-sec-cfg.json

You can also do the same via the Amazon EMR console:

  1. On the Amazon EMR console, choose Security configurations in the navigation pane.
  2. Choose Create.
  3. For Security configuration name, enter a name.
  4. For Security configuration setup options, select Choose custom settings.
  5. For IAM role for applications, select Runtime role.
  6. Select Propagate source identity to audit actions performed using the runtime role.
  7. For Fine-grained access control, select AWS Lake Formation.
  8. Complete the security configuration.

The security configuration appears in your security configuration list. You can also see that the authorization mechanism listed here is the runtime role instead of the instance profile.
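To verify the configuration from the AWS CLI, you can also describe it by the name we used earlier:

aws emr describe-security-configuration \
--name 'iamconfig-with-iam-lf'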

Launch the cluster

Now we launch an EMR cluster and specify the security configuration we created. For more information, refer to Specify a security configuration for a cluster.

The following code provides the AWS CLI command for launching an EMR cluster with the appropriate security configuration. Note that this cluster is launched on the default VPC and public subnet with the default IAM roles. In addition, the cluster is launched with one primary and one core instance of the specified instance type. For more details on how to customize the launch parameters, refer to create-cluster.

If the default EMR roles EMR_EC2_DefaultRole and EMR_DefaultRole don't exist in IAM in your account (this is the first time you're launching an EMR cluster with these), use the following command to create them before launching the cluster:

aws emr create-default-roles

Create the cluster with the following code:

#Replace with your Key Pair
KEYPAIR=<MY_KEYPAIR>
INSTANCE_TYPE="r4.4xlarge"
#Replace with your Security Configuration Name
SECURITY_CONFIG="iamconfig-with-iam-lf"
#Replace with your S3 log URI
LOG_URI="s3://mybucket/logs/"

aws emr create-cluster \
--name "iam-passthrough-cluster" \
--release-label emr-6.7.0 \
--use-default-roles \
--security-configuration $SECURITY_CONFIG \
--ec2-attributes KeyName=$KEYPAIR \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE InstanceGroupType=CORE,InstanceCount=1,InstanceType=$INSTANCE_TYPE \
--applications Name=Spark Name=Hadoop Name=Hive \
--log-uri $LOG_URI

When the cluster is fully provisioned (Waiting state), let's try to run a step on it with runtime roles for EMR steps enabled:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "--class",
              "org.apache.spark.examples.SparkPi",
              "/usr/lib/spark/examples/jars/spark-examples.jar",
              "5"
            ]
        }]'

After launching the command, we receive the following as output:

An error occurred (ValidationException) when calling the AddJobFlowSteps operation: Runtime roles are required for this cluster. Please specify the role using the ExecutionRoleArn parameter.

The step failed, asking us to provide a runtime role. In the next section, we set up two IAM roles with different permissions and use them as the runtime roles for EMR steps.

Set up IAM roles as runtime roles

Any IAM role that you want to use as a runtime role for EMR steps must have a trust policy that allows the EMR cluster's EC2 instance profile to assume it. In our setup, we're using the default IAM role EMR_EC2_DefaultRole as the instance profile role. In addition, we create two IAM roles called test-emr-demo1 and test-emr-demo2 that we use as runtime roles for EMR steps.

The following code is the trust policy for both of the IAM roles, which lets the EMR cluster's EC2 instance profile role, EMR_EC2_DefaultRole, assume these roles and set the source identity and LakeFormationAuthorizedCaller tag on the role sessions. The TagSession permission is needed so that Amazon EMR can authorize to Lake Formation. The SetSourceIdentity statement is needed for the propagate source identity feature.

Create a file called trust-policy.json with the following content (replace 123456789012 with your AWS account ID):

{
    "Model": "2012-10-17",
    "Assertion": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:SetSourceIdentity"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:TagSession",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR"
                }
            }
        }
    ]
}

Use that policy to create the two IAM roles, test-emr-demo1 and test-emr-demo2:

aws iam create-role \
--role-name test-emr-demo1 \
--assume-role-policy-document file://trust-policy.json

aws iam create-role \
--role-name test-emr-demo2 \
--assume-role-policy-document file://trust-policy.json

Set up permissions for the principal submitting the EMR steps with runtime roles

The IAM principal submitting the EMR steps needs permission to invoke the AddJobFlowSteps API. In addition, you can use the condition key elasticmapreduce:ExecutionRoleArn to control access to specific IAM roles. For example, the following policy allows the IAM principal to only use the IAM roles test-emr-demo1 and test-emr-demo2 as the runtime roles for EMR steps.

  1. Create the job-submitter-policy.json file with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AddStepsWithSpecificExecRoleArn",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:AddJobFlowSteps"
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "elasticmapreduce:ExecutionRoleArn": [
                            "arn:aws:iam::123456789012:role/test-emr-demo1",
                            "arn:aws:iam::123456789012:role/test-emr-demo2"
                        ]
                    }
                }
            },
            {
                "Sid": "EMRDescribeCluster",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster"
                ],
                "Resource": "*"
            }
        ]
    }

  2. Create the IAM policy with the following code:
    aws iam create-policy \
    --policy-name emr-runtime-roles-submitter-policy \
    --policy-document file://job-submitter-policy.json

  3. Attach this policy to the IAM principal (IAM user or IAM role) you're going to use to submit the EMR steps (replace 123456789012 with your AWS account ID and replace john with the IAM user you use to submit your EMR steps):
    aws iam attach-user-policy \
    --user-name john \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-submitter-policy"

The IAM user john can now submit steps using arn:aws:iam::123456789012:role/test-emr-demo1 and arn:aws:iam::123456789012:role/test-emr-demo2 as the step runtime roles.
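If you submit steps programmatically rather than via the AWS CLI, the runtime role is passed through the ExecutionRoleArn parameter of the AddJobFlowSteps API. The following is a minimal boto3 sketch; the cluster ID and account ID are placeholders:

import boto3

emr = boto3.client("emr")

# Submit the SparkPi example with test-emr-demo1 as the runtime role
# (j-XXXXXXXXXXXXX and 123456789012 are placeholders)
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "Spark Example",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "org.apache.spark.examples.SparkPi",
                "/usr/lib/spark/examples/jars/spark-examples.jar",
                "5",
            ],
        },
    }],
    ExecutionRoleArn="arn:aws:iam::123456789012:role/test-emr-demo1",
)
print(response["StepIds"])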

Use runtime roles with EMR steps

We now prepare our setup to show runtime roles for EMR steps in action.

Set up Amazon S3

To set up your Amazon S3 data, complete the following steps:

  1. Create a CSV file called test.csv with the following content (matching the rows shown in the step output later in this post):
    1,a,1a
    2,b,2b

  2. Upload the file to Amazon S3 in three different locations:
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    aws s3 cp test.csv s3://${BUCKET_NAME}/demo1/
    aws s3 cp test.csv s3://${BUCKET_NAME}/demo2/
    aws s3 cp test.csv s3://${BUCKET_NAME}/nondemo/

    For our initial test, we use a PySpark application called test.py with the following contents:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("my app").enableHiveSupport().getOrCreate()
    
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/demo1/test.csv").show()
      print("Accessed demo1")
    except:
      print("Could not access demo1")
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/demo2/test.csv").show()
      print("Accessed demo2")
    except:
      print("Could not access demo2")
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/nondemo/test.csv").show()
      print("Accessed nondemo")
    except:
      print("Could not access nondemo")
    spark.stop()

    In the script, we're trying to access the CSV file present under three different prefixes in the test bucket.

  3. Upload the Spark application inside the same S3 bucket where we placed the test.csv file, but in a different location:
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    aws s3 cp test.py s3://${BUCKET_NAME}/scripts/

Set up runtime role permissions

To show how runtime roles for EMR steps work, we assign different IAM permissions for accessing Amazon S3 to the roles we created. The following table summarizes the grants we provide to each role (emr-steps-roles-new-us-east-1 is the bucket you configured in the previous section).

S3 location                                     test-emr-demo1   test-emr-demo2
s3://emr-steps-roles-new-us-east-1/*            No Access        No Access
s3://emr-steps-roles-new-us-east-1/demo1/*      Full Access      No Access
s3://emr-steps-roles-new-us-east-1/demo2/*      No Access        Full Access
s3://emr-steps-roles-new-us-east-1/scripts/*    Read Access      Read Access
  1. Create the file demo1-policy.json with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo1",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo1/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts/*"
                ]
            }
        ]
    }

  2. Create the file demo2-policy.json with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo2",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo2/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts/*"
                ]
            }
        ]
    }

  3. Create our IAM policies:
    aws iam create-policy \
    --policy-name test-emr-demo1-policy \
    --policy-document file://demo1-policy.json
    
    aws iam create-policy \
    --policy-name test-emr-demo2-policy \
    --policy-document file://demo2-policy.json

  4. Attach to each role the related policy (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name test-emr-demo1 \
    --policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo1-policy"
    
    aws iam attach-role-policy \
    --role-name test-emr-demo2 \
    --policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo2-policy"

    To use runtime roles with Amazon EMR steps, we need to add the following policy to our EMR cluster's EC2 instance profile (in this example EMR_EC2_DefaultRole). With this policy, the underlying EC2 instances for the EMR cluster can assume the runtime role and apply a tag to that runtime role.

  5. Create the file runtime-roles-policy.json with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Sid": "AllowRuntimeRoleUsage",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole",
                    "sts:TagSession",
                    "sts:SetSourceIdentity"
                ],
                "Resource": [
                    "arn:aws:iam::123456789012:role/test-emr-demo1",
                    "arn:aws:iam::123456789012:role/test-emr-demo2"
                ]
            }
        ]
    }

  6. Create the IAM policy:
    aws iam create-policy \
    --policy-name emr-runtime-roles-policy \
    --policy-document file://runtime-roles-policy.json

  7. Attach the created policy to the EMR cluster's EC2 instance profile, in this example EMR_EC2_DefaultRole:
    aws iam attach-role-policy \
    --role-name EMR_EC2_DefaultRole \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-policy"

Test permissions with runtime roles

We're now ready to perform our first test. We run the test.py script, previously uploaded to Amazon S3, two times as Spark steps: first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.

To run an EMR step specifying a runtime role, you need the latest version of the AWS CLI. For more details about updating the AWS CLI, refer to Installing or updating the latest version of the AWS CLI.

Let's submit a step specifying test-emr-demo1 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

This command returns an EMR step ID. To check our step output logs, we can proceed in two different ways:

  • From the Amazon EMR console – On the Steps tab, choose the View logs link related to the specific step ID and select stdout.
  • From Amazon S3 – While launching our cluster, we configured an S3 location for logging. We can find our step logs under $(LOG_URI)/steps/<stepID>/stdout.gz.

The logs could take a couple of minutes to populate after the step is marked as Completed.
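From the command line, you can also stream the step stdout directly from Amazon S3 once the logs land; a hedged sketch follows (the step ID is a placeholder, and the exact prefix, which may include the cluster ID, depends on how logging was configured for your cluster):

aws s3 cp "${LOG_URI}j-XXXXXXXXXXXXX/steps/s-XXXXXXXXXXXX/stdout.gz" - | gunzip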

The following is the output of the EMR step with test-emr-demo1 as the runtime role:

+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  a| 1a|
|  2|  b| 2b|
+---+---+---+

Accessed demo1
Could not access demo2
Could not access nondemo

As we can see, only the demo1 folder was accessible to our application.

Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0017 was launched with the user 6GC64F33KUW4Q2JY6LKR7UAHWETKKXYL. We can confirm this by connecting to the EMR primary instance using SSH and using the YARN CLI:

[hadoop@ip-xx-xx-xx-xx]$ yarn application -status application_1656350436159_0017
...
Application-Id : application_1656350436159_0017
Application-Name : my app
Application-Type : SPARK
User : 6GC64F33KUW4Q2JY6LKR7UAHWETKKXYL
Queue : default
Application Priority : 0
...

Note that in your case, the YARN application ID and the user will be different.

Now we submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo2

The following is the output of the EMR step with test-emr-demo2 as the runtime role:

Could not access demo1
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  a| 1a|
|  2|  b| 2b|
+---+---+---+

Accessed demo2
Could not access nondemo

As we can see, only the demo2 folder was accessible to our application.

Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0018 was launched with a different user, 7T2ORHE6Z4Q7PHLN725C2CVWILZWYOLE. We can confirm this by using the YARN CLI:

[hadoop@ip-xx-xx-xx-xx]$ yarn application -status application_1656350436159_0018
...
Application-Id : application_1656350436159_0018
Application-Name : my app
Application-Type : SPARK
User : 7T2ORHE6Z4Q7PHLN725C2CVWILZWYOLE
Queue : default
Application Priority : 0
...

Each step was only able to access the CSV file allowed by its runtime role: the first step could only access s3://emr-steps-roles-new-us-east-1/demo1/test.csv, and the second step could only access s3://emr-steps-roles-new-us-east-1/demo2/test.csv. In addition, we observed that Amazon EMR created a unique user for each step and used that user to run the job. Note that both roles need at least read access to the S3 location where the step scripts are located (for example, s3://emr-steps-roles-new-us-east-1/scripts/test.py).

Now that we've seen how runtime roles for EMR steps work, let's look at how we can use Lake Formation to apply fine-grained access controls with EMR steps.

Use Lake Formation-based access control with EMR steps

You can use Lake Formation to apply table- and column-level permissions with Apache Spark and Apache Hive jobs submitted as EMR steps. First, the data lake admin in Lake Formation needs to register Amazon EMR as the AuthorizedSessionTagValue to enforce Lake Formation permissions on EMR. Lake Formation uses this session tag to authorize callers and provide access to the data lake. The Amazon EMR value is referenced inside the step-runtime-roles-sec-cfg.json file we used earlier when we created the EMR security configuration, and inside the trust-policy.json file we used to create the two runtime roles test-emr-demo1 and test-emr-demo2.

We can do so on the Lake Formation console in the External data filtering section (replace 123456789012 with your AWS account ID).

On the IAM runtime roles' trust policy, we already have the sts:TagSession permission with the condition "aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR". So we're ready to proceed.
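If you prefer the AWS CLI over the console for this registration, a hedged sketch using put-data-lake-settings follows; note that this call replaces the existing data lake settings, so fetch and edit them first (the account ID in the allow list is a placeholder):

# Fetch the current settings, because put-data-lake-settings overwrites them
aws lakeformation get-data-lake-settings > settings.json

# In settings.json, add to the DataLakeSettings object:
#   "AllowExternalDataFiltering": true,
#   "ExternalDataFilteringAllowList": [{"DataLakePrincipalIdentifier": "123456789012"}],
#   "AuthorizedSessionTagValueList": ["Amazon EMR"]
# Then apply the edited settings:
aws lakeformation put-data-lake-settings --cli-input-json file://settings.json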

To demonstrate how Lake Formation works with EMR steps, we create one database named entities with two tables named users and products, and we assign in Lake Formation the grants summarized in the following table.

IAM Role          entities.users (table)               entities.products (table)
test-emr-demo1    Full Read Access                     No Access
test-emr-demo2    Read Access on columns: uid, state   Full Read Access

Prepare the Amazon S3 files

We first prepare our Amazon S3 files.

  1. Create the users.csv file with the following content:
    00005678,john,pike,england,london,Hidden Road 78
    00009039,paolo,rossi,italy,milan,Via degli Alberi 56A
    00009057,july,finn,germany,berlin,Green Road 90

  2. Create the products.csv file with the following content:
    P0000789,Bike2000,Sport
    P0000567,CoverToCover,Smartphone
    P0005677,Whiteboard X786,Home

  3. Upload these files to Amazon S3 in two different locations:
    #Replace this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    aws s3 cp users.csv s3://${BUCKET_NAME}/entities-database/users/
    aws s3 cp products.csv s3://${BUCKET_NAME}/entities-database/products/

Prepare the database and tables

We can create our entities database by using the AWS Glue APIs.

  1. Create the entities-db.json file with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "DatabaseInput": {
            "Name": "entities",
            "LocationUri": "s3://emr-steps-roles-new-us-east-1/entities-database/",
            "CreateTableDefaultPermissions": []
        }
    }

  2. With a Lake Formation admin user, run the following command to create our database:
    aws glue create-database \
    --cli-input-json file://entities-db.json

    We also use the AWS Glue APIs to create the tables users and products.

  3. Create the users-table.json file with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "TableInput": {
            "Name": "users",
            "StorageDescriptor": {
                "Columns": [{
                        "Name": "uid",
                        "Type": "string"
                    },
                    {
                        "Name": "name",
                        "Type": "string"
                    },
                    {
                        "Name": "surname",
                        "Type": "string"
                    },
                    {
                        "Name": "state",
                        "Type": "string"
                    },
                    {
                        "Name": "city",
                        "Type": "string"
                    },
                    {
                        "Name": "address",
                        "Type": "string"
                    }
                ],
                "Location": "s3://emr-steps-roles-new-us-east-1/entities-database/customers/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "discipline.delim": ",",
                        "serialization.format": ","
                    }
                }
            },
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {
                "EXTERNAL": "TRUE"
            }
        }
    }

  4. Create the products-table.json file with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "TableInput": {
            "Name": "products",
            "StorageDescriptor": {
                "Columns": [{
                        "Name": "product_id",
                        "Type": "string"
                    },
                    {
                        "Name": "name",
                        "Type": "string"
                    },
                    {
                        "Name": "category",
                        "Type": "string"
                    }
                ],
                "Location": "s3://emr-steps-roles-new-us-east-1/entities-database/merchandise/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "discipline.delim": ",",
                        "serialization.format": ","
                    }
                }
            },
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {
                "EXTERNAL": "TRUE"
            }
        }
    }

  5. With a Lake Formation admin user, create our tables with the following commands:
    aws glue create-table \
        --database-name entities \
        --cli-input-json file://users-table.json
        
    aws glue create-table \
        --database-name entities \
        --cli-input-json file://products-table.json

Set up the Lake Formation data lake locations

To access our tables' data in Amazon S3, Lake Formation needs read/write access to it. To achieve that, we have to register the Amazon S3 locations where our data resides and specify for them which IAM role to obtain credentials from.

Let's create our IAM role for the data access.

  1. Create a file called trust-policy-data-access-role.json with the following content:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": "lakeformation.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  2. Use the policy to create the IAM role emr-demo-lf-data-access-role:
    aws iam create-role \
    --role-name emr-demo-lf-data-access-role \
    --assume-role-policy-document file://trust-policy-data-access-role.json

  3. Create the file data-access-role-policy.json with the following content (replace emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1"
                ]
            }
        ]
    }

  4. Create our IAM policy:
    aws iam create-policy \
    --policy-name data-access-role-policy \
    --policy-document file://data-access-role-policy.json

  5. Attach the created policy to emr-demo-lf-data-access-role (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name emr-demo-lf-data-access-role \
    --policy-arn "arn:aws:iam::123456789012:policy/data-access-role-policy"

    We can now register our data location in Lake Formation.

  6. On the Lake Formation console, choose Data lake locations in the navigation pane.
  7. Here we can register our S3 location containing data for our two tables and choose the created emr-demo-lf-data-access-role IAM role, which has read/write access to that location.

For more details about adding an Amazon S3 location to your data lake and configuring your IAM data access roles, refer to Adding an Amazon S3 location to your data lake.
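As a CLI alternative to the console steps above, a hedged sketch of the same registration with register-resource (replace 123456789012 with your AWS account ID and the bucket name with yours):

aws lakeformation register-resource \
--resource-arn arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database \
--role-arn arn:aws:iam::123456789012:role/emr-demo-lf-data-access-role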

Enforce Lake Formation permissions

To make sure we're using Lake Formation permissions, we should confirm that we don't have any grants set up for the principal IAMAllowedPrincipals. The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies, and it's used to maintain backward compatibility with AWS Glue.

To confirm Lake Formation permissions are enforced, navigate to the Lake Formation console and choose Data lake permissions in the navigation pane. Filter permissions by "Database":"entities" and remove all the permissions given to the principal IAMAllowedPrincipals.

For more details on IAMAllowedPrincipals and backward compatibility with AWS Glue, refer to Changing the default security settings for your data lake.
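The same cleanup can be sketched from the AWS CLI; for example, the following revokes the default database-level grant (repeat with a Table resource for each table that still shows IAMAllowedPrincipals grants):

aws lakeformation revoke-permissions \
--principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
--permissions ALL \
--resource '{"Database": {"Name": "entities"}}'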

Configure AWS Glue and Lake Formation grants for the IAM runtime roles

To allow our IAM runtime roles to properly interact with Lake Formation, we should provide them the lakeformation:GetDataAccess and glue:Get* grants.

Lake Formation permissions control access to Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Therefore, although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don't have the IAM permission on the glue:Get* API.

For more details about Lake Formation access control, refer to Lake Formation access control overview.

  1. Create the emr-runtime-roles-lake-formation-policy.json file with the following content:
    {
        "Version": "2012-10-17",
        "Statement": {
            "Sid": "LakeFormationManagedAccess",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:Get*",
                "glue:Create*",
                "glue:Update*"
            ],
            "Resource": "*"
        }
    }

  2. Create the related IAM policy:
    aws iam create-policy \
    --policy-name emr-runtime-roles-lake-formation-policy \
    --policy-document file://emr-runtime-roles-lake-formation-policy.json

  3. Attach this policy to both IAM runtime roles (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name test-emr-demo1 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-lake-formation-policy"
    
    aws iam attach-role-policy \
    --role-name test-emr-demo2 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-lake-formation-policy"

Set up Lake Formation permissions

We now set up the permissions in Lake Formation for the two runtime roles.

  1. Create the file users-grants-test-emr-demo1.json with the following content to grant SELECT access to all columns in the entities.users table to test-emr-demo1:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo1"
        },
        "Resource": {
            "Table": {
                "DatabaseName": "entities",
                "Name": "users"
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  2. Create the file users-grants-test-emr-demo2.json with the following content to grant SELECT access to the uid and state columns in the entities.users table to test-emr-demo2:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo2"
        },
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": "entities",
                "Name": "users",
                "ColumnNames": ["uid", "state"]
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  3. Create the file products-grants-test-emr-demo2.json with the following content to grant SELECT access to all columns in the entities.products table to test-emr-demo2:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo2"
        },
        "Resource": {
            "Table": {
                "DatabaseName": "entities",
                "Name": "products"
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  4. Let's set up our permissions in Lake Formation:
    aws lakeformation grant-permissions \
    --cli-input-json file://users-grants-test-emr-demo1.json
    
    aws lakeformation grant-permissions \
    --cli-input-json file://users-grants-test-emr-demo2.json
    
    aws lakeformation grant-permissions \
    --cli-input-json file://products-grants-test-emr-demo2.json

  5. Check the permissions we defined on the Lake Formation console on the Data lake permissions page by filtering by "Database":"entities". You can also list them from the AWS CLI, as shown in the sketch after this list.
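For example, the following lists the grants on the users table:

aws lakeformation list-permissions \
--resource '{"Table": {"DatabaseName": "entities", "Name": "users"}}'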

Test Lake Formation permissions with runtime roles

For our test, we use a PySpark application called test-lake-formation.py with the following content:


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Pyspark - TEST IAM RBAC with LF").enableHiveSupport().getOrCreate()

try:
    print("== select * from entities.users limit 3 ==\n")
    spark.sql("select * from entities.users limit 3").show()
except Exception as e:
    print(e)

try:
    print("== select * from entities.products limit 3 ==\n")
    spark.sql("select * from entities.products limit 3").show()
except Exception as e:
    print(e)

spark.stop()

In the script, we're trying to access the tables users and products. Let's upload our Spark application in the same S3 bucket that we used earlier:

#Replace this with your bucket name
BUCKET_NAME="emr-steps-roles-new-us-east-1"

aws s3 cp test-lake-formation.py s3://${BUCKET_NAME}/scripts/

We're now ready to perform our test. We run the test-lake-formation.py script first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.

Let's submit a step specifying test-emr-demo1 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

The following is the output of the EMR step with test-emr-demo1 as the runtime role:

== select * from entities.users limit 3 ==

+--------+-----+-------+-------+------+--------------------+
|     uid| name|surname|  state|  city|             address|
+--------+-----+-------+-------+------+--------------------+
|00005678| john|   pike|england|london|      Hidden Road 78|
|00009039|paolo|  rossi|  italy| milan|Via degli Alberi 56A|
|00009057| july|   finn|germany|berlin|       Green Road 90|
+--------+-----+-------+-------+------+--------------------+

== select * from entities.products limit 3 ==

Insufficient Lake Formation permission(s) on products (...)

As we can see, our application was only able to access the users table.

Submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo2

The following is the output of the EMR step with test-emr-demo2 as the runtime role:

== select * from entities.users limit 3 ==

+--------+-------+
|     uid|  state|
+--------+-------+
|00005678|england|
|00009039|  italy|
|00009057|germany|
+--------+-------+

== select * from entities.products limit 3 ==

+----------+---------------+----------+
|product_id|           name|  category|
+----------+---------------+----------+
|  P0000789|       Bike2000|     Sport|
|  P0000567|   CoverToCover|Smartphone|
|  P0005677|Whiteboard X786|      Home|
+----------+---------------+----------+

As we can see, our application was able to access a subset of columns for the users table and all the columns for the products table.

We can conclude that the permissions while accessing the Data Catalog are being enforced based on the runtime role used with the EMR step.

Audit using the source identity

The source identity is a mechanism to monitor and control actions taken with assumed roles. The Propagate source identity feature similarly allows you to monitor and control actions taken using runtime roles by the jobs submitted with EMR steps.

We already configured EMR_EC2_DefaultRole with "sts:SetSourceIdentity" on our two runtime roles. Also, both runtime roles let EMR_EC2_DefaultRole set the source identity in their trust policy. So we're ready to proceed.

We now see the Propagate source identity feature in action with a simple example.

Configure the IAM role that is assumed to submit the EMR steps

We configure the IAM role job-submitter-1, which is assumed specifying the source identity and which is used to submit the EMR steps. In this example, we allow the IAM user paul to assume this role and set the source identity. Note that you can use any IAM principal here.

  1. Create a file called trust-policy-2.json with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/paul"
                },
                "Action": "sts:AssumeRole"
            },
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/paul"
                },
                "Action": "sts:SetSourceIdentity"
            }
        ]
    }

  2. Use it as the trust policy to create the IAM role job-submitter-1:
    aws iam create-role \
    --role-name job-submitter-1 \
    --assume-role-policy-document file://trust-policy-2.json

    We now use the same emr-runtime-roles-submitter-policy policy we defined before to allow the role to submit EMR steps using the test-emr-demo1 and test-emr-demo2 runtime roles.

  3. Attach this policy to the IAM role job-submitter-1 (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name job-submitter-1 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-submitter-policy"

Test the source identity with AWS CloudTrail

To show how propagation of the source identity works with Amazon EMR, we generate a role session with the source identity test-ad-user.

With the IAM user paul (or with the IAM principal you configured), we first perform the impersonation (replace 123456789012 with your AWS account ID):

aws sts assume-role \
--role-arn arn:aws:iam::123456789012:role/job-submitter-1 \
--role-session-name demotest \
--source-identity test-ad-user

The following code is the output received:

{
"Credentials": {
    "SecretAccessKey": "<SECRET_ACCESS_KEY>",
    "SessionToken": "<SESSION_TOKEN>",
    "Expiration": "<EXPIRATION_TIME>",
    "AccessKeyId": "<ACCESS_KEY_ID>"
},
"AssumedRoleUser": {
    "AssumedRoleId": "AROAUVT2HQ3......:demotest",
    "Arn": "arn:aws:sts::123456789012:assumed-role/test-emr-role/demotest"
},
"SourceIdentity": "test-ad-user"
}

We use the temporary AWS security credentials of the role session to submit an EMR step along with the runtime role test-emr-demo1:

export AWS_ACCESS_KEY_ID="<ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<SESSION_TOKEN>" 

#Replace with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Replace with your AWS Account ID
ACCOUNT_ID=123456789012
#Replace with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

In a few minutes, we can see events appearing in the AWS CloudTrail log file. We can see all the AWS APIs that the job invoked using the runtime role. In the following snippet, we can see that the step performed the sts:AssumeRole and lakeformation:GetDataAccess actions. It's worth noting how the source identity test-ad-user has been preserved in the events.
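If you don't have a CloudTrail-based log pipeline handy, a quick way to spot these events is the lookup-events API; the source identity shows up in the userIdentity section of each event record (event delivery can lag by several minutes):

aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
--query 'Events[].CloudTrailEvent'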

Clean up

You can now delete the EMR cluster you created.

  1. On the Amazon EMR console, choose Clusters in the navigation pane.
  2. Select the cluster iam-passthrough-cluster, then choose Terminate.
  3. Choose Terminate again to confirm.

Alternatively, you can delete the cluster by using the Amazon EMR CLI with the following command (replace the EMR cluster ID with the one returned by the previously run aws emr create-cluster command):

aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG
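If you also want to remove the demo IAM resources and the security configuration created in this post, a hedged cleanup sketch follows (replace 123456789012 with your AWS account ID; detach any remaining policies before deleting a role):

for ROLE in test-emr-demo1 test-emr-demo2; do
  aws iam detach-role-policy --role-name $ROLE \
    --policy-arn arn:aws:iam::123456789012:policy/${ROLE}-policy
  aws iam detach-role-policy --role-name $ROLE \
    --policy-arn arn:aws:iam::123456789012:policy/emr-runtime-roles-lake-formation-policy
  aws iam delete-role --role-name $ROLE
done

aws emr delete-security-configuration --name iamconfig-with-iam-lf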

Conclusion

In this post, we discussed how you can control data access on Amazon EMR on EC2 clusters by using runtime roles with EMR steps. We discussed how the feature works, how you can use Lake Formation to apply fine-grained access controls, and how to monitor and control actions using a source identity. To learn more about this feature, refer to Configure runtime roles for Amazon EMR steps.


About the authors

Stefano Sandona is an Analytics Specialist Solutions Architect with AWS. He loves data, distributed systems, and security. He helps customers around the world architect their data platforms, with a strong focus on Amazon EMR and all the security aspects around it.

Sharad Kala is a senior engineer at AWS working with the EMR team. He focuses on the security aspects of the applications running on EMR. He has a keen interest in working on and learning about distributed systems.
