You can use the Amazon EMR Steps API to submit Apache Hive, Apache Spark, and other types of applications to an EMR cluster. You can invoke the Steps API using Apache Airflow, AWS Step Functions, the AWS Command Line Interface (AWS CLI), all the AWS SDKs, and the AWS Management Console. Jobs submitted with the Steps API use the Amazon Elastic Compute Cloud (Amazon EC2) instance profile to access AWS resources such as Amazon Simple Storage Service (Amazon S3) buckets, AWS Glue tables, and Amazon DynamoDB tables from the cluster.
Previously, if a step needed access to a specific S3 bucket and another step needed access to a specific DynamoDB table, the AWS Identity and Access Management (IAM) policy attached to the instance profile had to allow access to both the S3 bucket and the DynamoDB table. This meant that the IAM policies you assigned to the instance profile had to contain a union of all the permissions for every step that ran on an EMR cluster.
We’re happy to introduce runtime roles for EMR steps. A runtime role is an IAM role that you associate with an EMR step, and jobs use this role to access AWS resources. With runtime roles for EMR steps, you can now specify different IAM roles for the Spark and the Hive jobs, thereby scoping down access at a job level. This allows you to simplify access controls on a single EMR cluster that is shared between multiple tenants, wherein each tenant can be easily isolated using IAM roles.
The ability to specify an IAM role with a job is also available on Amazon EMR on EKS and Amazon EMR Serverless. You can also use AWS Lake Formation to apply table- and column-level permissions for Apache Hive and Apache Spark jobs that are submitted with EMR steps. For more information, refer to Configure runtime roles for Amazon EMR steps.
In this post, we dive deeper into runtime roles for EMR steps, helping you understand how the various pieces work together, and how each step is isolated on an EMR cluster.
Solution overview
In this post, we walk through the following:
- Create an EMR cluster enabled to use the new role-based access control with EMR steps.
- Create two IAM roles with different permissions in terms of the Amazon S3 data and Lake Formation tables they can access.
- Allow the IAM principal submitting the EMR steps to use these two IAM roles.
- See how EMR steps running with the same code and trying to access the same data have different permissions based on the runtime role specified at submission time.
- See how to monitor and control actions using source identity propagation.
Set up EMR cluster security configuration
Amazon EMR security configurations simplify applying consistent security, authorization, and authentication options across your clusters. You can create a security configuration on the Amazon EMR console or via the AWS CLI or AWS SDK. When you attach a security configuration to a cluster, Amazon EMR applies the settings in the security configuration to the cluster. You can attach a security configuration to multiple clusters at creation time, but you can’t apply one to a running cluster.
To enable runtime roles for EMR steps, we have to create a security configuration as shown in the following code and enable the runtime roles property (configured via EnableApplicationScopedIAMRole). In addition to the runtime roles, we’re enabling propagation of the source identity (configured via PropagateSourceIdentity) and support for Lake Formation (configured via LakeFormationConfiguration). The source identity is a mechanism to monitor and control actions taken with assumed roles. Enabling Propagate source identity allows you to audit actions performed using the runtime role. Lake Formation is an AWS service to securely manage a data lake, which includes defining and enforcing central access control policies for your data lake.
Create a file called step-runtime-roles-sec-cfg.json with the following content:
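As a minimal sketch of what such a file might contain, the following Python snippet builds the configuration as a dictionary and prints it as JSON. The three property names and the "Amazon EMR" tag value come from this post; the nesting around them and the helper name render are assumptions, so check the result against the Amazon EMR documentation before using it.

```python
import json

# Assumed nesting of the security configuration. The property names
# EnableApplicationScopedIAMRole, PropagateSourceIdentity, and
# LakeFormationConfiguration, and the "Amazon EMR" tag value, come from
# this post; the surrounding structure is an assumption.
SECURITY_CONFIGURATION = {
    "AuthorizationConfiguration": {
        "IAMConfiguration": {
            "EnableApplicationScopedIAMRole": True,
            "ApplicationScopedIAMRoleConfiguration": {
                "PropagateSourceIdentity": True
            },
        },
        "LakeFormationConfiguration": {
            "AuthorizedSessionTagValue": "Amazon EMR"
        },
    }
}


def render(document):
    """Serialize a configuration document as indented JSON text."""
    return json.dumps(document, indent=4)


if __name__ == "__main__":
    # Redirect this output to step-runtime-roles-sec-cfg.json if desired.
    print(render(SECURITY_CONFIGURATION))
```

Generating the file from code keeps the tag value in one place, which matters because the same value has to match the runtime roles’ trust policy later on.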
Create the Amazon EMR security configuration:
You can also do the same via the console:
- On the Amazon EMR console, choose Security configurations in the navigation pane.
- Choose Create.
- For Security configuration name, enter a name.
- For Security configuration setup options, select Choose custom settings.
- For IAM role for applications, select Runtime role.
- Select Propagate source identity to audit actions performed using the runtime role.
- For Fine-grained access control, select AWS Lake Formation.
- Complete the security configuration.
The security configuration appears in your security configuration list. You can also see that the authorization mechanism listed here is the runtime role instead of the instance profile.
Launch the cluster
Now we launch an EMR cluster and specify the security configuration we created. For more information, refer to Specify a security configuration for a cluster.
The following code provides the AWS CLI command for launching an EMR cluster with the appropriate security configuration. Note that this cluster is launched on the default VPC and public subnet with the default IAM roles. In addition, the cluster is launched with one primary and one core instance of the specified instance type. For more details on how to customize the launch parameters, refer to create-cluster.
If the default EMR roles EMR_EC2_DefaultRole and EMR_DefaultRole don’t exist in IAM in your account (for example, because this is the first time you’re launching an EMR cluster with them), use the following command to create them before launching the cluster:
Create the cluster with the following code:
When the cluster is fully provisioned (Waiting state), let’s try to run a step on it with runtime roles for EMR steps enabled:
After launching the command, we receive the following as output:
The step failed, asking us to provide a runtime role. In the next section, we set up two IAM roles with different permissions and use them as the runtime roles for EMR steps.
Set up IAM roles as runtime roles
Any IAM role that you want to use as a runtime role for EMR steps must have a trust policy that allows the EMR cluster’s EC2 instance profile to assume it. In our setup, we’re using the default IAM role EMR_EC2_DefaultRole as the instance profile role. In addition, we create two IAM roles called test-emr-demo1 and test-emr-demo2 that we use as runtime roles for EMR steps.
The following code is the trust policy for both of the IAM roles, which lets the EMR cluster’s EC2 instance profile role, EMR_EC2_DefaultRole, assume these roles and set the source identity and LakeFormationAuthorizedCaller tag on the role sessions. The TagSession permission is needed so that Amazon EMR can authorize to Lake Formation. The SetSourceIdentity statement is needed for the propagate source identity feature.
Create a file called trust-policy.json with the following content (replace 123456789012 with your AWS account ID):
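The trust policy described above can be sketched in Python as follows. The three actions, the LakeFormationAuthorizedCaller tag condition, and the role name come from this post; the function name build_trust_policy, the Sid values, and splitting the actions into three statements are assumptions made for illustration.

```python
import json


def build_trust_policy(account_id, instance_profile_role="EMR_EC2_DefaultRole"):
    """Build a trust policy letting the instance profile role use a runtime role."""
    principal = {
        "AWS": f"arn:aws:iam::{account_id}:role/{instance_profile_role}"
    }
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Sid values are hypothetical labels.
                "Sid": "AllowAssumeRole",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:AssumeRole",
            },
            {
                "Sid": "AllowSetSourceIdentity",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:SetSourceIdentity",
            },
            {
                # Only sessions tagged for Lake Formation by Amazon EMR are allowed.
                "Sid": "AllowTagSession",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:TagSession",
                "Condition": {
                    "StringEquals": {
                        "aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_trust_policy("123456789012"), indent=4))
```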
Use that policy to create the two IAM roles, test-emr-demo1 and test-emr-demo2:
Set up permissions for the principal submitting the EMR steps with runtime roles
The IAM principal submitting the EMR steps needs to have permissions to invoke the AddJobFlowSteps API. In addition, you can use the condition key elasticmapreduce:ExecutionRoleArn to control access to specific IAM roles. For example, the following policy allows the IAM principal to use only the IAM roles test-emr-demo1 and test-emr-demo2 as the runtime roles for EMR steps.
- Create the job-submitter-policy.json file with the following content (replace 123456789012 with your AWS account ID):
- Create the IAM policy with the following code:
- Assign this policy to the IAM principal (IAM user or IAM role) you’re going to use to submit the EMR steps (replace 123456789012 with your AWS account ID and replace john with the IAM user you use to submit your EMR steps):
IAM user john can now submit steps using arn:aws:iam::123456789012:role/test-emr-demo1 and arn:aws:iam::123456789012:role/test-emr-demo2 as the step runtime roles.
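To make the role restriction concrete, here is a minimal sketch of a submitter policy built in Python. The AddJobFlowSteps action and the elasticmapreduce:ExecutionRoleArn condition key come from this section; the function name build_submitter_policy, the Sid, and the wildcard Resource are assumptions, and a real policy may need further statements (for example, to describe clusters and steps).

```python
import json


def build_submitter_policy(account_id, runtime_role_names):
    """Allow AddJobFlowSteps only with the listed runtime roles."""
    allowed_role_arns = [
        f"arn:aws:iam::{account_id}:role/{name}" for name in runtime_role_names
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AddStepsWithSpecificRuntimeRoles",  # hypothetical label
                "Effect": "Allow",
                "Action": ["elasticmapreduce:AddJobFlowSteps"],
                "Resource": "*",  # assumption: scope this to your cluster ARN
                "Condition": {
                    "StringEquals": {
                        "elasticmapreduce:ExecutionRoleArn": allowed_role_arns
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_submitter_policy(
        "123456789012", ["test-emr-demo1", "test-emr-demo2"]
    )
    print(json.dumps(policy, indent=4))
```

With a condition like this, a step submitted with any other runtime role ARN is denied at submission time, before anything runs on the cluster.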
Use runtime roles with EMR steps
We now prepare our setup to show runtime roles for EMR steps in action.
Set up Amazon S3
To set up your Amazon S3 data, complete the following steps:
- Create a CSV file called test.csv with the following content:
- Upload the file to Amazon S3 in three different locations:
For our initial test, we use a PySpark application called test.py with the following contents:
In the script, we’re trying to access the CSV file present under three different prefixes in the test bucket.
- Upload the Spark application to the same S3 bucket where we placed the test.csv file, but in a different location:
Set up runtime role permissions
To show how runtime roles for EMR steps work, we assign the roles we created different IAM permissions to access Amazon S3. The following table summarizes the grants we provide to each role (emr-steps-roles-new-us-east-1 is the bucket you configured in the previous section).
S3 location | test-emr-demo1 | test-emr-demo2 |
s3://emr-steps-roles-new-us-east-1/* | No Access | No Access |
s3://emr-steps-roles-new-us-east-1/demo1/* | Full Access | No Access |
s3://emr-steps-roles-new-us-east-1/demo2/* | No Access | Full Access |
s3://emr-steps-roles-new-us-east-1/scripts/* | Read Access | Read Access |
- Create the file demo1-policy.json with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
- Create the file demo2-policy.json with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
- Create our IAM policies:
- Assign to each role the related policy (replace 123456789012 with your AWS account ID):
To use runtime roles with Amazon EMR steps, we need to add the following policy to our EMR cluster’s EC2 instance profile (in this example, EMR_EC2_DefaultRole). With this policy, the underlying EC2 instances for the EMR cluster can assume the runtime role and apply a tag to that runtime role.
- Create the file runtime-roles-policy.json with the following content (replace 123456789012 with your AWS account ID):
- Create the IAM policy:
- Assign the created policy to the EMR cluster’s EC2 instance profile, in this example EMR_EC2_DefaultRole:
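As a minimal sketch of the instance profile policy described in this section, the following Python snippet grants the instance profile role the STS actions it needs on the two runtime roles. The action list mirrors the trust policy set up earlier; the function name build_instance_profile_policy and the Sid are assumptions.

```python
import json


def build_instance_profile_policy(account_id, runtime_role_names):
    """Let the EC2 instance profile assume and tag the given runtime roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowRuntimeRoleUsage",  # hypothetical label
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole",
                    "sts:TagSession",
                    "sts:SetSourceIdentity",
                ],
                "Resource": [
                    f"arn:aws:iam::{account_id}:role/{name}"
                    for name in runtime_role_names
                ],
            }
        ],
    }


if __name__ == "__main__":
    policy = build_instance_profile_policy(
        "123456789012", ["test-emr-demo1", "test-emr-demo2"]
    )
    print(json.dumps(policy, indent=4))
```

Both sides are needed: this identity policy allows the instance profile to request the runtime role, and the runtime role’s trust policy allows that request to succeed.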
Test permissions with runtime roles
We’re now ready to perform our first test. We run the test.py script, previously uploaded to Amazon S3, two times as Spark steps: first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.
To run an EMR step specifying a runtime role, you need the latest version of the AWS CLI. For more details about updating the AWS CLI, refer to Installing or updating the latest version of the AWS CLI.
Let’s submit a step specifying test-emr-demo1 as the runtime role:
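This post submits steps with the AWS CLI. As an alternative sketch using the AWS SDK for Python, the snippet below builds an AddJobFlowSteps request that carries the runtime role in ExecutionRoleArn. The cluster ID, step name, and helper names are hypothetical placeholders, and submit_step assumes boto3 is installed and credentials are configured.

```python
def build_step_request(cluster_id, script_uri, runtime_role_arn):
    """Build keyword arguments for the EMR AddJobFlowSteps API call."""
    return {
        "JobFlowId": cluster_id,
        "ExecutionRoleArn": runtime_role_arn,  # the runtime role for this step
        "Steps": [
            {
                "Name": "Spark example with runtime role",  # hypothetical name
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
    }


def submit_step(request):
    """Submit the step; assumes boto3 is installed and credentials are set."""
    import boto3

    response = boto3.client("emr").add_job_flow_steps(**request)
    return response["StepIds"][0]


if __name__ == "__main__":
    request = build_step_request(
        "j-XXXXXXXXXXXXX",  # placeholder cluster ID
        "s3://emr-steps-roles-new-us-east-1/scripts/test.py",
        "arn:aws:iam::123456789012:role/test-emr-demo1",
    )
    print(request)
```

Keeping the request builder separate from the call makes it easy to submit the same script twice with only the runtime role changed, which is what the tests in this section do.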
This command returns an EMR step ID. To check our step output logs, we can proceed in two different ways:
- From the Amazon EMR console – On the Steps tab, choose the View logs link related to the specific step ID and select stdout.
- From Amazon S3 – While launching our cluster, we configured an S3 location for logging. We can find our step logs under $(LOG_URI)/steps/<stepID>/stdout.gz.
The logs might take a couple of minutes to populate after the step is marked as Completed.
The following is the output of the EMR step with test-emr-demo1 as the runtime role:
As we can see, only the demo1 folder was accessible by our application.
Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0017 was launched with the user 6GC64F33KUW4Q2JY6LKR7UAHWETKKXYL. We can confirm this by connecting to the EMR primary instance using SSH and using the YARN CLI:
Please note that in your case, the YARN application ID and the user will be different.
Now we submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:
The following is the output of the EMR step with test-emr-demo2 as the runtime role:
As we can see, only the demo2 folder was accessible by our application.
Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0018 was launched with a different user 7T2ORHE6Z4Q7PHLN725C2CVWILZWYOLE. We can confirm this by using the YARN CLI:
Each step was able to access only the CSV file that was allowed by the runtime role, so the first step was able to access only s3://emr-steps-roles-new-us-east-1/demo1/test.csv and the second step was able to access only s3://emr-steps-roles-new-us-east-1/demo2/test.csv. In addition, we observed that Amazon EMR created a unique user for each step and used that user to run the jobs. Please note that both roles need at least read access to the S3 location where the step scripts are located (for example, s3://emr-steps-roles-demo-bucket/scripts/test.py).
Now that we’ve seen how runtime roles for EMR steps work, let’s look at how we can use Lake Formation to apply fine-grained access controls with EMR steps.
Use Lake Formation-based access control with EMR steps
You can use Lake Formation to apply table- and column-level permissions with Apache Spark and Apache Hive jobs submitted as EMR steps. First, the data lake admin in Lake Formation needs to register Amazon EMR as the AuthorizedSessionTagValue to enforce Lake Formation permissions on EMR. Lake Formation uses this session tag to authorize callers and provide access to the data lake. The Amazon EMR value is referenced inside the step-runtime-roles-sec-cfg.json file we used earlier when we created the EMR security configuration, and inside the trust-policy.json file we used to create the two runtime roles test-emr-demo1 and test-emr-demo2.
We can do so on the Lake Formation console in the External data filtering section (replace 123456789012 with your AWS account ID).
On the IAM runtime roles’ trust policy, we already have the sts:TagSession permission with the condition "aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR". So we’re ready to proceed.
To demonstrate how Lake Formation works with EMR steps, we create one database named entities with two tables named users and products, and we assign in Lake Formation the grants summarized in the following table.
IAM role | entities.users (table) | entities.products (table) |
test-emr-demo1 | Full Read Access | No Access |
test-emr-demo2 | Read Access on Columns: uid, state | Full Read Access |
Prepare Amazon S3 files
We first prepare our Amazon S3 files.
- Create the users.csv file with the following content:
- Create the products.csv file with the following content:
- Upload these files to Amazon S3 in two different locations:
Prepare the database and tables
We can create our entities database by using the AWS Glue APIs.
- Create the entities-db.json file with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
- With a Lake Formation admin user, run the following command to create our database:
We also use the AWS Glue APIs to create the tables users and products.
- Create the users-table.json file with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
- Create the products-table.json file with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
- With a Lake Formation admin user, create our tables with the following commands:
Set up the Lake Formation data lake locations
To access our tables’ data in Amazon S3, Lake Formation needs read/write access to it. To achieve that, we have to register the Amazon S3 locations where our data resides and specify which IAM role to obtain credentials from for each of them.
Let’s create our IAM role for the data access.
- Create a file called trust-policy-data-access-role.json with the following content:
- Use the policy to create the IAM role emr-demo-lf-data-access-role:
- Create the file data-access-role-policy.json with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
- Create our IAM policy:
- Assign the created policy to our emr-demo-lf-data-access-role (replace 123456789012 with your AWS account ID):
We can now register our data location in Lake Formation.
- On the Lake Formation console, choose Data lake locations in the navigation pane.
- Here we can register our S3 location containing data for our two tables and choose the created emr-demo-lf-data-access-role IAM role, which has read/write access to that location.
For more details about adding an Amazon S3 location to your data lake and configuring your IAM data access roles, refer to Adding an Amazon S3 location to your data lake.
Enforce Lake Formation permissions
To be sure we’re using Lake Formation permissions, we should confirm that we don’t have any grants set up for the principal IAMAllowedPrincipals. The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies, and it’s used to maintain backward compatibility with AWS Glue.
To confirm Lake Formation permissions are enforced, navigate to the Lake Formation console and choose Data lake permissions in the navigation pane. Filter permissions by “Database”:“entities” and remove all the permissions given to the principal IAMAllowedPrincipals.
For more details on IAMAllowedPrincipals and backward compatibility with AWS Glue, refer to Changing the default security settings for your data lake.
Configure AWS Glue and Lake Formation grants for IAM runtime roles
To allow our IAM runtime roles to properly interact with Lake Formation, we should provide them the lakeformation:GetDataAccess and glue:Get* grants.
Lake Formation permissions control access to Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Therefore, although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don’t have the IAM permission on the glue:Get* API.
For more details about Lake Formation access control, refer to Lake Formation access control overview.
- Create the emr-runtime-roles-lake-formation-policy.json file with the following content:
- Create the related IAM policy:
- Assign this policy to both IAM runtime roles (replace 123456789012 with your AWS account ID):
Set up Lake Formation permissions
We now set up the permissions in Lake Formation for the two runtime roles.
- Create the file users-grants-test-emr-demo1.json with the following content to grant SELECT access to all columns in the entities.users table to test-emr-demo1:
- Create the file users-grants-test-emr-demo2.json with the following content to grant SELECT access to the uid and state columns in the entities.users table to test-emr-demo2:
- Create the file products-grants-test-emr-demo2.json with the following content to grant SELECT access to all columns in the entities.products table to test-emr-demo2:
- Let’s set up our permissions in Lake Formation:
- Check the permissions we defined on the Lake Formation console on the Data lake permissions page by filtering by “Database”:“entities”.
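The three grants listed above can be sketched as request payloads for the Lake Formation GrantPermissions API. The database, table, column, and role names come from this post; the function name build_select_grant is hypothetical, and the assumption is that each grant file takes the shape of a GrantPermissions request.

```python
import json


def build_select_grant(account_id, role_name, database, table, columns=None):
    """Build a GrantPermissions request for SELECT on a table.

    With columns=None the grant covers every column (ColumnWildcard);
    otherwise it is limited to the listed column names.
    """
    table_with_columns = {"DatabaseName": database, "Name": table}
    if columns is None:
        table_with_columns["ColumnWildcard"] = {}
    else:
        table_with_columns["ColumnNames"] = list(columns)
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"
        },
        "Resource": {"TableWithColumns": table_with_columns},
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    grants = {
        "users-grants-test-emr-demo1.json": build_select_grant(
            "123456789012", "test-emr-demo1", "entities", "users"
        ),
        "users-grants-test-emr-demo2.json": build_select_grant(
            "123456789012", "test-emr-demo2", "entities", "users", ["uid", "state"]
        ),
        "products-grants-test-emr-demo2.json": build_select_grant(
            "123456789012", "test-emr-demo2", "entities", "products"
        ),
    }
    for file_name, grant in grants.items():
        print(file_name)
        print(json.dumps(grant, indent=4))
```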
Test Lake Formation permissions with runtime roles
For our test, we use a PySpark application called test-lake-formation.py with the following content:
In the script, we’re trying to access the tables users and products. Let’s upload our Spark application to the same S3 bucket that we used earlier:
We’re now ready to perform our test. We run the test-lake-formation.py script first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.
Let’s submit a step specifying test-emr-demo1 as the runtime role:
The following is the output of the EMR step with test-emr-demo1 as the runtime role:
As we can see, our application was only able to access the users table.
Submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:
The following is the output of the EMR step with test-emr-demo2 as the runtime role:
As we can see, our application was able to access a subset of columns for the users table and all the columns for the products table.
We can conclude that the permissions while accessing the Data Catalog are being enforced based on the runtime role used with the EMR step.
Audit using the source identity
The source identity is a mechanism to monitor and control actions taken with assumed roles. The Propagate source identity feature similarly allows you to monitor and control actions taken using runtime roles by the jobs submitted with EMR steps.
We already configured EMR_EC2_DefaultRole with "sts:SetSourceIdentity" on our two runtime roles. Also, both runtime roles let EMR_EC2_DefaultRole call SetSourceIdentity in their trust policy. So we’re ready to proceed.
We now see the Propagate source identity feature in action with a simple example.
Configure the IAM role that is assumed to submit the EMR steps
We configure the IAM role job-submitter-1, which is assumed specifying the source identity and which is used to submit the EMR steps. In this example, we allow the IAM user paul to assume this role and set the source identity. Please note you can use any IAM principal here.
- Create a file called trust-policy-2.json with the following content (replace 123456789012 with your AWS account ID):
- Use it as the trust policy to create the IAM role job-submitter-1:
We now use the same emr-runtime-roles-submitter-policy policy we defined before to allow the role to submit EMR steps using the test-emr-demo1 and test-emr-demo2 runtime roles.
- Assign this policy to the IAM role job-submitter-1 (replace 123456789012 with your AWS account ID):
Test the source identity with AWS CloudTrail
To show how propagation of source identity works with Amazon EMR, we generate a role session with the source identity test-ad-user.
With the IAM user paul (or with the IAM principal you configured), we first perform the impersonation (replace 123456789012 with your AWS account ID):
The following code is the output received:
We use the temporary AWS security credentials of the role session to submit an EMR step along with the runtime role test-emr-demo1:
In a few minutes, we can see events appearing in the AWS CloudTrail log file. We can see all the AWS APIs that the jobs invoked using the runtime role. In the following snippet, we can see that the step performed the sts:AssumeRole and lakeformation:GetDataAccess actions. It’s worth noting how the source identity test-ad-user has been preserved in the events.
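When auditing a larger CloudTrail log, it can help to filter events by source identity programmatically. The following is a rough sketch under the assumption that CloudTrail records carry the source identity either in userIdentity.sessionContext.sourceIdentity or, for AssumeRole calls, in requestParameters.sourceIdentity. The sample records are hypothetical, heavily trimmed stand-ins and not real log output.

```python
def source_identity_of(record):
    """Return the source identity found in a CloudTrail record, if any."""
    session_context = (record.get("userIdentity") or {}).get("sessionContext") or {}
    request_parameters = record.get("requestParameters") or {}
    return session_context.get("sourceIdentity") or request_parameters.get(
        "sourceIdentity"
    )


def events_for_source_identity(records, source_identity):
    """List (eventSource, eventName) pairs performed under a source identity."""
    return [
        (record.get("eventSource"), record.get("eventName"))
        for record in records
        if source_identity_of(record) == source_identity
    ]


# Hypothetical, trimmed-down records for illustration only.
SAMPLE_RECORDS = [
    {
        "eventSource": "sts.amazonaws.com",
        "eventName": "AssumeRole",
        "requestParameters": {"sourceIdentity": "test-ad-user"},
    },
    {
        "eventSource": "lakeformation.amazonaws.com",
        "eventName": "GetDataAccess",
        "userIdentity": {"sessionContext": {"sourceIdentity": "test-ad-user"}},
        "requestParameters": None,
    },
    {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "userIdentity": {"sessionContext": {}},
    },
]

if __name__ == "__main__":
    for event in events_for_source_identity(SAMPLE_RECORDS, "test-ad-user"):
        print(event)
```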
Clean up
You can now delete the EMR cluster you created.
- On the Amazon EMR console, choose Clusters in the navigation pane.
- Select the cluster iam-passthrough-cluster, then choose Terminate.
- Choose Terminate again to confirm.
Alternatively, you can delete the cluster by using the AWS CLI with the following command (replace the EMR cluster ID with the one returned by the previously run aws emr create-cluster command):
Conclusion
In this post, we discussed how you can control data access on Amazon EMR on EC2 clusters by using runtime roles with EMR steps. We discussed how the feature works, how you can use Lake Formation to apply fine-grained access controls, and how to monitor and control actions using a source identity. To learn more about this feature, refer to Configure runtime roles for Amazon EMR steps.
About the authors
Stefano Sandona is an Analytics Specialist Solution Architect with AWS. He loves data, distributed systems, and security. He helps customers around the world architect their data platforms. He has a strong focus on Amazon EMR and all the security aspects around it.
Sharad Kala is a senior engineer at AWS working with the EMR team. He focuses on the security aspects of the applications running on EMR. He has a keen interest in working and learning about distributed systems.