Run Spark jobs using the ACK EMR on EKS controller
Run Spark jobs using ACK service controller for EMR on EKS.
Using the ACK service controller for EMR on EKS, customers can define and run EMR jobs directly from their Kubernetes clusters. EMR on EKS manages the lifecycle of these jobs, and it is 3.5 times faster than open-source Spark because it uses the highly optimized EMR runtime.
To get started, you can download the EMR on EKS controller image from Amazon ECR and run Spark jobs in minutes. The ACK service controller for EMR on EKS is generally available. To learn more about EMR on EKS, visit our documentation.
Installation steps
Here are the steps involved in installing the EMR on EKS controller.
Prereqs
Install these tools before proceeding:
- AWS CLI
- kubectl - the Kubernetes CLI
- eksctl - the CLI for Amazon EKS
- yq - a YAML processor
- helm - the package manager for Kubernetes
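To confirm the tools are installed and on your PATH, you can print their versions (exact output will vary depending on the versions you have installed):
aws --version
kubectl version --client
eksctl version
yq --version
helm version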
Configure the AWS CLI with sufficient permissions to create an EKS cluster. Please see the documentation for further guidance.
Install EKS cluster
You can either create a new EKS cluster or reuse an existing one. The steps below create a new EKS cluster. Let's export the environment variables needed for the EMR on EKS cluster setup. Copy and paste the commands into your terminal for faster provisioning.
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export EKS_CLUSTER_NAME="ack-emr-eks"
export AWS_REGION="us-west-2"
We'll use eksctl to create the EKS cluster.
eksctl create cluster -f - << EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${EKS_CLUSTER_NAME}
  region: ${AWS_REGION}
  version: "1.23"
managedNodeGroups:
  - instanceType: m5.xlarge
    name: ${EKS_CLUSTER_NAME}-ng
    desiredCapacity: 2
iam:
  withOIDC: true
EOF
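Once eksctl finishes, you can confirm the cluster is reachable and that the two managed nodes are Ready (eksctl updates your kubeconfig automatically; the update-kubeconfig call is only needed if you are working from a different terminal):
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $AWS_REGION
kubectl get nodes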
Create IAM Identity mapping
We need to create an emr-containers identity mapping in the EKS cluster so that the EMR service has the RBAC permissions it needs to run and manage Spark jobs.
export EMR_NAMESPACE=emr-ns
echo "creating namespace for $SERVICE"
kubectl create ns $EMR_NAMESPACE
echo "creating iamidentitymapping using eksctl"
eksctl create iamidentitymapping \
--cluster $EKS_CLUSTER_NAME \
--namespace $EMR_NAMESPACE \
--service-name "emr-containers"
Expected outcome
2022-08-26 09:07:42 [ℹ] created "emr-ns:Role.rbac.authorization.k8s.io/emr-containers"
2022-08-26 09:07:42 [ℹ] created "emr-ns:RoleBinding.rbac.authorization.k8s.io/emr-containers"
2022-08-26 09:07:42 [ℹ] adding identity "arn:aws:iam::012345678910:role/AWSServiceRoleForAmazonEMRContainers" to auth ConfigMap
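You can verify that the identity mapping was created with eksctl:
eksctl get iamidentitymapping --cluster $EKS_CLUSTER_NAME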
Install emrcontainers-controller in your EKS cluster
Now we can go ahead and install the EMR on EKS controller. First, let's export the environment variables needed for the setup.
export SERVICE=emrcontainers
export RELEASE_VERSION=$(curl -sL https://api.github.com/repos/aws-controllers-k8s/${SERVICE}-controller/releases/latest | jq -r '.tag_name | ltrimstr("v")')
export ACK_SYSTEM_NAMESPACE=ack-system
We can use Helm for the installation.
echo "installing ack-$SERVICE-controller"
aws ecr-public get-login-password --region us-east-1 | helm registry login --username AWS --password-stdin public.ecr.aws
helm install --create-namespace -n $ACK_SYSTEM_NAMESPACE ack-$SERVICE-controller \
oci://public.ecr.aws/aws-controllers-k8s/$SERVICE-chart --version=$RELEASE_VERSION --set=aws.region=$AWS_REGION
Expected outcome
NAME: ack-emrcontainers-controller
LAST DEPLOYED: Fri Aug 26 09:05:08 2022
NAMESPACE: ack-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
emrcontainers-chart has been installed.
This chart deploys "public.ecr.aws/aws-controllers-k8s/emrcontainers-controller:0.0.6".
Check its status by running:
kubectl --namespace ack-system get pods -l "app.kubernetes.io/instance=ack-emrcontainers-controller"
You are now able to create Amazon EMR on EKS (EMRContainers) resources!
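As a quick sanity check, you can confirm that the controller pod is running and that the EMR containers custom resource definitions were installed (the CRD names follow the ACK naming convention and may vary slightly between releases):
kubectl --namespace $ACK_SYSTEM_NAMESPACE get pods -l "app.kubernetes.io/instance=ack-$SERVICE-controller"
kubectl get crd | grep emrcontainers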
Configure IRSA for the EMR on EKS controller
Once the controller is deployed, you need to set up IAM permissions for the controller so that it can create and manage resources using the EMR, S3, and other APIs. We will use IAM Roles for Service Accounts (IRSA) to secure this IAM role so that only the EMR on EKS controller service account can assume the permissions assigned to it.
Please follow how to configure IAM permissions for the IRSA setup. Make sure to change the value of SERVICE to emrcontainers.
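If you prefer eksctl over the manual steps in that guide, the following is a condensed sketch. It assumes the chart's default service account name (ack-emrcontainers-controller) and a controller IAM policy that you have already created; the policy ARN below is a placeholder, so substitute the one recommended in the guide.
# placeholder policy ARN - replace with the controller policy from the IRSA guide
export ACK_CONTROLLER_POLICY_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:policy/<ack-emrcontainers-controller-policy>"
eksctl create iamserviceaccount \
  --cluster $EKS_CLUSTER_NAME \
  --namespace $ACK_SYSTEM_NAMESPACE \
  --name ack-emrcontainers-controller \
  --attach-policy-arn $ACK_CONTROLLER_POLICY_ARN \
  --override-existing-serviceaccounts \
  --approve
# restart the controller so it picks up the IRSA credentials
kubectl -n $ACK_SYSTEM_NAMESPACE rollout restart deploy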
After completing all the steps, please validate the annotation on the service account before proceeding.
# validate annotation
kubectl get pods -n $ACK_SYSTEM_NAMESPACE
CONTROLLER_POD_NAME=$(kubectl get pods -n $ACK_SYSTEM_NAMESPACE --selector=app.kubernetes.io/name=emrcontainers-chart -o jsonpath='{.items..metadata.name}')
kubectl describe pod -n $ACK_SYSTEM_NAMESPACE $CONTROLLER_POD_NAME | grep "^\s*AWS_"
Expected outcome
AWS_REGION: us-west-2
AWS_ENDPOINT_URL:
AWS_ROLE_ARN: arn:aws:iam::012345678910:role/ack-emrcontainers-controller
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Create EMR VirtualCluster
We can now create an EMR virtual cluster. An EMR virtual cluster is mapped to a Kubernetes namespace. EMR uses virtual clusters to run jobs and host endpoints.
cat << EOF > virtualcluster.yaml
---
apiVersion: emrcontainers.services.k8s.aws/v1alpha1
kind: VirtualCluster
metadata:
  name: my-ack-vc
spec:
  name: my-ack-vc
  containerProvider:
    id: $EKS_CLUSTER_NAME
    type_: EKS
    info:
      eksInfo:
        namespace: emr-ns
EOF
Let's create the virtual cluster.
envsubst < virtualcluster.yaml | kubectl apply -f -
kubectl describe virtualclusters
Expected outcome
Name:         my-ack-vc
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  emrcontainers.services.k8s.aws/v1alpha1
Kind:         VirtualCluster
...
Status:
  Ack Resource Metadata:
    Arn:               arn:aws:emr-containers:us-west-2:012345678910:/virtualclusters/dxnqujbxexzri28ph1wspbxo0
    Owner Account ID:  012345678910
    Region:            us-west-2
  Conditions:
    Last Transition Time:  2022-08-26T17:21:26Z
    Message:               Resource synced successfully
    Reason:
    Status:                True
    Type:                  ACK.ResourceSynced
  Id:                      dxnqujbxexzri28ph1wspbxo0
Events:  <none>
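You can also confirm from the AWS side that the virtual cluster is in the RUNNING state:
aws emr-containers list-virtual-clusters \
  --query 'virtualClusters[?state==`RUNNING`].[id,name]' --output table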
Create Job Execution Role
In order to run a sample Spark job, we need to create an EMR job execution role. This role will have the IAM permissions, such as S3 and CloudWatch Logs access, required to run your job. We will use IRSA to secure this job role.
ACK_JOB_EXECUTION_ROLE="ack-${SERVICE}-jobexecution-role"
ACK_JOB_EXECUTION_IAM_ROLE_DESCRIPTION="IRSA role for ACK ${SERVICE} Job Execution"
cat <<EOF > job_trust.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
aws iam create-role --role-name "${ACK_JOB_EXECUTION_ROLE}" \
--assume-role-policy-document file://job_trust.json \
--description "${ACK_JOB_EXECUTION_IAM_ROLE_DESCRIPTION}"
export ACK_JOB_EXECUTION_ROLE_ARN=$(aws iam get-role --role-name=$ACK_JOB_EXECUTION_ROLE --query Role.Arn --output text)
cat <<EOF > job_policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject*"
      ],
      "Resource": [
        "arn:aws:s3:::tripdata",
        "arn:aws:s3:::tripdata/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:*:*:*"
      ]
    }
  ]
}
EOF
echo "Creating ACK-${SERVICE}-JobExecution-POLICY"
aws iam create-policy \
--policy-name ack-${SERVICE}-jobexecution-policy \
--policy-document file://job_policy.json
echo -n "Attaching IAM policy ..."
aws iam attach-role-policy \
--role-name "${ACK_JOB_EXECUTION_ROLE}" \
--policy-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/ack-${SERVICE}-jobexecution-policy"
aws emr-containers update-role-trust-policy \
--cluster-name ${EKS_CLUSTER_NAME} \
--namespace ${EMR_NAMESPACE} \
--role-name ${ACK_JOB_EXECUTION_ROLE}
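The update-role-trust-policy command updates the role's trust policy so that EMR on EKS jobs running in the emr-ns namespace can assume it through the cluster's OIDC provider. Optionally, you can inspect the result:
aws iam get-role --role-name $ACK_JOB_EXECUTION_ROLE \
  --query Role.AssumeRolePolicyDocument --output json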
Run a Sample Spark Job
Before running a sample job, let's create a CloudWatch Logs log group and an S3 bucket to store the EMR on EKS logs.
export RANDOM_ID1=$(LC_ALL=C tr -dc a-z0-9 </dev/urandom | head -c 8)
aws logs create-log-group --log-group-name=/emr-on-eks-logs/$EKS_CLUSTER_NAME
aws s3 mb s3://$EKS_CLUSTER_NAME-$RANDOM_ID1
Now let's submit a sample Spark job.
echo "checking if VirtualCluster Status is "True""
VC=$(kubectl get virtualcluster -o jsonpath='{.items..metadata.name}')
kubectl describe virtualcluster/$VC | yq e '.Status.Conditions.Status'
export RANDOM_ID2=$(LC_ALL=C tr -dc a-z0-9 </dev/urandom | head -c 8)
cat << EOF > jobrun.yaml
---
apiVersion: emrcontainers.services.k8s.aws/v1alpha1
kind: JobRun
metadata:
  name: my-ack-jobrun-${RANDOM_ID2}
spec:
  name: my-ack-jobrun-${RANDOM_ID2}
  virtualClusterRef:
    from:
      name: my-ack-vc
  executionRoleARN: "${ACK_JOB_EXECUTION_ROLE_ARN}"
  releaseLabel: "emr-6.7.0-latest"
  jobDriver:
    sparkSubmitJobDriver:
      entryPoint: "local:///usr/lib/spark/examples/src/main/python/pi.py"
      entryPointArguments:
      sparkSubmitParameters: "--conf spark.executor.instances=2 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1"
  configurationOverrides: |
    ApplicationConfiguration: null
    MonitoringConfiguration:
      CloudWatchMonitoringConfiguration:
        LogGroupName: /emr-on-eks-logs/$EKS_CLUSTER_NAME
        LogStreamNamePrefix: pi-job
      S3MonitoringConfiguration:
        LogUri: s3://$EKS_CLUSTER_NAME-$RANDOM_ID1
EOF
echo "running sample job"
kubectl apply -f jobrun.yaml
kubectl describe jobruns
Expected outcome
Name:         my-ack-jobrun-t2rpcpks
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  emrcontainers.services.k8s.aws/v1alpha1
Kind:         JobRun
...
Status:
  Ack Resource Metadata:
    Arn:               arn:aws:emr-containers:us-west-2:012345678910:/virtualclusters/dxnqujbxexzri28ph1wspbxo0/jobruns/000000030mrd934cdqc
    Owner Account ID:  012345678910
    Region:            us-west-2
  Conditions:
    Last Transition Time:  2022-08-26T18:29:12Z
    Message:               Resource synced successfully
    Reason:
    Status:                True
    Type:                  ACK.ResourceSynced
  Id:                      000000030mrd934cdqc
Events:  <none>
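You can keep an eye on the job run from Kubernetes (kubectl describe jobruns shows the detailed state), and once it completes, check the logs shipped to the S3 bucket and CloudWatch Logs log group configured above; the paths and prefixes below are simply the ones used in this walkthrough:
kubectl get jobruns -w
aws s3 ls s3://$EKS_CLUSTER_NAME-$RANDOM_ID1/ --recursive
aws logs describe-log-streams \
  --log-group-name /emr-on-eks-logs/$EKS_CLUSTER_NAME \
  --log-stream-name-prefix pi-job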
Cleanup
Simply run these commands to clean up your environment.
# delete all custom resources
kubectl delete -f virtualcluster.yaml
kubectl delete -f jobrun.yaml
# note: you cannot delete a jobrun until the virtualcluster it is mapped to is deleted
# uninstall emrcontainers controller
helm delete ack-$SERVICE-controller -n $ACK_SYSTEM_NAMESPACE
# delete namespace
kubectl delete ns $ACK_SYSTEM_NAMESPACE
kubectl delete ns $EMR_NAMESPACE
# delete aws resources
aws logs delete-log-group --log-group-name=/emr-on-eks-logs/$EKS_CLUSTER_NAME
aws s3 rm s3://$EKS_CLUSTER_NAME-$RANDOM_ID1 --recursive
aws s3 rb s3://$EKS_CLUSTER_NAME-$RANDOM_ID1
# delete EKS cluster
eksctl delete cluster --name "${EKS_CLUSTER_NAME}"
Limitations
- You cannot delete a JobRun unless it's in an error state. There is no delete-job-run API for deleting jobs (for good reason). However, if your JobRun goes into an error state, you can run kubectl delete jobrun/<job-run-name> to cancel the job. You can also cancel it through the AWS CLI, as shown below.
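For reference, a running job can be cancelled outside Kubernetes with the EMR on EKS cancel-job-run API; the IDs below are placeholders taken from the Status fields shown earlier:
aws emr-containers cancel-job-run \
  --id <job-run-id> \
  --virtual-cluster-id <virtual-cluster-id>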
Troubleshooting
- If you run into issues creating a VirtualCluster or JobRun, check the EMR on EKS controller logs for troubleshooting:
CONTROLLER_POD=$(kubectl get pod -n ${ACK_SYSTEM_NAMESPACE} -o jsonpath='{.items..metadata.name}')
kubectl logs ${CONTROLLER_POD} -n ${ACK_SYSTEM_NAMESPACE}
- You can enable debug logs for the EMR on EKS controller if you are unable to determine the cause of the error. You need to change the value of --enable-development-logging to "true" and --log-level to debug.
CONTROLLER_DEPLOYMENT=$(kubectl get deploy -n ${ACK_SYSTEM_NAMESPACE} -o jsonpath='{.items..metadata.name}')
kubectl edit deploy/${CONTROLLER_DEPLOYMENT} -n ${ACK_SYSTEM_NAMESPACE}
This is how your values should look after changes are applied.
- --aws-region
- $(AWS_REGION)
- --aws-endpoint-url
- $(AWS_ENDPOINT_URL)
- --enable-development-logging
- "true"
- --log-level
- debug
- --resource-tags
- $(ACK_RESOURCE_TAGS)
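Alternatively, instead of editing the deployment by hand, you can set these flags through the Helm release. This is a sketch that assumes the chart exposes log.level and log.enable_development_logging values; check the chart's values.yaml for your release before relying on them:
helm upgrade ack-$SERVICE-controller \
  oci://public.ecr.aws/aws-controllers-k8s/$SERVICE-chart \
  --version=$RELEASE_VERSION -n $ACK_SYSTEM_NAMESPACE \
  --set=aws.region=$AWS_REGION \
  --set=log.level=debug \
  --set=log.enable_development_logging=true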
- If you run into any issue, please create a GitHub issue. Click New issue and select the type of issue, add [emr-containers] <high-level overview> as the title, and add enough details so that we can reproduce the problem and provide a response.