TrainingJob

sagemaker.services.k8s.aws/v1alpha1

Type | Link
GoDoc | sagemaker-controller/apis/v1alpha1#TrainingJob

Metadata

Property | Value
Scope | Namespaced
Kind | TrainingJob
ListKind | TrainingJobList
Plural | trainingjobs
Singular | trainingjob

Contains information about a training job.

Spec

algorithmSpecification: 
  algorithmName: string
  enableSageMakerMetricsTimeSeries: boolean
  metricDefinitions:
  - name: string
    regex: string
  trainingImage: string
  trainingInputMode: string
checkpointConfig: 
  localPath: string
  s3URI: string
debugHookConfig: 
  collectionConfigurations:
  - collectionName: string
    collectionParameters: {}
  hookParameters: {}
  localPath: string
  s3OutputPath: string
debugRuleConfigurations:
- instanceType: string
  localPath: string
  ruleConfigurationName: string
  ruleEvaluatorImage: string
  ruleParameters: {}
  s3OutputPath: string
  volumeSizeInGB: integer
enableInterContainerTrafficEncryption: boolean
enableManagedSpotTraining: boolean
enableNetworkIsolation: boolean
environment: {}
experimentConfig: 
  experimentName: string
  trialComponentDisplayName: string
  trialName: string
hyperParameters: {}
inputDataConfig:
- channelName: string
  compressionType: string
  contentType: string
  dataSource: 
    fileSystemDataSource: 
      directoryPath: string
      fileSystemAccessMode: string
      fileSystemID: string
      fileSystemType: string
    s3DataSource: 
      attributeNames:
      - string
      instanceGroupNames:
      - string
      s3DataDistributionType: string
      s3DataType: string
      s3URI: string
  inputMode: string
  recordWrapperType: string
  shuffleConfig: 
    seed: integer
outputDataConfig: 
  kmsKeyID: string
  s3OutputPath: string
profilerConfig: 
  profilingIntervalInMilliseconds: integer
  profilingParameters: {}
  s3OutputPath: string
profilerRuleConfigurations:
- instanceType: string
  localPath: string
  ruleConfigurationName: string
  ruleEvaluatorImage: string
  ruleParameters: {}
  s3OutputPath: string
  volumeSizeInGB: integer
resourceConfig: 
  instanceCount: integer
  instanceGroups:
  - instanceCount: integer
    instanceGroupName: string
    instanceType: string
  instanceType: string
  keepAlivePeriodInSeconds: integer
  volumeKMSKeyID: string
  volumeSizeInGB: integer
retryStrategy: 
  maximumRetryAttempts: integer
roleARN: string
stoppingCondition: 
  maxRuntimeInSeconds: integer
  maxWaitTimeInSeconds: integer
tags:
- key: string
  value: string
tensorBoardOutputConfig: 
  localPath: string
  s3OutputPath: string
trainingJobName: string
vpcConfig: 
  securityGroupIDs:
  - string
  subnets:
  - string
Field | Description
algorithmSpecification
Required
object
The registry path of the Docker image that contains the training algorithm and algorithm-specific metadata, including the input mode. For more information about algorithms provided by SageMaker, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). For information about providing your own algorithms, see Using Your Own Algorithms with Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html).
algorithmSpecification.algorithmName
Optional
string
algorithmSpecification.enableSageMakerMetricsTimeSeries
Optional
boolean
algorithmSpecification.metricDefinitions
Optional
array
algorithmSpecification.metricDefinitions.[]
Required
object
Specifies a metric that the training algorithm writes to stderr or stdout. SageMaker hyperparameter tuning captures all defined metrics. You specify one metric that a hyperparameter tuning job uses as its objective metric to choose the best training job.
algorithmSpecification.metricDefinitions.[].regex
Optional
string
algorithmSpecification.trainingImage
Optional
string
algorithmSpecification.trainingInputMode
Optional
string
The training input mode that the algorithm supports. For more information about input modes, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).
Pipe mode
If an algorithm supports Pipe mode, Amazon SageMaker streams data directly from Amazon S3 to the container.
File mode
If an algorithm supports File mode, SageMaker downloads the training data from S3 to the provisioned ML storage volume, and mounts the directory to the Docker volume for the training container.
You must provision the ML storage volume with sufficient capacity to accommodate the data downloaded from S3. In addition to the training data, the ML storage volume also stores the output model. The algorithm container uses the ML storage volume to also store intermediate information, if any.
For distributed algorithms, training data is distributed uniformly. Your training duration is predictable if the input data object sizes are approximately the same. SageMaker does not split the files any further for model training. If the object sizes are skewed, training won’t be optimal, because the data distribution is also skewed and one host in the training cluster can become overloaded and bottleneck training.
FastFile mode
If an algorithm supports FastFile mode, SageMaker streams data directly from S3 to the container with no code changes, and provides file system access to the data. Users can author their training script to interact with these files as if they were stored on disk.
FastFile mode works best when the data is read sequentially. Augmented manifest files aren’t supported. The startup time is lower when there are fewer files in the S3 bucket provided.
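As a minimal sketch of how the algorithmSpecification fields fit together, the fragment below sets a custom training image, File input mode, and one metric definition. The image URI, metric name, and regex are hypothetical placeholders; the regex assumes the training script prints lines such as "validation-accuracy: 0.93".

```yaml
spec:
  algorithmSpecification:
    # Hypothetical ECR image URI; replace with your own training image.
    trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/example-training-image:latest
    # One of the input modes described above: Pipe, File, or FastFile.
    trainingInputMode: File
    enableSageMakerMetricsTimeSeries: true
    metricDefinitions:
    # Assumes the script logs lines like "validation-accuracy: 0.93".
    # The capture group selects the numeric value of the metric.
    - name: "validation:accuracy"
      regex: 'validation-accuracy: ([0-9\.]+)'
```

The regex is single-quoted so that YAML passes the backslash through unchanged.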
checkpointConfig
Optional
object
Contains information about the output location for managed spot training checkpoint data.
checkpointConfig.localPath
Optional
string
checkpointConfig.s3URI
Optional
string
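A sketch of a checkpoint configuration for a managed spot training job follows. The bucket name and prefix are hypothetical, and the example assumes the training script writes its checkpoints to /opt/ml/checkpoints inside the container.

```yaml
spec:
  enableManagedSpotTraining: true
  checkpointConfig:
    # Hypothetical S3 prefix where checkpoint data is persisted.
    s3URI: s3://example-bucket/checkpoints/example-training-job
    # Assumed container directory that the training script writes checkpoints to.
    localPath: /opt/ml/checkpoints
```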
debugHookConfig
Optional
object
Configuration information for the Amazon SageMaker Debugger hook parameters, metric and tensor collections, and storage paths. To learn more about how to configure the DebugHookConfig parameter, see Use the SageMaker and Debugger Configuration API Operations to Create, Update, and Debug Your Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).
debugHookConfig.collectionConfigurations
Optional
array
debugHookConfig.collectionConfigurations.[]
Required
object
Configuration information for the Amazon SageMaker Debugger output tensor collections.
debugHookConfig.collectionConfigurations.[].collectionParameters
Optional
object
debugHookConfig.hookParameters
Optional
object
debugHookConfig.localPath
Optional
string
debugHookConfig.s3OutputPath
Optional
string
debugRuleConfigurations
Optional
array
Configuration information for Amazon SageMaker Debugger rules for debugging output tensors.
debugRuleConfigurations.[]
Required
object
Configuration information for SageMaker Debugger rules for debugging. To learn more about how to configure the DebugRuleConfiguration parameter, see Use the SageMaker and Debugger Configuration API Operations to Create, Update, and Debug Your Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).
debugRuleConfigurations.[].localPath
Optional
string
debugRuleConfigurations.[].ruleConfigurationName
Optional
string
debugRuleConfigurations.[].ruleEvaluatorImage
Optional
string
debugRuleConfigurations.[].ruleParameters
Optional
object
debugRuleConfigurations.[].s3OutputPath
Optional
string
debugRuleConfigurations.[].volumeSizeInGB
Optional
integer
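The Debugger fields above might be combined as in the following sketch. The S3 path, rule name, and instance type are hypothetical, the rule evaluator image is left as a placeholder because it depends on the Region, and the parameter keys (save_interval, rule_to_invoke) are assumptions based on the SageMaker Debugger configuration documentation linked above.

```yaml
spec:
  debugHookConfig:
    # Hypothetical S3 prefix for saved tensors.
    s3OutputPath: s3://example-bucket/debugger/tensors
    collectionConfigurations:
    - collectionName: losses
      collectionParameters:
        # Assumed parameter key; values are passed as strings.
        save_interval: "500"
  debugRuleConfigurations:
  - ruleConfigurationName: loss-not-decreasing
    # Placeholder: look up the Debugger rules image URI for your Region.
    ruleEvaluatorImage: "<debugger-rules-image-uri>"
    instanceType: ml.t3.medium
    ruleParameters:
      # Assumed parameter key that selects a built-in rule.
      rule_to_invoke: LossNotDecreasing
```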
enableInterContainerTrafficEncryption
Optional
boolean
To encrypt all communications between ML compute instances in distributed training, choose True. Encryption provides greater security for distributed training, but training might take longer. How long it takes depends on the amount of communication between compute instances, especially if you use a deep learning algorithm in distributed training. For more information, see Protect Communications Between ML Compute Instances in a Distributed Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/train-encrypt.html).
enableManagedSpotTraining
Optional
boolean
To train models using managed spot training, choose True. Managed spot training provides a fully managed and scalable infrastructure for training machine learning models. This option is useful when training jobs can be interrupted and when there is flexibility in when the training job is run.
The complete and intermediate results of jobs are stored in an Amazon S3 bucket, and can be used as a starting point to train models incrementally. Amazon SageMaker provides metrics and logs in CloudWatch. They can be used to see when managed spot training jobs are running, interrupted, resumed, or completed.
enableNetworkIsolation
Optional
boolean
Isolates the training container. No inbound or outbound network calls can be made, except for calls between peers within a training cluster for distributed training. If you enable network isolation for training jobs that are configured to use a VPC, SageMaker downloads and uploads customer data and model artifacts through the specified VPC, but the training container does not have network access.
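A minimal sketch of a network-isolated job that also runs in a VPC is shown below. The security group and subnet IDs are hypothetical placeholders.

```yaml
spec:
  # Block inbound and outbound network calls from the training container.
  enableNetworkIsolation: true
  # Encrypt traffic between instances in distributed training.
  enableInterContainerTrafficEncryption: true
  vpcConfig:
    securityGroupIDs:
    - sg-0123456789abcdef0      # hypothetical security group ID
    subnets:
    - subnet-0123456789abcdef0  # hypothetical subnet ID
```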
environment
Optional
object
The environment variables to set in the Docker container.
experimentConfig
Optional
object
Associates a SageMaker job as a trial component with an experiment and trial. Specified when you call the following APIs:
* CreateProcessingJob
* CreateTrainingJob
* CreateTransformJob
experimentConfig.experimentName
Optional
string
experimentConfig.trialComponentDisplayName
Optional
string
experimentConfig.trialName
Optional
string
hyperParameters
Optional
object
Algorithm-specific parameters that influence the quality of the model. You set hyperparameters before you start the learning process. For a list of hyperparameters for each training algorithm provided by SageMaker, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).
You can specify a maximum of 100 hyperparameters. Each hyperparameter is a key-value pair. Each key and value is limited to 256 characters, as specified by the Length Constraint.
Do not include any security-sensitive information, including account access IDs, secrets, or tokens, in any hyperparameter field. If the use of security-sensitive credentials is detected, SageMaker rejects your training job request and returns an exception error.
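For illustration, a hyperParameters map might look like the sketch below. The keys assume an XGBoost-style algorithm and are not prescribed by this resource. Each hyperparameter is a key-value pair of strings, so numeric values are quoted.

```yaml
spec:
  hyperParameters:
    # Hypothetical, algorithm-specific keys. Quote numbers so that
    # YAML treats them as strings.
    max_depth: "5"
    eta: "0.2"
    objective: "binary:logistic"
    num_round: "100"
```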
inputDataConfig
Optional
array
An array of Channel objects. Each channel is a named input source. InputDataConfig describes the input data and its location.
Algorithms can accept input data from one or more channels. For example, an algorithm might have two channels of input data, training_data and validation_data. The configuration for each channel provides the S3, EFS, or FSx location where the input data is stored. It also provides information about the stored data: the MIME type, compression method, and whether the data is wrapped in RecordIO format.
Depending on the input mode that the algorithm supports, SageMaker either copies input data files from an S3 bucket to a local directory in the Docker container, or makes it available as input streams. For example, if you specify an EFS location, input data files are available as input streams. They do not need to be downloaded.
inputDataConfig.[]
Required
object
A channel is a named input source that training algorithms can consume.
inputDataConfig.[].compressionType
Optional
string
inputDataConfig.[].contentType
Optional
string
inputDataConfig.[].dataSource
Optional
object
Describes the location of the channel data.
inputDataConfig.[].dataSource.fileSystemDataSource
Optional
object
Specifies a file system data source for a channel.
inputDataConfig.[].dataSource.fileSystemDataSource.directoryPath
Optional
string
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemAccessMode
Optional
string
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemID
Optional
string
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemType
Optional
string
inputDataConfig.[].dataSource.s3DataSource
Optional
object
Describes the S3 data source.
inputDataConfig.[].dataSource.s3DataSource.attributeNames
Optional
array
inputDataConfig.[].dataSource.s3DataSource.attributeNames.[]
Required
string
inputDataConfig.[].dataSource.s3DataSource.instanceGroupNames.[]
Required
string
inputDataConfig.[].dataSource.s3DataSource.s3DataType
Optional
string
inputDataConfig.[].dataSource.s3DataSource.s3URI
Optional
string
inputDataConfig.[].inputMode
Optional
string
The training input mode that the algorithm supports. For more information about input modes, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).
Pipe mode
If an algorithm supports Pipe mode, Amazon SageMaker streams data directly from Amazon S3 to the container.
File mode
If an algorithm supports File mode, SageMaker downloads the training data from S3 to the provisioned ML storage volume, and mounts the directory to the Docker volume for the training container.
You must provision the ML storage volume with sufficient capacity to accommodate the data downloaded from S3. In addition to the training data, the ML storage volume also stores the output model. The algorithm container uses the ML storage volume to also store intermediate information, if any.
For distributed algorithms, training data is distributed uniformly. Your training duration is predictable if the input data object sizes are approximately the same. SageMaker does not split the files any further for model training. If the object sizes are skewed, training won’t be optimal, because the data distribution is also skewed and one host in the training cluster can become overloaded and bottleneck training.
FastFile mode
If an algorithm supports FastFile mode, SageMaker streams data directly from S3 to the container with no code changes, and provides file system access to the data. Users can author their training script to interact with these files as if they were stored on disk.
FastFile mode works best when the data is read sequentially. Augmented manifest files aren’t supported. The startup time is lower when there are fewer files in the S3 bucket provided.
inputDataConfig.[].recordWrapperType
Optional
string
inputDataConfig.[].shuffleConfig
Optional
object
A configuration for a shuffle option for input data in a channel. If you use S3Prefix for S3DataType, the results of the S3 key prefix matches are shuffled. If you use ManifestFile, the order of the S3 object references in the ManifestFile is shuffled. If you use AugmentedManifestFile, the order of the JSON lines in the AugmentedManifestFile is shuffled. The shuffling order is determined using the Seed value.
For Pipe input mode, when ShuffleConfig is specified shuffling is done at the start of every epoch. With large datasets, this ensures that the order of the training data is different for each epoch, and it helps reduce bias and possible overfitting. In a multi-node training job when ShuffleConfig is combined with S3DataDistributionType of ShardedByS3Key, the data is shuffled across nodes so that the content sent to a particular node on the first epoch might be sent to a different node on the second epoch.
inputDataConfig.[].shuffleConfig.seed
Optional
integer
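The channel fields above can be sketched as a two-channel configuration, using the training_data and validation_data channel names from the inputDataConfig description. The bucket, prefixes, content type, and seed are hypothetical. The training channel uses Pipe mode with ShardedByS3Key so that ShuffleConfig reshuffles the data at the start of every epoch.

```yaml
spec:
  inputDataConfig:
  - channelName: training_data
    contentType: text/csv
    compressionType: "None"
    inputMode: Pipe
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3URI: s3://example-bucket/data/train/   # hypothetical prefix
        s3DataDistributionType: ShardedByS3Key
    shuffleConfig:
      # The seed determines the shuffling order.
      seed: 42
  - channelName: validation_data
    contentType: text/csv
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3URI: s3://example-bucket/data/validation/   # hypothetical prefix
        s3DataDistributionType: FullyReplicated
```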
outputDataConfig
Required
object
Specifies the path to the S3 location where you want to store model artifacts. SageMaker creates subfolders for the artifacts.
outputDataConfig.kmsKeyID
Optional
string
outputDataConfig.s3OutputPath
Optional
string
profilerConfig
Optional
object
Configuration information for Amazon SageMaker Debugger system monitoring, framework profiling, and storage paths.
profilerConfig.profilingIntervalInMilliseconds
Optional
integer
profilerConfig.profilingParameters
Optional
object
profilerConfig.s3OutputPath
Optional
string
profilerRuleConfigurations
Optional
array
Configuration information for Amazon SageMaker Debugger rules for profiling system and framework metrics.
profilerRuleConfigurations.[]
Required
object
Configuration information for profiling rules.
profilerRuleConfigurations.[].localPath
Optional
string
profilerRuleConfigurations.[].ruleConfigurationName
Optional
string
profilerRuleConfigurations.[].ruleEvaluatorImage
Optional
string
profilerRuleConfigurations.[].ruleParameters
Optional
object
profilerRuleConfigurations.[].s3OutputPath
Optional
string
profilerRuleConfigurations.[].volumeSizeInGB
Optional
integer
resourceConfig
Required
object
The resources, including the ML compute instances and ML storage volumes, to use for model training.
ML storage volumes store model artifacts and incremental states. Training algorithms might also use ML storage volumes for scratch space. If you want SageMaker to use the ML storage volume to store the training data, choose File as the TrainingInputMode in the algorithm specification. For distributed training algorithms, specify an instance count greater than 1.
resourceConfig.instanceCount
Optional
integer
resourceConfig.instanceGroups
Optional
array
resourceConfig.instanceGroups.[]
Required
object
Defines an instance group for heterogeneous cluster training. When requesting a training job using the CreateTrainingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API, you can configure multiple instance groups.
resourceConfig.instanceGroups.[].instanceGroupName
Optional
string
resourceConfig.instanceGroups.[].instanceType
Optional
string
resourceConfig.instanceType
Optional
string
resourceConfig.keepAlivePeriodInSeconds
Optional
integer
resourceConfig.volumeKMSKeyID
Optional
string
resourceConfig.volumeSizeInGB
Optional
integer
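A sketch of a resourceConfig for a small distributed job follows. The instance type, counts, and sizes are illustrative only, and the keepAlivePeriodInSeconds line assumes you want the instances retained in a warm pool after the job ends.

```yaml
spec:
  resourceConfig:
    # An instance count greater than 1 requests distributed training.
    instanceCount: 2
    instanceType: ml.m5.xlarge
    # Size the ML storage volume for the training data (File mode),
    # the output model, and any intermediate state.
    volumeSizeInGB: 50
    # Optional and illustrative: keep instances warm for 10 minutes.
    keepAlivePeriodInSeconds: 600
```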
retryStrategy
Optional
object
The number of times to retry the job when the job fails due to an InternalServerError.
retryStrategy.maximumRetryAttempts
Optional
integer
roleARN
Required
string
The Amazon Resource Name (ARN) of an IAM role that SageMaker can assume to perform tasks on your behalf.
During model training, SageMaker needs your permission to read input data from an S3 bucket, download a Docker image that contains training code, write model artifacts to an S3 bucket, write logs to Amazon CloudWatch Logs, and publish metrics to Amazon CloudWatch. You grant permissions for all of these tasks to an IAM role. For more information, see SageMaker Roles (https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
To be able to pass this role to SageMaker, the caller of this API must have the iam:PassRole permission.
stoppingCondition
Required
object
Specifies a limit to how long a model training job can run. It also specifies how long a managed Spot training job has to complete. When the job reaches the time limit, SageMaker ends the training job. Use this API to cap model training costs.
To stop a job, SageMaker sends the algorithm the SIGTERM signal, which delays job termination for 120 seconds. Algorithms can use this 120-second window to save the model artifacts, so the results of training are not lost.
stoppingCondition.maxRuntimeInSeconds
Optional
integer
stoppingCondition.maxWaitTimeInSeconds
Optional
integer
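As a sketch, a stopping condition for a managed spot training job could be written as follows. The values are illustrative. The example assumes that maxWaitTimeInSeconds applies only when managed spot training is enabled and sets it to at least maxRuntimeInSeconds.

```yaml
spec:
  enableManagedSpotTraining: true
  stoppingCondition:
    # Cap the training run at 1 hour.
    maxRuntimeInSeconds: 3600
    # Total time allowed for the spot job, including time spent waiting
    # for capacity. Kept greater than or equal to maxRuntimeInSeconds.
    maxWaitTimeInSeconds: 7200
```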
tags
Optional
array
An array of key-value pairs. You can use tags to categorize your Amazon Web Services resources in different ways, for example, by purpose, owner, or environment. For more information, see Tagging Amazon Web Services Resources (https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html).
tags.[]
Required
object
A tag object that consists of a key and an optional value, used to manage metadata for SageMaker Amazon Web Services resources.
You can add tags to notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, models, labeling jobs, work teams, endpoint configurations, and endpoints. For more information on adding tags to SageMaker resources, see AddTags.
For more information on adding metadata to your Amazon Web Services resources with tagging, see Tagging Amazon Web Services resources (https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html). For advice on best practices for managing Amazon Web Services resources with tagging, see Tagging Best Practices: Implement an Effective Amazon Web Services Resource Tagging Strategy (https://d1.awsstatic.com/whitepapers/aws-tagging-best-practices.pdf).
tags.[].value
Optional
string
tensorBoardOutputConfig
Optional
object
Configuration of storage locations for the Amazon SageMaker Debugger TensorBoard output data.
tensorBoardOutputConfig.localPath
Optional
string
tensorBoardOutputConfig.s3OutputPath
Optional
string
trainingJobName
Required
string
The name of the training job. The name must be unique within an Amazon Web Services Region in an Amazon Web Services account.
vpcConfig
Optional
object
A VpcConfig object that specifies the VPC that you want your training job to connect to. Control access to and from your training container by configuring the VPC. For more information, see Protect Training Jobs by Using an Amazon Virtual Private Cloud (https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html).
vpcConfig.securityGroupIDs
Optional
array
vpcConfig.securityGroupIDs.[]
Required
string
vpcConfig.subnets.[]
Required
string
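Putting the required Spec fields together, a minimal TrainingJob manifest might look like the sketch below. All names, ARNs, image URIs, and S3 locations are hypothetical placeholders, and the single input channel is optional.

```yaml
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: example-training-job
spec:
  # Must be unique within an AWS Region in an AWS account.
  trainingJobName: example-training-job
  # Hypothetical IAM role that SageMaker assumes on your behalf.
  roleARN: arn:aws:iam::123456789012:role/example-sagemaker-execution-role
  algorithmSpecification:
    trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/example-training-image:latest
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: s3://example-bucket/output/
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m5.xlarge
    volumeSizeInGB: 30
  stoppingCondition:
    maxRuntimeInSeconds: 3600
  inputDataConfig:
  - channelName: training_data
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3URI: s3://example-bucket/data/train/
        s3DataDistributionType: FullyReplicated
  tags:
  - key: purpose
    value: example
```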

Status

ackResourceMetadata: 
  arn: string
  ownerAccountID: string
  region: string
conditions:
- lastTransitionTime: string
  message: string
  reason: string
  status: string
  type: string
creationTime: string
debugRuleEvaluationStatuses:
- lastModifiedTime: string
  ruleConfigurationName: string
  ruleEvaluationJobARN: string
  ruleEvaluationStatus: string
  statusDetails: string
failureReason: string
lastModifiedTime: string
modelArtifacts: 
  s3ModelArtifacts: string
profilerRuleEvaluationStatuses:
- lastModifiedTime: string
  ruleConfigurationName: string
  ruleEvaluationJobARN: string
  ruleEvaluationStatus: string
  statusDetails: string
profilingStatus: string
secondaryStatus: string
trainingJobStatus: string
warmPoolStatus: 
  resourceRetainedBillableTimeInSeconds: integer
  reusedByJob: string
  status: string
Field | Description
ackResourceMetadata
Optional
object
All CRs managed by ACK have a common Status.ACKResourceMetadata member that is used to contain resource sync state, account ownership, and the constructed ARN for the resource.
ackResourceMetadata.arn
Optional
string
ARN is the Amazon Resource Name for the resource. This is a globally-unique identifier and is set only by the ACK service controller once the controller has orchestrated the creation of the resource OR when it has verified that an “adopted” resource (a resource where the ARN annotation was set by the Kubernetes user on the CR) exists and matches the supplied CR’s Spec field values. TODO(vijat@): Find a better strategy for resources that do not have ARN in CreateOutputResponse https://github.com/aws/aws-controllers-k8s/issues/270
ackResourceMetadata.ownerAccountID
Required
string
OwnerAccountID is the AWS Account ID of the account that owns the backend AWS service API resource.
ackResourceMetadata.region
Required
string
Region is the AWS region in which the resource exists or will exist.
conditions
Optional
array
All CRs managed by ACK have a common Status.Conditions member that contains a collection of ackv1alpha1.Condition objects that describe the various terminal states of the CR and its backend AWS service API resource.
conditions.[]
Required
object
Condition is the common struct used by all CRDs managed by ACK service controllers to indicate terminal states of the CR and its backend AWS service API resource.
conditions.[].message
Optional
string
A human-readable message indicating details about the transition.
conditions.[].reason
Optional
string
The reason for the condition’s last transition.
conditions.[].status
Optional
string
Status of the condition, one of True, False, Unknown.
conditions.[].type
Optional
string
Type is the type of the Condition.
creationTime
Optional
string
A timestamp that indicates when the training job was created.
debugRuleEvaluationStatuses
Optional
array
Evaluation status of Amazon SageMaker Debugger rules for debugging on a training job.
debugRuleEvaluationStatuses.[]
Required
object
Information about the status of the rule evaluation.
debugRuleEvaluationStatuses.[].ruleConfigurationName
Optional
string
debugRuleEvaluationStatuses.[].ruleEvaluationJobARN
Optional
string
debugRuleEvaluationStatuses.[].ruleEvaluationStatus
Optional
string
debugRuleEvaluationStatuses.[].statusDetails
Optional
string
failureReason
Optional
string
If the training job failed, the reason it failed.
lastModifiedTime
Optional
string
A timestamp that indicates when the status of the training job was last modified.
modelArtifacts
Optional
object
Information about the Amazon S3 location that is configured for storing model artifacts.
modelArtifacts.s3ModelArtifacts
Optional
string
profilerRuleEvaluationStatuses
Optional
array
Evaluation status of Amazon SageMaker Debugger rules for profiling on a training job.
profilerRuleEvaluationStatuses.[]
Required
object
Information about the status of the rule evaluation.
profilerRuleEvaluationStatuses.[].ruleConfigurationName
Optional
string
profilerRuleEvaluationStatuses.[].ruleEvaluationJobARN
Optional
string
profilerRuleEvaluationStatuses.[].ruleEvaluationStatus
Optional
string
profilerRuleEvaluationStatuses.[].statusDetails
Optional
string
profilingStatus
Optional
string
Profiling status of a training job.
secondaryStatus
Optional
string
Provides detailed information about the state of the training job. For detailed information on the secondary status of the training job, see StatusMessage under SecondaryStatusTransition.
SageMaker provides primary statuses and secondary statuses that apply to each of them:
InProgress
* Starting - Starting the training job.
* Downloading - An optional stage for algorithms that support File training input mode. It indicates that data is being downloaded to the ML storage volumes.
* Training - Training is in progress.
* Interrupted - The job stopped because the managed spot training instances were interrupted.
* Uploading - Training is complete and the model artifacts are being uploaded to the S3 location.
Completed
* Completed - The training job has completed.
Failed
* Failed - The training job has failed. The reason for the failure is returned in the FailureReason field of DescribeTrainingJobResponse.
Stopped
* MaxRuntimeExceeded - The job stopped because it exceeded the maximum allowed runtime.
* MaxWaitTimeExceeded - The job stopped because it exceeded the maximum allowed wait time.
* Stopped - The training job has stopped.
Stopping
* Stopping - Stopping the training job.
Valid values for SecondaryStatus are subject to change.
We no longer support the following secondary statuses:
* LaunchingMLInstances
* PreparingTraining
* DownloadingTrainingImage
trainingJobStatus
Optional
string
The status of the training job.
SageMaker provides the following training job statuses:
* InProgress - The training is in progress.
* Completed - The training job has completed.
* Failed - The training job has failed. To see the reason for the failure, see the FailureReason field in the response to a DescribeTrainingJob call.
* Stopping - The training job is stopping.
* Stopped - The training job has stopped.
For more detailed information, see SecondaryStatus.
warmPoolStatus
Optional
object
The status of the warm pool associated with the training job.
warmPoolStatus.resourceRetainedBillableTimeInSeconds
Optional
integer
warmPoolStatus.reusedByJob
Optional
string
warmPoolStatus.status
Optional
string
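For orientation, the Status of a job that finished successfully might resemble the sketch below. This is an illustrative shape, not captured output: the ARN, account ID, Region, and S3 path are hypothetical, and the condition type shown is an assumption about the conditions the ACK controller reports.

```yaml
status:
  ackResourceMetadata:
    arn: arn:aws:sagemaker:us-west-2:123456789012:training-job/example-training-job
    ownerAccountID: "123456789012"
    region: us-west-2
  conditions:
  # Assumed condition type; status is one of True, False, Unknown.
  - type: ACK.ResourceSynced
    status: "True"
  # Primary status and the secondary status that applies to it.
  trainingJobStatus: Completed
  secondaryStatus: Completed
  modelArtifacts:
    s3ModelArtifacts: s3://example-bucket/output/example-training-job/output/model.tar.gz
```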