TrainingJob

sagemaker.services.k8s.aws/v1alpha1

Type	Link
GoDoc	sagemaker-controller/apis/v1alpha1#TrainingJob

Metadata

Property	Value
Scope	Namespaced
Kind	`TrainingJob`
ListKind	`TrainingJobList`
Plural	`trainingjobs`
Singular	`trainingjob`

Contains information about a training job.

Spec

algorithmSpecification: 
  algorithmName: string
  enableSageMakerMetricsTimeSeries: boolean
  metricDefinitions:
  - name: string
    regex: string
  trainingImage: string
  trainingInputMode: string
checkpointConfig: 
  localPath: string
  s3URI: string
debugHookConfig: 
  collectionConfigurations:
  - collectionName: string
    collectionParameters: {}
  hookParameters: {}
  localPath: string
  s3OutputPath: string
debugRuleConfigurations:
- instanceType: string
  localPath: string
  ruleConfigurationName: string
  ruleEvaluatorImage: string
  ruleParameters: {}
  s3OutputPath: string
  volumeSizeInGB: integer
enableInterContainerTrafficEncryption: boolean
enableManagedSpotTraining: boolean
enableNetworkIsolation: boolean
environment: {}
experimentConfig: 
  experimentName: string
  trialComponentDisplayName: string
  trialName: string
hyperParameters: {}
infraCheckConfig: 
  enableInfraCheck: boolean
inputDataConfig:
- channelName: string
  compressionType: string
  contentType: string
  dataSource: 
    fileSystemDataSource: 
      directoryPath: string
      fileSystemAccessMode: string
      fileSystemID: string
      fileSystemType: string
    s3DataSource: 
      attributeNames:
      - string
      instanceGroupNames:
      - string
      s3DataDistributionType: string
      s3DataType: string
      s3URI: string
  inputMode: string
  recordWrapperType: string
  shuffleConfig: 
    seed: integer
outputDataConfig: 
  compressionType: string
  kmsKeyID: string
  s3OutputPath: string
profilerConfig: 
  profilingIntervalInMilliseconds: integer
  profilingParameters: {}
  s3OutputPath: string
profilerRuleConfigurations:
- instanceType: string
  localPath: string
  ruleConfigurationName: string
  ruleEvaluatorImage: string
  ruleParameters: {}
  s3OutputPath: string
  volumeSizeInGB: integer
remoteDebugConfig: 
  enableRemoteDebug: boolean
resourceConfig: 
  instanceCount: integer
  instanceGroups:
  - instanceCount: integer
    instanceGroupName: string
    instanceType: string
  instanceType: string
  keepAlivePeriodInSeconds: integer
  volumeKMSKeyID: string
  volumeSizeInGB: integer
retryStrategy: 
  maximumRetryAttempts: integer
roleARN: string
stoppingCondition: 
  maxPendingTimeInSeconds: integer
  maxRuntimeInSeconds: integer
  maxWaitTimeInSeconds: integer
tags:
- key: string
  value: string
tensorBoardOutputConfig: 
  localPath: string
  s3OutputPath: string
trainingJobName: string
vpcConfig: 
  securityGroupIDs:
  - string
  subnets:
  - string
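
As a rough illustration of how this schema maps onto a manifest, the sketch below sets the fields that the field table marks as Required, plus a few nested values that a typical job also supplies (training image, input mode, instance settings, output path). The `apiVersion` and `kind` come from this page. Every name, account ID, ARN, image URI, and bucket is a hypothetical placeholder, and the instance type and limits are arbitrary choices rather than recommendations.

```yaml
# Minimal TrainingJob sketch. All names, ARNs, image URIs, and buckets are
# hypothetical placeholders; substitute values from your own account.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: example-training-job
spec:
  # Must be unique within a Region and account.
  trainingJobName: example-training-job
  # IAM role that SageMaker assumes on your behalf.
  roleARN: arn:aws:iam::111122223333:role/example-sagemaker-execution-role
  algorithmSpecification:
    trainingImage: 111122223333.dkr.ecr.us-west-2.amazonaws.com/example-algorithm:latest
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: s3://example-bucket/training-output
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m5.large
    volumeSizeInGB: 10
  stoppingCondition:
    maxRuntimeInSeconds: 3600
```

The placeholder `trainingJobName` and `roleARN` values are chosen to satisfy the regex patterns listed in the field table.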

Field	Description
algorithmSpecification Required	object The registry path of the Docker image that contains the training algorithm and algorithm-specific metadata, including the input mode. For more information about algorithms provided by SageMaker, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). For information about providing your own algorithms, see Using Your Own Algorithms with Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html).
algorithmSpecification.algorithmName Optional	string
algorithmSpecification.enableSageMakerMetricsTimeSeries Optional	boolean
algorithmSpecification.metricDefinitions Optional	array
algorithmSpecification.metricDefinitions.[] Required	object Specifies a metric that the training algorithm writes to stderr or stdout.
You can view these logs to understand how your training job performs and
check for any errors encountered during training. SageMaker hyperparameter
tuning captures all defined metrics. Specify one of the defined metrics to
use as an objective metric using the TuningObjective (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html#sagemaker-Type-HyperParameterTrainingJobDefinition-TuningObjective)
parameter in the HyperParameterTrainingJobDefinition API to evaluate job
performance during hyperparameter tuning.
algorithmSpecification.metricDefinitions.[].regex Optional	string
algorithmSpecification.trainingImage Optional	string
algorithmSpecification.trainingInputMode Optional	string The training input mode that the algorithm supports. For more information about input modes, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). Pipe mode: If an algorithm supports Pipe mode, Amazon SageMaker streams data directly from Amazon S3 to the container. File mode: If an algorithm supports File mode, SageMaker downloads the training data from S3 to the provisioned ML storage volume, and mounts the directory to the Docker volume for the training container. You must provision the ML storage volume with sufficient capacity to accommodate the data downloaded from S3. In addition to the training data, the ML storage volume also stores the output model. The algorithm container uses the ML storage volume to also store intermediate information, if any. For distributed algorithms, training data is distributed uniformly. Your training duration is predictable if the input data object sizes are approximately the same. SageMaker does not split the files any further for model training. If the object sizes are skewed, training won’t be optimal: the data distribution is also skewed, so one host in the training cluster can be overloaded and become a bottleneck. FastFile mode: If an algorithm supports FastFile mode, SageMaker streams data directly from S3 to the container with no code changes, and provides file system access to the data. Users can author their training script to interact with these files as if they were stored on disk. FastFile mode works best when the data is read sequentially. Augmented manifest files aren’t supported. The startup time is lower when there are fewer files in the S3 bucket provided.
checkpointConfig Optional	object Contains information about the output location for managed spot training checkpoint data.
checkpointConfig.localPath Optional	string
checkpointConfig.s3URI Optional	string
debugHookConfig Optional	object Configuration information for the Amazon SageMaker Debugger hook parameters, metric and tensor collections, and storage paths. To learn more about how to configure the DebugHookConfig parameter, see Use the SageMaker and Debugger Configuration API Operations to Create, Update, and Debug Your Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).
debugHookConfig.collectionConfigurations Optional	array
debugHookConfig.collectionConfigurations.[] Required	object Configuration information for the Amazon SageMaker Debugger output tensor
collections.
debugHookConfig.collectionConfigurations.[].collectionParameters Optional	object
debugHookConfig.hookParameters Optional	object
debugHookConfig.localPath Optional	string
debugHookConfig.s3OutputPath Optional	string
debugRuleConfigurations Optional	array Configuration information for Amazon SageMaker Debugger rules for debugging output tensors.
debugRuleConfigurations.[] Required	object Configuration information for SageMaker Debugger rules for debugging. To
learn more about how to configure the DebugRuleConfiguration parameter, see
Use the SageMaker and Debugger Configuration API Operations to Create, Update,
and Debug Your Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).
debugRuleConfigurations.[].localPath Optional	string
debugRuleConfigurations.[].ruleConfigurationName Optional	string
debugRuleConfigurations.[].ruleEvaluatorImage Optional	string
debugRuleConfigurations.[].ruleParameters Optional	object
debugRuleConfigurations.[].s3OutputPath Optional	string
debugRuleConfigurations.[].volumeSizeInGB Optional	integer
enableInterContainerTrafficEncryption Optional	boolean To encrypt all communications between ML compute instances in distributed training, choose True. Encryption provides greater security for distributed training, but training might take longer. How long it takes depends on the amount of communication between compute instances, especially if you use a deep learning algorithm in distributed training. For more information, see Protect Communications Between ML Compute Instances in a Distributed Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/train-encrypt.html).
enableManagedSpotTraining Optional	boolean To train models using managed spot training, choose True. Managed spot training provides a fully managed and scalable infrastructure for training machine learning models. This option is useful when training jobs can be interrupted and when there is flexibility in when the training job runs. The complete and intermediate results of jobs are stored in an Amazon S3 bucket, and can be used as a starting point to train models incrementally. Amazon SageMaker provides metrics and logs in CloudWatch. They can be used to see when managed spot training jobs are running, interrupted, resumed, or completed.
enableNetworkIsolation Optional	boolean Isolates the training container. No inbound or outbound network calls can be made, except for calls between peers within a training cluster for distributed training. If you enable network isolation for training jobs that are configured to use a VPC, SageMaker downloads and uploads customer data and model artifacts through the specified VPC, but the training container does not have network access.
environment Optional	object The environment variables to set in the Docker container.
experimentConfig Optional	object Associates a SageMaker job as a trial component with an experiment and trial. Specified when you call the following APIs: * CreateProcessingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html) * CreateTrainingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) * CreateTransformJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html)
experimentConfig.experimentName Optional	string
experimentConfig.trialComponentDisplayName Optional	string
experimentConfig.trialName Optional	string
hyperParameters Optional	object Algorithm-specific parameters that influence the quality of the model. You set hyperparameters before you start the learning process. For a list of hyperparameters for each training algorithm provided by SageMaker, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). You can specify a maximum of 100 hyperparameters. Each hyperparameter is a key-value pair. Each key and value is limited to 256 characters, as specified by the Length Constraint. Do not include any security-sensitive information, including account access IDs, secrets, or tokens, in any hyperparameter field. If the use of security-sensitive credentials is detected, SageMaker will reject your training job request and return an exception error.
infraCheckConfig Optional	object Contains information about the infrastructure health check configuration for the training job.
infraCheckConfig.enableInfraCheck Optional	boolean
inputDataConfig Optional	array An array of Channel objects. Each channel is a named input source. InputDataConfig describes the input data and its location. Algorithms can accept input data from one or more channels. For example, an algorithm might have two channels of input data, training_data and validation_data. The configuration for each channel provides the S3, EFS, or FSx location where the input data is stored. It also provides information about the stored data: the MIME type, compression method, and whether the data is wrapped in RecordIO format. Depending on the input mode that the algorithm supports, SageMaker either copies input data files from an S3 bucket to a local directory in the Docker container, or makes it available as input streams. For example, if you specify an EFS location, input data files are available as input streams. They do not need to be downloaded. Your input must be in the same Amazon Web Services region as your training job.
inputDataConfig.[] Required	object A channel is a named input source that training algorithms can consume.
inputDataConfig.[].compressionType Optional	string
inputDataConfig.[].contentType Optional	string
inputDataConfig.[].dataSource Optional	object Describes the location of the channel data.
inputDataConfig.[].dataSource.fileSystemDataSource Optional	object Specifies a file system data source for a channel.
inputDataConfig.[].dataSource.fileSystemDataSource.directoryPath Optional	string
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemAccessMode Optional	string
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemID Optional	string
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemType Optional	string
inputDataConfig.[].dataSource.s3DataSource Optional	object Describes the S3 data source. Your input bucket must be in the same Amazon Web Services region as your training job.
inputDataConfig.[].dataSource.s3DataSource.attributeNames Optional	array
inputDataConfig.[].dataSource.s3DataSource.attributeNames.[] Required	string
inputDataConfig.[].dataSource.s3DataSource.instanceGroupNames.[] Required	string
inputDataConfig.[].dataSource.s3DataSource.s3DataType Optional	string
inputDataConfig.[].dataSource.s3DataSource.s3URI Optional	string
inputDataConfig.[].inputMode Optional	string The training input mode that the algorithm supports. For more information about input modes, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). Pipe mode: If an algorithm supports Pipe mode, Amazon SageMaker streams data directly from Amazon S3 to the container. File mode: If an algorithm supports File mode, SageMaker downloads the training data from S3 to the provisioned ML storage volume, and mounts the directory to the Docker volume for the training container. You must provision the ML storage volume with sufficient capacity to accommodate the data downloaded from S3. In addition to the training data, the ML storage volume also stores the output model. The algorithm container uses the ML storage volume to also store intermediate information, if any. For distributed algorithms, training data is distributed uniformly. Your training duration is predictable if the input data object sizes are approximately the same. SageMaker does not split the files any further for model training. If the object sizes are skewed, training won’t be optimal: the data distribution is also skewed, so one host in the training cluster can be overloaded and become a bottleneck. FastFile mode: If an algorithm supports FastFile mode, SageMaker streams data directly from S3 to the container with no code changes, and provides file system access to the data. Users can author their training script to interact with these files as if they were stored on disk. FastFile mode works best when the data is read sequentially. Augmented manifest files aren’t supported. The startup time is lower when there are fewer files in the S3 bucket provided.
inputDataConfig.[].recordWrapperType Optional	string
inputDataConfig.[].shuffleConfig Optional	object A configuration for a shuffle option for input data in a channel. If you use S3Prefix for S3DataType, the results of the S3 key prefix matches are shuffled. If you use ManifestFile, the order of the S3 object references in the ManifestFile is shuffled. If you use AugmentedManifestFile, the order of the JSON lines in the AugmentedManifestFile is shuffled. The shuffling order is determined using the Seed value. For Pipe input mode, when ShuffleConfig is specified shuffling is done at the start of every epoch. With large datasets, this ensures that the order of the training data is different for each epoch, and it helps reduce bias and possible overfitting. In a multi-node training job when ShuffleConfig is combined with S3DataDistributionType of ShardedByS3Key, the data is shuffled across nodes so that the content sent to a particular node on the first epoch might be sent to a different node on the second epoch.
inputDataConfig.[].shuffleConfig.seed Optional	integer
outputDataConfig Required	object Specifies the path to the S3 location where you want to store model artifacts. SageMaker creates subfolders for the artifacts.
outputDataConfig.compressionType Optional	string
outputDataConfig.kmsKeyID Optional	string
outputDataConfig.s3OutputPath Optional	string
profilerConfig Optional	object Configuration information for Amazon SageMaker Debugger system monitoring, framework profiling, and storage paths.
profilerConfig.profilingIntervalInMilliseconds Optional	integer
profilerConfig.profilingParameters Optional	object
profilerConfig.s3OutputPath Optional	string
profilerRuleConfigurations Optional	array Configuration information for Amazon SageMaker Debugger rules for profiling system and framework metrics.
profilerRuleConfigurations.[] Required	object Configuration information for profiling rules.
profilerRuleConfigurations.[].localPath Optional	string
profilerRuleConfigurations.[].ruleConfigurationName Optional	string
profilerRuleConfigurations.[].ruleEvaluatorImage Optional	string
profilerRuleConfigurations.[].ruleParameters Optional	object
profilerRuleConfigurations.[].s3OutputPath Optional	string
profilerRuleConfigurations.[].volumeSizeInGB Optional	integer
remoteDebugConfig Optional	object Configuration for remote debugging. To learn more about the remote debugging functionality of SageMaker, see Access a training container through Amazon Web Services Systems Manager (SSM) for remote debugging (https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-debugging.html).
remoteDebugConfig.enableRemoteDebug Optional	boolean
resourceConfig Required	object The resources, including the ML compute instances and ML storage volumes, to use for model training. ML storage volumes store model artifacts and incremental states. Training algorithms might also use ML storage volumes for scratch space. If you want SageMaker to use the ML storage volume to store the training data, choose File as the TrainingInputMode in the algorithm specification. For distributed training algorithms, specify an instance count greater than 1.
resourceConfig.instanceCount Optional	integer
resourceConfig.instanceGroups Optional	array
resourceConfig.instanceGroups.[] Required	object Defines an instance group for heterogeneous cluster training. When requesting
a training job using the CreateTrainingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html)
API, you can configure multiple instance groups.
resourceConfig.instanceGroups.[].instanceGroupName Optional	string
resourceConfig.instanceGroups.[].instanceType Optional	string
resourceConfig.instanceType Optional	string
resourceConfig.keepAlivePeriodInSeconds Optional	integer Optional. Customer requested period in seconds for which the Training cluster is kept alive after the job is finished.
resourceConfig.volumeKMSKeyID Optional	string
resourceConfig.volumeSizeInGB Optional	integer
retryStrategy Optional	object The number of times to retry the job when the job fails due to an InternalServerError.
retryStrategy.maximumRetryAttempts Optional	integer
roleARN Required	string The Amazon Resource Name (ARN) of an IAM role that SageMaker can assume to perform tasks on your behalf. During model training, SageMaker needs your permission to read input data from an S3 bucket, download a Docker image that contains training code, write model artifacts to an S3 bucket, write logs to Amazon CloudWatch Logs, and publish metrics to Amazon CloudWatch. You grant permissions for all of these tasks to an IAM role. For more information, see SageMaker Roles (https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). To be able to pass this role to SageMaker, the caller of this API must have the iam:PassRole permission. Regex Pattern: `^arn:aws[a-z\-]*:iam::\d{12}:role/?[a-zA-Z_0-9+=,.@\-_/]+$`
stoppingCondition Required	object Specifies a limit to how long a model training job can run. It also specifies how long a managed Spot training job has to complete. When the job reaches the time limit, SageMaker ends the training job. Use this API to cap model training costs. To stop a job, SageMaker sends the algorithm the SIGTERM signal, which delays job termination for 120 seconds. Algorithms can use this 120-second window to save the model artifacts, so the results of training are not lost.
stoppingCondition.maxPendingTimeInSeconds Optional	integer Maximum job scheduler pending time in seconds.
stoppingCondition.maxRuntimeInSeconds Optional	integer
stoppingCondition.maxWaitTimeInSeconds Optional	integer
tags Optional	array An array of key-value pairs. You can use tags to categorize your Amazon Web Services resources in different ways, for example, by purpose, owner, or environment. For more information, see Tagging Amazon Web Services Resources (https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html).
tags.[] Required	object A tag object that consists of a key and an optional value, used to manage
metadata for SageMaker Amazon Web Services resources.

You can add tags to notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, models, labeling jobs, work teams, endpoint configurations, and endpoints. For more information on adding tags to SageMaker resources, see AddTags (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AddTags.html).

For more information on adding metadata to your Amazon Web Services resources with tagging, see Tagging Amazon Web Services resources (https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html). For advice on best practices for managing Amazon Web Services resources with tagging, see Tagging Best Practices: Implement an Effective Amazon Web Services Resource Tagging Strategy (https://d1.awsstatic.com/whitepapers/aws-tagging-best-practices.pdf).
tags.[].key Optional	string
tags.[].value Optional	string
tensorBoardOutputConfig Optional	object Configuration of storage locations for the Amazon SageMaker Debugger TensorBoard output data.
tensorBoardOutputConfig.localPath Optional	string
tensorBoardOutputConfig.s3OutputPath Optional	string
trainingJobName Required	string The name of the training job. The name must be unique within an Amazon Web Services Region in an Amazon Web Services account. Regex Pattern: `^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$`
vpcConfig Optional	object A VpcConfig (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html) object that specifies the VPC that you want your training job to connect to. Control access to and from your training container by configuring the VPC. For more information, see Protect Training Jobs by Using an Amazon Virtual Private Cloud (https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html).
vpcConfig.securityGroupIDs Optional	array
vpcConfig.securityGroupIDs.[] Required	string
vpcConfig.subnets Optional	array
vpcConfig.subnets.[] Required	string
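
Several of the optional fields above are usually set together. The fragment below is a sketch, under stated assumptions, of a managed spot training configuration with a checkpoint location, two input channels named after the `training_data` and `validation_data` example in the `inputDataConfig` description, and a tag. Bucket names, paths, the content type, and the tag are hypothetical placeholders. The enum-style values (`S3Prefix`, `FullyReplicated`, `File`) are taken from the SageMaker API rather than from this page.

```yaml
# Fragment of a TrainingJob spec; merge it into a complete manifest.
# Buckets, paths, and the tag are hypothetical placeholders.
spec:
  enableManagedSpotTraining: true
  checkpointConfig:
    # Where checkpoints are persisted so an interrupted job can resume.
    s3URI: s3://example-bucket/checkpoints
    localPath: /opt/ml/checkpoints
  stoppingCondition:
    maxRuntimeInSeconds: 3600
    # For managed spot training, the wait limit covers waiting plus running,
    # so it should be at least maxRuntimeInSeconds.
    maxWaitTimeInSeconds: 7200
  inputDataConfig:
  - channelName: training_data
    contentType: text/csv
    inputMode: File
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3DataDistributionType: FullyReplicated
        s3URI: s3://example-bucket/data/train
  - channelName: validation_data
    contentType: text/csv
    inputMode: File
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3DataDistributionType: FullyReplicated
        s3URI: s3://example-bucket/data/validation
  tags:
  - key: owner
    value: example-team
```

The input buckets must be in the same Amazon Web Services Region as the training job, as noted in the `s3DataSource` description.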

Status

ackResourceMetadata: 
  arn: string
  ownerAccountID: string
  region: string
conditions:
- lastTransitionTime: string
  message: string
  reason: string
  status: string
  type: string
creationTime: string
debugRuleEvaluationStatuses:
- lastModifiedTime: string
  ruleConfigurationName: string
  ruleEvaluationJobARN: string
  ruleEvaluationStatus: string
  statusDetails: string
failureReason: string
lastModifiedTime: string
modelArtifacts: 
  s3ModelArtifacts: string
profilerRuleEvaluationStatuses:
- lastModifiedTime: string
  ruleConfigurationName: string
  ruleEvaluationJobARN: string
  ruleEvaluationStatus: string
  statusDetails: string
profilingStatus: string
secondaryStatus: string
trainingJobStatus: string
warmPoolStatus: 
  resourceRetainedBillableTimeInSeconds: integer
  reusedByJob: string
  status: string

Field	Description
ackResourceMetadata Optional	object All CRs managed by ACK have a common `Status.ACKResourceMetadata` member that is used to contain resource sync state, account ownership, and the constructed ARN for the resource.
ackResourceMetadata.arn Optional	string ARN is the Amazon Resource Name for the resource. This is a globally-unique identifier and is set only by the ACK service controller once the controller has orchestrated the creation of the resource OR when it has verified that an “adopted” resource (a resource where the ARN annotation was set by the Kubernetes user on the CR) exists and matches the supplied CR’s Spec field values. https://github.com/aws/aws-controllers-k8s/issues/270
ackResourceMetadata.ownerAccountID Required	string OwnerAccountID is the AWS Account ID of the account that owns the backend AWS service API resource.
ackResourceMetadata.region Required	string Region is the AWS region in which the resource exists or will exist.
conditions Optional	array All CRs managed by ACK have a common `Status.Conditions` member that contains a collection of `ackv1alpha1.Condition` objects that describe the various terminal states of the CR and its backend AWS service API resource.
conditions.[] Required	object Condition is the common struct used by all CRDs managed by ACK service
controllers to indicate terminal states of the CR and its backend AWS
service API resource.
conditions.[].message Optional	string A human readable message indicating details about the transition.
conditions.[].reason Optional	string The reason for the condition’s last transition.
conditions.[].status Optional	string Status of the condition, one of True, False, Unknown.
conditions.[].type Optional	string Type is the type of the Condition.
creationTime Optional	string A timestamp that indicates when the training job was created.
debugRuleEvaluationStatuses Optional	array Evaluation status of Amazon SageMaker Debugger rules for debugging on a training job.
debugRuleEvaluationStatuses.[] Required	object Information about the status of the rule evaluation.
debugRuleEvaluationStatuses.[].ruleConfigurationName Optional	string
debugRuleEvaluationStatuses.[].ruleEvaluationJobARN Optional	string
debugRuleEvaluationStatuses.[].ruleEvaluationStatus Optional	string
debugRuleEvaluationStatuses.[].statusDetails Optional	string
failureReason Optional	string If the training job failed, the reason it failed.
lastModifiedTime Optional	string A timestamp that indicates when the status of the training job was last modified.
modelArtifacts Optional	object Information about the Amazon S3 location that is configured for storing model artifacts.
modelArtifacts.s3ModelArtifacts Optional	string
profilerRuleEvaluationStatuses Optional	array Evaluation status of Amazon SageMaker Debugger rules for profiling on a training job.
profilerRuleEvaluationStatuses.[] Required	object Information about the status of the rule evaluation.
profilerRuleEvaluationStatuses.[].ruleConfigurationName Optional	string
profilerRuleEvaluationStatuses.[].ruleEvaluationJobARN Optional	string
profilerRuleEvaluationStatuses.[].ruleEvaluationStatus Optional	string
profilerRuleEvaluationStatuses.[].statusDetails Optional	string
profilingStatus Optional	string Profiling status of a training job.
secondaryStatus Optional	string Provides detailed information about the state of the training job. For detailed information on the secondary status of the training job, see StatusMessage under SecondaryStatusTransition (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_SecondaryStatusTransition.html). SageMaker provides primary statuses and secondary statuses that apply to each of them: InProgress * Starting - Starting the training job. * Downloading - An optional stage for algorithms that support File training input mode. It indicates that data is being downloaded to the ML storage volumes. * Training - Training is in progress. * Interrupted - The job stopped because the managed spot training instances were interrupted. * Uploading - Training is complete and the model artifacts are being uploaded to the S3 location. Completed * Completed - The training job has completed. Failed * Failed - The training job has failed. The reason for the failure is returned in the FailureReason field of DescribeTrainingJobResponse. Stopped * MaxRuntimeExceeded - The job stopped because it exceeded the maximum allowed runtime. * MaxWaitTimeExceeded - The job stopped because it exceeded the maximum allowed wait time. * Stopped - The training job has stopped. Stopping * Stopping - Stopping the training job. Valid values for SecondaryStatus are subject to change. We no longer support the following secondary statuses: * LaunchingMLInstances * PreparingTraining * DownloadingTrainingImage
trainingJobStatus Optional	string The status of the training job. SageMaker provides the following training job statuses: * InProgress - The training is in progress. * Completed - The training job has completed. * Failed - The training job has failed. To see the reason for the failure, see the FailureReason field in the response to a DescribeTrainingJobResponse call. * Stopping - The training job is stopping. * Stopped - The training job has stopped. For more detailed information, see SecondaryStatus.
warmPoolStatus Optional	object The status of the warm pool associated with the training job.
warmPoolStatus.resourceRetainedBillableTimeInSeconds Optional	integer Optional. Indicates how many seconds the resource stayed in ResourceRetained state. Populated only after resource reaches ResourceReused or ResourceReleased state.
warmPoolStatus.reusedByJob Optional	string
warmPoolStatus.status Optional	string
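
For orientation, a completed job might report a status shaped roughly like the following. This is an illustrative sketch, not captured controller output: the ARN, account ID, Region, and S3 path are hypothetical, and fields that do not apply are simply omitted. The `ACK.ResourceSynced` condition type comes from the ACK runtime rather than from this page.

```yaml
# Illustrative status for a finished job; all identifiers are hypothetical.
status:
  ackResourceMetadata:
    arn: arn:aws:sagemaker:us-west-2:111122223333:training-job/example-training-job
    ownerAccountID: "111122223333"
    region: us-west-2
  conditions:
  - type: ACK.ResourceSynced
    status: "True"
  # Primary status and its matching secondary status.
  trainingJobStatus: Completed
  secondaryStatus: Completed
  modelArtifacts:
    s3ModelArtifacts: s3://example-bucket/training-output/example-training-job/output/model.tar.gz
```

When `trainingJobStatus` is `Failed`, `failureReason` is the field to inspect first.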
