TrainingJob
sagemaker.services.k8s.aws/v1alpha1
Type | Link |
---|---|
GoDoc | sagemaker-controller/apis/v1alpha1#TrainingJob |
Metadata
Property | Value |
---|---|
Scope | Namespaced |
Kind | TrainingJob |
ListKind | TrainingJobList |
Plural | trainingjobs |
Singular | trainingjob |
Contains information about a training job.
Spec
algorithmSpecification:
  algorithmName: string
  enableSageMakerMetricsTimeSeries: boolean
  metricDefinitions:
  - name: string
    regex: string
  trainingImage: string
  trainingInputMode: string
checkpointConfig:
  localPath: string
  s3URI: string
debugHookConfig:
  collectionConfigurations:
  - collectionName: string
    collectionParameters: {}
  hookParameters: {}
  localPath: string
  s3OutputPath: string
debugRuleConfigurations:
- instanceType: string
  localPath: string
  ruleConfigurationName: string
  ruleEvaluatorImage: string
  ruleParameters: {}
  s3OutputPath: string
  volumeSizeInGB: integer
enableInterContainerTrafficEncryption: boolean
enableManagedSpotTraining: boolean
enableNetworkIsolation: boolean
environment: {}
experimentConfig:
  experimentName: string
  trialComponentDisplayName: string
  trialName: string
hyperParameters: {}
infraCheckConfig:
  enableInfraCheck: boolean
inputDataConfig:
- channelName: string
  compressionType: string
  contentType: string
  dataSource:
    fileSystemDataSource:
      directoryPath: string
      fileSystemAccessMode: string
      fileSystemID: string
      fileSystemType: string
    s3DataSource:
      attributeNames:
      - string
      instanceGroupNames:
      - string
      s3DataDistributionType: string
      s3DataType: string
      s3URI: string
  inputMode: string
  recordWrapperType: string
  shuffleConfig:
    seed: integer
outputDataConfig:
  compressionType: string
  kmsKeyID: string
  s3OutputPath: string
profilerConfig:
  profilingIntervalInMilliseconds: integer
  profilingParameters: {}
  s3OutputPath: string
profilerRuleConfigurations:
- instanceType: string
  localPath: string
  ruleConfigurationName: string
  ruleEvaluatorImage: string
  ruleParameters: {}
  s3OutputPath: string
  volumeSizeInGB: integer
remoteDebugConfig:
  enableRemoteDebug: boolean
resourceConfig:
  instanceCount: integer
  instanceGroups:
  - instanceCount: integer
    instanceGroupName: string
    instanceType: string
  instanceType: string
  keepAlivePeriodInSeconds: integer
  volumeKMSKeyID: string
  volumeSizeInGB: integer
retryStrategy:
  maximumRetryAttempts: integer
roleARN: string
stoppingCondition:
  maxPendingTimeInSeconds: integer
  maxRuntimeInSeconds: integer
  maxWaitTimeInSeconds: integer
tags:
- key: string
  value: string
tensorBoardOutputConfig:
  localPath: string
  s3OutputPath: string
trainingJobName: string
vpcConfig:
  securityGroupIDs:
  - string
  subnets:
  - string
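As a rough illustration of how the schema above is filled in, the following sketch shows a minimal TrainingJob manifest that sets the top-level fields marked Required in the field table, plus a single S3 input channel. Every concrete value (object name, job name, role ARN, image URI, bucket, instance type, sizes) is a placeholder assumption and does not come from this reference; substitute values from your own account.

```yaml
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: example-training-job            # hypothetical Kubernetes object name
spec:
  trainingJobName: example-training-job # must be unique per Region and account
  roleARN: arn:aws:iam::111122223333:role/example-sagemaker-role  # placeholder role
  algorithmSpecification:
    # Placeholder image URI; point this at your own training image.
    trainingImage: 111122223333.dkr.ecr.us-west-2.amazonaws.com/example-image:latest
    trainingInputMode: File
  inputDataConfig:
  - channelName: train
    contentType: text/csv
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3DataDistributionType: FullyReplicated
        s3URI: s3://example-bucket/train/   # placeholder; same Region as the job
  outputDataConfig:
    s3OutputPath: s3://example-bucket/output/   # placeholder
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m5.xlarge
    volumeSizeInGB: 10
  stoppingCondition:
    maxRuntimeInSeconds: 3600
```

The input channel is included because most algorithms need at least one, even though inputDataConfig is listed as Optional.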
Field | Description |
---|---|
algorithmSpecification Required | object The registry path of the Docker image that contains the training algorithm and algorithm-specific metadata, including the input mode. For more information about algorithms provided by SageMaker, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). For information about providing your own algorithms, see Using Your Own Algorithms with Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html). |
algorithmSpecification.algorithmName Optional | string |
algorithmSpecification.enableSageMakerMetricsTimeSeries Optional | boolean |
algorithmSpecification.metricDefinitions Optional | array |
algorithmSpecification.metricDefinitions.[] Required | object Specifies a metric that the training algorithm writes to stderr or stdout. You can view these logs to understand how your training job performs and check for any errors encountered during training. SageMaker hyperparameter tuning captures all defined metrics. Specify one of the defined metrics to use as an objective metric using the TuningObjective (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html#sagemaker-Type-HyperParameterTrainingJobDefinition-TuningObjective) parameter in the HyperParameterTrainingJobDefinition API to evaluate job performance during hyperparameter tuning. |
algorithmSpecification.metricDefinitions.[].regex Optional | string |
algorithmSpecification.trainingImage Optional | string |
algorithmSpecification.trainingInputMode Optional | string The training input mode that the algorithm supports. For more information about input modes, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). Pipe mode: If an algorithm supports Pipe mode, Amazon SageMaker streams data directly from Amazon S3 to the container. File mode: If an algorithm supports File mode, SageMaker downloads the training data from S3 to the provisioned ML storage volume, and mounts the directory to the Docker volume for the training container. You must provision the ML storage volume with sufficient capacity to accommodate the data downloaded from S3. In addition to the training data, the ML storage volume also stores the output model. The algorithm container also uses the ML storage volume to store intermediate information, if any. For distributed algorithms, training data is distributed uniformly. Your training duration is predictable if the input data objects are approximately the same size. SageMaker does not split the files any further for model training. If the object sizes are skewed, training won’t be optimal, because the data distribution is also skewed and an overloaded host in the training cluster becomes a bottleneck. FastFile mode: If an algorithm supports FastFile mode, SageMaker streams data directly from S3 to the container with no code changes, and provides file system access to the data. Users can author their training script to interact with these files as if they were stored on disk. FastFile mode works best when the data is read sequentially. Augmented manifest files aren’t supported. The startup time is lower when there are fewer files in the S3 bucket provided. |
checkpointConfig Optional | object Contains information about the output location for managed spot training checkpoint data. |
checkpointConfig.localPath Optional | string |
checkpointConfig.s3URI Optional | string |
debugHookConfig Optional | object Configuration information for the Amazon SageMaker Debugger hook parameters, metric and tensor collections, and storage paths. To learn more about how to configure the DebugHookConfig parameter, see Use the SageMaker and Debugger Configuration API Operations to Create, Update, and Debug Your Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html). |
debugHookConfig.collectionConfigurations Optional | array |
debugHookConfig.collectionConfigurations.[] Required | object Configuration information for the Amazon SageMaker Debugger output tensor collections. |
debugHookConfig.collectionConfigurations.[].collectionParameters Optional | object |
debugHookConfig.hookParameters Optional | object |
debugHookConfig.localPath Optional | string |
debugHookConfig.s3OutputPath Optional | string |
debugRuleConfigurations Optional | array Configuration information for Amazon SageMaker Debugger rules for debugging output tensors. |
debugRuleConfigurations.[] Required | object Configuration information for SageMaker Debugger rules for debugging. To learn more about how to configure the DebugRuleConfiguration parameter, see Use the SageMaker and Debugger Configuration API Operations to Create, Update, and Debug Your Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html). |
debugRuleConfigurations.[].localPath Optional | string |
debugRuleConfigurations.[].ruleConfigurationName Optional | string |
debugRuleConfigurations.[].ruleEvaluatorImage Optional | string |
debugRuleConfigurations.[].ruleParameters Optional | object |
debugRuleConfigurations.[].s3OutputPath Optional | string |
debugRuleConfigurations.[].volumeSizeInGB Optional | integer |
enableInterContainerTrafficEncryption Optional | boolean To encrypt all communications between ML compute instances in distributed training, choose True. Encryption provides greater security for distributed training, but training might take longer. How long it takes depends on the amount of communication between compute instances, especially if you use a deep learning algorithm in distributed training. For more information, see Protect Communications Between ML Compute Instances in a Distributed Training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/train-encrypt.html). |
enableManagedSpotTraining Optional | boolean To train models using managed spot training, choose True. Managed spot training provides a fully managed and scalable infrastructure for training machine learning models. This option is useful when training jobs can be interrupted and when there is flexibility in when the training job runs. The complete and intermediate results of jobs are stored in an Amazon S3 bucket, and can be used as a starting point to train models incrementally. Amazon SageMaker provides metrics and logs in CloudWatch. They can be used to see when managed spot training jobs are running, interrupted, resumed, or completed. |
enableNetworkIsolation Optional | boolean Isolates the training container. No inbound or outbound network calls can be made, except for calls between peers within a training cluster for distributed training. If you enable network isolation for training jobs that are configured to use a VPC, SageMaker downloads and uploads customer data and model artifacts through the specified VPC, but the training container does not have network access. |
environment Optional | object The environment variables to set in the Docker container. |
experimentConfig Optional | object Associates a SageMaker job as a trial component with an experiment and trial. Specified when you call the following APIs: * CreateProcessingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html) * CreateTrainingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) * CreateTransformJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) |
experimentConfig.experimentName Optional | string |
experimentConfig.trialComponentDisplayName Optional | string |
experimentConfig.trialName Optional | string |
hyperParameters Optional | object Algorithm-specific parameters that influence the quality of the model. You set hyperparameters before you start the learning process. For a list of hyperparameters for each training algorithm provided by SageMaker, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). You can specify a maximum of 100 hyperparameters. Each hyperparameter is a key-value pair. Each key and value is limited to 256 characters, as specified by the Length Constraint. Do not include any security-sensitive information, including account access IDs, secrets, or tokens, in any hyperparameter field. If the use of security-sensitive credentials is detected, SageMaker will reject your training job request and return an exception error. |
infraCheckConfig Optional | object Contains information about the infrastructure health check configuration for the training job. |
infraCheckConfig.enableInfraCheck Optional | boolean |
inputDataConfig Optional | array An array of Channel objects. Each channel is a named input source. InputDataConfig describes the input data and its location. Algorithms can accept input data from one or more channels. For example, an algorithm might have two channels of input data, training_data and validation_data. The configuration for each channel provides the S3, EFS, or FSx location where the input data is stored. It also provides information about the stored data: the MIME type, compression method, and whether the data is wrapped in RecordIO format. Depending on the input mode that the algorithm supports, SageMaker either copies input data files from an S3 bucket to a local directory in the Docker container, or makes it available as input streams. For example, if you specify an EFS location, input data files are available as input streams. They do not need to be downloaded. Your input must be in the same Amazon Web Services region as your training job. |
inputDataConfig.[] Required | object A channel is a named input source that training algorithms can consume. |
inputDataConfig.[].compressionType Optional | string |
inputDataConfig.[].contentType Optional | string |
inputDataConfig.[].dataSource Optional | object Describes the location of the channel data. |
inputDataConfig.[].dataSource.fileSystemDataSource Optional | object Specifies a file system data source for a channel. |
inputDataConfig.[].dataSource.fileSystemDataSource.directoryPath Optional | string |
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemAccessMode Optional | string |
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemID Optional | string |
inputDataConfig.[].dataSource.fileSystemDataSource.fileSystemType Optional | string |
inputDataConfig.[].dataSource.s3DataSource Optional | object Describes the S3 data source. Your input bucket must be in the same Amazon Web Services region as your training job. |
inputDataConfig.[].dataSource.s3DataSource.attributeNames Optional | array |
inputDataConfig.[].dataSource.s3DataSource.attributeNames.[] Required | string |
inputDataConfig.[].dataSource.s3DataSource.instanceGroupNames.[] Required | string |
inputDataConfig.[].dataSource.s3DataSource.s3DataType Optional | string |
inputDataConfig.[].dataSource.s3DataSource.s3URI Optional | string |
inputDataConfig.[].inputMode Optional | string The training input mode that the algorithm supports. For more information about input modes, see Algorithms (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). Pipe mode: If an algorithm supports Pipe mode, Amazon SageMaker streams data directly from Amazon S3 to the container. File mode: If an algorithm supports File mode, SageMaker downloads the training data from S3 to the provisioned ML storage volume, and mounts the directory to the Docker volume for the training container. You must provision the ML storage volume with sufficient capacity to accommodate the data downloaded from S3. In addition to the training data, the ML storage volume also stores the output model. The algorithm container also uses the ML storage volume to store intermediate information, if any. For distributed algorithms, training data is distributed uniformly. Your training duration is predictable if the input data objects are approximately the same size. SageMaker does not split the files any further for model training. If the object sizes are skewed, training won’t be optimal, because the data distribution is also skewed and an overloaded host in the training cluster becomes a bottleneck. FastFile mode: If an algorithm supports FastFile mode, SageMaker streams data directly from S3 to the container with no code changes, and provides file system access to the data. Users can author their training script to interact with these files as if they were stored on disk. FastFile mode works best when the data is read sequentially. Augmented manifest files aren’t supported. The startup time is lower when there are fewer files in the S3 bucket provided. |
inputDataConfig.[].recordWrapperType Optional | string |
inputDataConfig.[].shuffleConfig Optional | object A configuration for a shuffle option for input data in a channel. If you use S3Prefix for S3DataType, the results of the S3 key prefix matches are shuffled. If you use ManifestFile, the order of the S3 object references in the ManifestFile is shuffled. If you use AugmentedManifestFile, the order of the JSON lines in the AugmentedManifestFile is shuffled. The shuffling order is determined using the Seed value. For Pipe input mode, when ShuffleConfig is specified shuffling is done at the start of every epoch. With large datasets, this ensures that the order of the training data is different for each epoch, and it helps reduce bias and possible overfitting. In a multi-node training job when ShuffleConfig is combined with S3DataDistributionType of ShardedByS3Key, the data is shuffled across nodes so that the content sent to a particular node on the first epoch might be sent to a different node on the second epoch. |
inputDataConfig.[].shuffleConfig.seed Optional | integer |
outputDataConfig Required | object Specifies the path to the S3 location where you want to store model artifacts. SageMaker creates subfolders for the artifacts. |
outputDataConfig.compressionType Optional | string |
outputDataConfig.kmsKeyID Optional | string |
outputDataConfig.s3OutputPath Optional | string |
profilerConfig Optional | object Configuration information for Amazon SageMaker Debugger system monitoring, framework profiling, and storage paths. |
profilerConfig.profilingIntervalInMilliseconds Optional | integer |
profilerConfig.profilingParameters Optional | object |
profilerConfig.s3OutputPath Optional | string |
profilerRuleConfigurations Optional | array Configuration information for Amazon SageMaker Debugger rules for profiling system and framework metrics. |
profilerRuleConfigurations.[] Required | object Configuration information for profiling rules. |
profilerRuleConfigurations.[].localPath Optional | string |
profilerRuleConfigurations.[].ruleConfigurationName Optional | string |
profilerRuleConfigurations.[].ruleEvaluatorImage Optional | string |
profilerRuleConfigurations.[].ruleParameters Optional | object |
profilerRuleConfigurations.[].s3OutputPath Optional | string |
profilerRuleConfigurations.[].volumeSizeInGB Optional | integer |
remoteDebugConfig Optional | object Configuration for remote debugging. To learn more about the remote debugging functionality of SageMaker, see Access a training container through Amazon Web Services Systems Manager (SSM) for remote debugging (https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-debugging.html). |
remoteDebugConfig.enableRemoteDebug Optional | boolean |
resourceConfig Required | object The resources, including the ML compute instances and ML storage volumes, to use for model training. ML storage volumes store model artifacts and incremental states. Training algorithms might also use ML storage volumes for scratch space. If you want SageMaker to use the ML storage volume to store the training data, choose File as the TrainingInputMode in the algorithm specification. For distributed training algorithms, specify an instance count greater than 1. |
resourceConfig.instanceCount Optional | integer |
resourceConfig.instanceGroups Optional | array |
resourceConfig.instanceGroups.[] Required | object Defines an instance group for heterogeneous cluster training. When requesting a training job using the CreateTrainingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API, you can configure multiple instance groups. |
resourceConfig.instanceGroups.[].instanceGroupName Optional | string |
resourceConfig.instanceGroups.[].instanceType Optional | string |
resourceConfig.instanceType Optional | string |
resourceConfig.keepAlivePeriodInSeconds Optional | integer Optional. Customer requested period in seconds for which the Training cluster is kept alive after the job is finished. |
resourceConfig.volumeKMSKeyID Optional | string |
resourceConfig.volumeSizeInGB Optional | integer |
retryStrategy Optional | object The number of times to retry the job when the job fails due to an InternalServerError. |
retryStrategy.maximumRetryAttempts Optional | integer |
roleARN Required | string The Amazon Resource Name (ARN) of an IAM role that SageMaker can assume to perform tasks on your behalf. During model training, SageMaker needs your permission to read input data from an S3 bucket, download a Docker image that contains training code, write model artifacts to an S3 bucket, write logs to Amazon CloudWatch Logs, and publish metrics to Amazon CloudWatch. You grant permissions for all of these tasks to an IAM role. For more information, see SageMaker Roles (https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). To be able to pass this role to SageMaker, the caller of this API must have the iam:PassRole permission. |
stoppingCondition Required | object Specifies a limit to how long a model training job can run. It also specifies how long a managed Spot training job has to complete. When the job reaches the time limit, SageMaker ends the training job. Use this API to cap model training costs. To stop a job, SageMaker sends the algorithm the SIGTERM signal, which delays job termination for 120 seconds. Algorithms can use this 120-second window to save the model artifacts, so the results of training are not lost. |
stoppingCondition.maxPendingTimeInSeconds Optional | integer Maximum job scheduler pending time in seconds. |
stoppingCondition.maxRuntimeInSeconds Optional | integer |
stoppingCondition.maxWaitTimeInSeconds Optional | integer |
tags Optional | array An array of key-value pairs. You can use tags to categorize your Amazon Web Services resources in different ways, for example, by purpose, owner, or environment. For more information, see Tagging Amazon Web Services Resources (https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html). |
tags.[] Required | object A tag object that consists of a key and an optional value, used to manage metadata for SageMaker Amazon Web Services resources. You can add tags to notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, models, labeling jobs, work teams, endpoint configurations, and endpoints. For more information on adding tags to SageMaker resources, see AddTags (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AddTags.html). For more information on adding metadata to your Amazon Web Services resources with tagging, see Tagging Amazon Web Services resources (https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html). For advice on best practices for managing Amazon Web Services resources with tagging, see Tagging Best Practices: Implement an Effective Amazon Web Services Resource Tagging Strategy (https://d1.awsstatic.com/whitepapers/aws-tagging-best-practices.pdf). |
tags.[].key Optional | string |
tags.[].value Optional | string |
tensorBoardOutputConfig Optional | object Configuration of storage locations for the Amazon SageMaker Debugger TensorBoard output data. |
tensorBoardOutputConfig.localPath Optional | string |
tensorBoardOutputConfig.s3OutputPath Optional | string |
trainingJobName Required | string The name of the training job. The name must be unique within an Amazon Web Services Region in an Amazon Web Services account. |
vpcConfig Optional | object A VpcConfig (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html) object that specifies the VPC that you want your training job to connect to. Control access to and from your training container by configuring the VPC. For more information, see Protect Training Jobs by Using an Amazon Virtual Private Cloud (https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html). |
vpcConfig.securityGroupIDs Optional | array |
vpcConfig.securityGroupIDs.[] Required | string |
vpcConfig.subnets Optional | array |
vpcConfig.subnets.[] Required | string |
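Several of the optional fields described above are typically set together. The fragment below is a sketch, not a complete manifest, of a spec that combines managed spot training with checkpointing, a metric definition, hyperparameters, and tags. The bucket name, the metric name, the log line format that the regex is written against, and the hyperparameter and tag names are all assumptions made for illustration.

```yaml
# Fragment of a TrainingJob spec; merge these keys into a complete manifest.
spec:
  enableManagedSpotTraining: true
  checkpointConfig:
    s3URI: s3://example-bucket/checkpoints/   # placeholder bucket
    localPath: /opt/ml/checkpoints            # SageMaker's default local checkpoint directory
  stoppingCondition:
    maxRuntimeInSeconds: 3600
    maxWaitTimeInSeconds: 7200   # for spot jobs, must not be less than maxRuntimeInSeconds
  algorithmSpecification:
    metricDefinitions:
    # Assumes the training script prints lines such as "val_loss=0.1234".
    - name: validation:loss
      regex: 'val_loss=([0-9.]+)'
  hyperParameters:
    # Hypothetical names; values are passed as strings, so quote numbers.
    epochs: "10"
    learning_rate: "0.01"
  tags:
  - key: team
    value: ml-platform
```

Writing checkpoints to S3 is what lets an interrupted spot job resume from intermediate results instead of restarting, which is why the two settings appear together here.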
Status
ackResourceMetadata:
  arn: string
  ownerAccountID: string
  region: string
conditions:
- lastTransitionTime: string
  message: string
  reason: string
  status: string
  type: string
creationTime: string
debugRuleEvaluationStatuses:
- lastModifiedTime: string
  ruleConfigurationName: string
  ruleEvaluationJobARN: string
  ruleEvaluationStatus: string
  statusDetails: string
failureReason: string
lastModifiedTime: string
modelArtifacts:
  s3ModelArtifacts: string
profilerRuleEvaluationStatuses:
- lastModifiedTime: string
  ruleConfigurationName: string
  ruleEvaluationJobARN: string
  ruleEvaluationStatus: string
  statusDetails: string
profilingStatus: string
secondaryStatus: string
trainingJobStatus: string
warmPoolStatus:
  resourceRetainedBillableTimeInSeconds: integer
  reusedByJob: string
  status: string
Field | Description |
---|---|
ackResourceMetadata Optional | object All CRs managed by ACK have a common Status.ACKResourceMetadata member that is used to contain resource sync state, account ownership, and the constructed ARN for the resource. |
ackResourceMetadata.arn Optional | string ARN is the Amazon Resource Name for the resource. This is a globally-unique identifier and is set only by the ACK service controller once the controller has orchestrated the creation of the resource OR when it has verified that an “adopted” resource (a resource where the ARN annotation was set by the Kubernetes user on the CR) exists and matches the supplied CR’s Spec field values. TODO(vijat@): Find a better strategy for resources that do not have ARN in CreateOutputResponse https://github.com/aws/aws-controllers-k8s/issues/270 |
ackResourceMetadata.ownerAccountID Required | string OwnerAccountID is the AWS Account ID of the account that owns the backend AWS service API resource. |
ackResourceMetadata.region Required | string Region is the AWS region in which the resource exists or will exist. |
conditions Optional | array All CRs managed by ACK have a common Status.Conditions member that contains a collection of ackv1alpha1.Condition objects that describe the various terminal states of the CR and its backend AWS service API resource. |
conditions.[] Required | object Condition is the common struct used by all CRDs managed by ACK service controllers to indicate terminal states of the CR and its backend AWS service API resource. |
conditions.[].message Optional | string A human readable message indicating details about the transition. |
conditions.[].reason Optional | string The reason for the condition’s last transition. |
conditions.[].status Optional | string Status of the condition, one of True, False, Unknown. |
conditions.[].type Optional | string Type is the type of the Condition |
creationTime Optional | string A timestamp that indicates when the training job was created. |
debugRuleEvaluationStatuses Optional | array Evaluation status of Amazon SageMaker Debugger rules for debugging on a training job. |
debugRuleEvaluationStatuses.[] Required | object Information about the status of the rule evaluation. |
debugRuleEvaluationStatuses.[].ruleConfigurationName Optional | string |
debugRuleEvaluationStatuses.[].ruleEvaluationJobARN Optional | string |
debugRuleEvaluationStatuses.[].ruleEvaluationStatus Optional | string |
debugRuleEvaluationStatuses.[].statusDetails Optional | string |
failureReason Optional | string If the training job failed, the reason it failed. |
lastModifiedTime Optional | string A timestamp that indicates when the status of the training job was last modified. |
modelArtifacts Optional | object Information about the Amazon S3 location that is configured for storing model artifacts. |
modelArtifacts.s3ModelArtifacts Optional | string |
profilerRuleEvaluationStatuses Optional | array Evaluation status of Amazon SageMaker Debugger rules for profiling on a training job. |
profilerRuleEvaluationStatuses.[] Required | object Information about the status of the rule evaluation. |
profilerRuleEvaluationStatuses.[].ruleConfigurationName Optional | string |
profilerRuleEvaluationStatuses.[].ruleEvaluationJobARN Optional | string |
profilerRuleEvaluationStatuses.[].ruleEvaluationStatus Optional | string |
profilerRuleEvaluationStatuses.[].statusDetails Optional | string |
profilingStatus Optional | string Profiling status of a training job. |
secondaryStatus Optional | string Provides detailed information about the state of the training job. For detailed information on the secondary status of the training job, see StatusMessage under SecondaryStatusTransition (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_SecondaryStatusTransition.html). SageMaker provides primary statuses and secondary statuses that apply to each of them: InProgress * Starting - Starting the training job. * Downloading - An optional stage for algorithms that support File training input mode. It indicates that data is being downloaded to the ML storage volumes. * Training - Training is in progress. * Interrupted - The job stopped because the managed spot training instances were interrupted. * Uploading - Training is complete and the model artifacts are being uploaded to the S3 location. Completed * Completed - The training job has completed. Failed * Failed - The training job has failed. The reason for the failure is returned in the FailureReason field of DescribeTrainingJobResponse. Stopped * MaxRuntimeExceeded - The job stopped because it exceeded the maximum allowed runtime. * MaxWaitTimeExceeded - The job stopped because it exceeded the maximum allowed wait time. * Stopped - The training job has stopped. Stopping * Stopping - Stopping the training job. Valid values for SecondaryStatus are subject to change. We no longer support the following secondary statuses: * LaunchingMLInstances * PreparingTraining * DownloadingTrainingImage |
trainingJobStatus Optional | string The status of the training job. SageMaker provides the following training job statuses: * InProgress - The training is in progress. * Completed - The training job has completed. * Failed - The training job has failed. To see the reason for the failure, see the FailureReason field in the response to a DescribeTrainingJobResponse call. * Stopping - The training job is stopping. * Stopped - The training job has stopped. For more detailed information, see SecondaryStatus. |
warmPoolStatus Optional | object The status of the warm pool associated with the training job. |
warmPoolStatus.resourceRetainedBillableTimeInSeconds Optional | integer Optional. Indicates how many seconds the resource stayed in ResourceRetained state. Populated only after resource reaches ResourceReused or ResourceReleased state. |
warmPoolStatus.reusedByJob Optional | string |
warmPoolStatus.status Optional | string |
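For orientation, here is a sketch of how the status fields above might look on a TrainingJob that has finished successfully. All values are illustrative assumptions (account ID, Region, job name, timestamps, bucket path), not output captured from a real cluster. ACK.ResourceSynced is the sync condition type commonly reported by ACK controllers; which conditions and fields are actually populated depends on the controller version and on the job.

```yaml
# Illustrative status only; every value below is a placeholder.
status:
  ackResourceMetadata:
    arn: arn:aws:sagemaker:us-west-2:111122223333:training-job/example-training-job
    ownerAccountID: "111122223333"
    region: us-west-2
  conditions:
  - type: ACK.ResourceSynced
    status: "True"
    lastTransitionTime: "2024-01-01T00:30:00Z"
  creationTime: "2024-01-01T00:00:00Z"
  lastModifiedTime: "2024-01-01T00:30:00Z"
  trainingJobStatus: Completed    # primary status
  secondaryStatus: Completed      # detailed status within the primary status
  modelArtifacts:
    s3ModelArtifacts: s3://example-bucket/output/example-training-job/output/model.tar.gz
```

When a job fails, trainingJobStatus and secondaryStatus would read Failed, and failureReason would carry the explanation.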