Orca API¶
orca.learn.bigdl.estimator¶
- class zoo.orca.learn.bigdl.estimator.BigDLEstimator(*, model, loss, optimizer=None, metrics=None, feature_preprocessing=None, label_preprocessing=None, model_dir=None)[source]¶
Bases:
zoo.orca.learn.spark_estimator.Estimator
- clear_gradient_clipping()[source]¶
Clear gradient clipping parameters. In this case, gradient clipping will not be applied. In order to take effect, it needs to be called before fit.
- Returns
- evaluate(data, batch_size=32, feature_cols=None, label_cols=None)[source]¶
Evaluate model.
- Parameters
data – validation data. It can be XShards, each partition is a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a list of numpy arrays.
batch_size – Batch size used for validation. Default: 32.
feature_cols – (Not supported yet) Feature column name(s) of data. Only used when data is a Spark DataFrame. Default: None.
label_cols – (Not supported yet) Label column name(s) of data. Only used when data is a Spark DataFrame. Default: None.
- Returns
- fit(data, epochs, batch_size=32, feature_cols='features', label_cols='label', caching_sample=True, validation_data=None, validation_trigger=None, checkpoint_trigger=None)[source]¶
Train this BigDL model with train data.
- Parameters
data – train data. It can be XShards or Spark DataFrame. If data is XShards, each partition is a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a list of numpy arrays.
epochs – Number of epochs to train the model.
batch_size – Batch size used for training. Default: 32.
feature_cols – Feature column name(s) of data. Only used when data is a Spark DataFrame. Default: “features”.
label_cols – Label column name(s) of data. Only used when data is a Spark DataFrame. Default: “label”.
caching_sample – whether to cache the Samples after preprocessing. Default: True
validation_data – Validation data. XShards and Spark DataFrame are supported. If data is XShards, each partition is a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a list of numpy arrays.
validation_trigger – Orca Trigger to trigger validation computation.
checkpoint_trigger – Orca Trigger to set a checkpoint.
- Returns
- get_train_summary(tag=None)[source]¶
Get the scalar from the model train summary. Returns a list of summary records, each of the form [iteration_number, scalar_value, timestamp].
- Parameters
tag – The name of the scalar to retrieve.
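A minimal sketch of post-processing the documented record format. The `records` values and the helper `split_summary` are hypothetical and stand in for the result of a real `get_train_summary` call; they are not part of the Orca API.

```python
from typing import List, Sequence, Tuple


def split_summary(
    records: Sequence[Sequence[float]],
) -> Tuple[List[int], List[float], List[float]]:
    """Split [iteration_number, scalar_value, timestamp] records into three lists."""
    iterations = [int(r[0]) for r in records]
    values = [float(r[1]) for r in records]
    timestamps = [float(r[2]) for r in records]
    return iterations, values, timestamps


# Hypothetical records shaped like the documented return value.
records = [
    [1, 2.31, 1600000000.0],
    [2, 1.87, 1600000005.0],
    [3, 1.52, 1600000010.0],
]
iterations, losses, _ = split_summary(records)

# Iteration at which the recorded scalar was smallest.
best_iteration = iterations[losses.index(min(losses))]
```

Splitting the records this way makes it straightforward to plot the scalar against the iteration number or to locate the best iteration.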
- get_validation_summary(tag=None)[source]¶
Get the scalar from the model validation summary. Returns a list of summary records, each of the form [iteration_number, scalar_value, timestamp].
Note: The metric name and the summary tag may differ. Use the following table to find the tag to pass: the left side is the metric used during compile, and the right side is the tag to pass.
‘Accuracy’ | ‘Top1Accuracy’
‘BinaryAccuracy’ | ‘Top1Accuracy’
‘CategoricalAccuracy’ | ‘Top1Accuracy’
‘SparseCategoricalAccuracy’ | ‘Top1Accuracy’
‘AUC’ | ‘AucScore’
‘HitRatio’ | ‘HitRate@k’ (k is Top-k)
‘Loss’ | ‘Loss’
‘MAE’ | ‘MAE’
‘NDCG’ | ‘NDCG’
‘TFValidationMethod’ | ‘${name + ” ” + valMethod.toString()}’
‘Top5Accuracy’ | ‘Top5Accuracy’
‘TreeNNAccuracy’ | ‘TreeNNAccuracy()’
‘MeanAveragePrecision’ | ‘MAP@k’ (k is Top-k) (BigDL)
‘MeanAveragePrecision’ | ‘PascalMeanAveragePrecision’ (Zoo)
‘StatelessMetric’ | ‘${name}’
- Parameters
tag – The name of the scalar to retrieve.
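The metric-to-tag mapping in the note above can be kept as a small lookup table. This is a sketch only: `METRIC_TO_TAG` and `summary_tag` are hypothetical helper names, and entries whose tag depends on runtime values (HitRatio, MeanAveragePrecision, TFValidationMethod, StatelessMetric) are left out.

```python
# Metric name used at compile time -> tag to pass to get_validation_summary,
# transcribed from the note above. Only entries with a fixed tag are included.
METRIC_TO_TAG = {
    "Accuracy": "Top1Accuracy",
    "BinaryAccuracy": "Top1Accuracy",
    "CategoricalAccuracy": "Top1Accuracy",
    "SparseCategoricalAccuracy": "Top1Accuracy",
    "AUC": "AucScore",
    "Loss": "Loss",
    "MAE": "MAE",
    "NDCG": "NDCG",
    "Top5Accuracy": "Top5Accuracy",
    "TreeNNAccuracy": "TreeNNAccuracy()",
}


def summary_tag(metric_name: str) -> str:
    """Return the summary tag for a metric, or raise if no fixed tag is listed."""
    try:
        return METRIC_TO_TAG[metric_name]
    except KeyError:
        raise KeyError(f"No fixed tag listed for metric {metric_name!r}") from None
```

The returned string would then be passed as the `tag` argument, for example `summary_tag("AUC")` for a model compiled with the AUC metric.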
- load(checkpoint, optimizer=None, loss=None, feature_preprocessing=None, label_preprocessing=None, model_dir=None, is_checkpoint=False)[source]¶
Load existing BigDL model or checkpoint
- Parameters
checkpoint – Path to the existing model or checkpoint.
optimizer – BigDL optimizer.
loss – BigDL criterion.
feature_preprocessing –
Used when data in fit and predict is a Spark DataFrame. The param converts the data in feature column to a Tensor or to a Sample directly. It expects a List of Int as the size of the converted Tensor, or a Preprocessing[F, Tensor[T]]
If a List of Int is set as feature_preprocessing, it can only handle the case that feature column contains the following data types: Float, Double, Int, Array[Float], Array[Double], Array[Int] and MLlib Vector. The feature data are converted to Tensors with the specified sizes before sending to the model. Internally, a SeqToTensor is generated according to the size, and used as the feature_preprocessing.
Alternatively, user can set feature_preprocessing as Preprocessing[F, Tensor[T]] that transforms the feature data to a Tensor[T]. Some pre-defined Preprocessing are provided in package zoo.feature. Multiple Preprocessing can be combined as a ChainedPreprocessing.
The feature_preprocessing will also be copied to the generated NNModel and applied to feature column during transform.
label_preprocessing – Used when data in fit and predict is a Spark DataFrame. Similar to feature_preprocessing, but applies to label data.
model_dir – The path to save model. During the training, if checkpoint_trigger is defined and triggered, the model will be saved to model_dir.
is_checkpoint – Whether the path is a checkpoint or a saved BigDL model. Default: False.
- Returns
The loaded estimator object.
- load_latest_orca_checkpoint(path)[source]¶
Load latest Orca checkpoint under specified directory.
- Parameters
path – directory containing Orca checkpoint files.
- load_orca_checkpoint(path, version, prefix=None)[source]¶
Load existing checkpoint
- Parameters
path – Path to the existing checkpoint.
version – checkpoint version, which is the suffix of the model.* file; e.g., for the file model.4, the version is 4.
prefix – optimMethod prefix, for example ‘optimMethod-Sequentialf53bddcc’
- Returns
- predict(data, batch_size=4, feature_cols='features', sample_preprocessing=None)[source]¶
Predict input data
- Parameters
data – predict input data. It can be XShards or Spark DataFrame. If data is XShards, each partition is a dictionary of {‘x’: feature}, where feature is a numpy array or a list of numpy arrays.
batch_size – Batch size used for inference. Default: 4.
feature_cols – Feature column name(s) of data. Only used when data is a Spark DataFrame. Default: “features”.
sample_preprocessing – Used when data is a Spark DataFrame. If the user wants to change the default feature_preprocessing specified in Estimator.from_bigdl, the user can pass a new sample_preprocessing method.
- Returns
predicted result. If input data is Spark DataFrame, the predict result is a DataFrame which includes original columns plus ‘prediction’ column. The ‘prediction’ column can be FloatType, VectorUDT or Array of VectorUDT depending on model outputs shape. If input data is an XShards, the predict result is a XShards, each partition of the XShards is a dictionary of {‘prediction’: result}, where result is a numpy array or a list of numpy arrays.
- save(model_path)[source]¶
Save the BigDL model to model_path
- Parameters
model_path – path to save the trained model.
- Returns
- class zoo.orca.learn.bigdl.estimator.Estimator[source]¶
Bases:
object
- static from_bigdl(*, model, loss=None, optimizer=None, metrics=None, feature_preprocessing=None, label_preprocessing=None, model_dir=None)[source]¶
Construct an Estimator with BigDL model, loss function and Preprocessing for feature and label data.
- Parameters
model – BigDL Model to be trained.
loss – BigDL criterion.
optimizer – BigDL optimizer.
metrics – An evaluation metric or a list of evaluation metrics.
feature_preprocessing –
Used when data in fit and predict is a Spark DataFrame. The param converts the data in feature column to a Tensor or to a Sample directly. It expects a List of Int as the size of the converted Tensor, or a Preprocessing[F, Tensor[T]]
If a List of Int is set as feature_preprocessing, it can only handle the case that feature column contains the following data types: Float, Double, Int, Array[Float], Array[Double], Array[Int] and MLlib Vector. The feature data are converted to Tensors with the specified sizes before sending to the model. Internally, a SeqToTensor is generated according to the size, and used as the feature_preprocessing.
Alternatively, user can set feature_preprocessing as Preprocessing[F, Tensor[T]] that transforms the feature data to a Tensor[T]. Some pre-defined Preprocessing are provided in package zoo.feature. Multiple Preprocessing can be combined as a ChainedPreprocessing.
The feature_preprocessing will also be copied to the generated NNModel and applied to feature column during transform.
label_preprocessing – Used when data in fit and predict is a Spark DataFrame. Similar to feature_preprocessing, but applies to label data.
model_dir – The path to save model. During the training, if checkpoint_trigger is defined and triggered, the model will be saved to model_dir.
- Returns
orca.learn.tf.estimator¶
- class zoo.orca.learn.tf.estimator.Estimator[source]¶
Bases:
zoo.orca.learn.spark_estimator.Estimator
- clear_gradient_clipping()[source]¶
Clear gradient clipping parameters. In this case, gradient clipping will not be applied. In order to take effect, it needs to be called before fit.
- Returns
- evaluate(data, batch_size=32, feature_cols=None, label_cols=None, auto_shard_files=False)[source]¶
Evaluate model.
- Parameters
data – evaluation data. It can be XShards, Spark DataFrame, tf.data.Dataset. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a tuple of numpy arrays. If data is tf.data.Dataset, each element is a tuple of input tensors.
batch_size – batch size per thread.
feature_cols – feature column names if data is a Spark DataFrame or XShards of Pandas DataFrame.
label_cols – label column names if data is a Spark DataFrame or XShards of Pandas DataFrame.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
- Returns
evaluation result as a dictionary of {‘metric name’: metric value}
- fit(data, epochs, batch_size=32, feature_cols=None, label_cols=None, validation_data=None, session_config=None, checkpoint_trigger=None, auto_shard_files=False)[source]¶
Train the model with train data.
- Parameters
data – train data. It can be XShards, Spark DataFrame, tf.data.Dataset. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a tuple of numpy arrays.
epochs – number of epochs to train.
batch_size – total batch size for each iteration. Default: 32.
feature_cols – feature column names if train data is Spark DataFrame or XShards of Pandas Dataframe.
label_cols – label column names if train data is Spark DataFrame or XShards of Pandas Dataframe.
validation_data – validation data. Validation data type should be the same as train data.
session_config – tensorflow session configuration for training. Should be object of tf.ConfigProto
checkpoint_trigger – when to trigger checkpoint during training. Should be a zoo.orca.learn.trigger, like EveryEpoch(), SeveralIteration(num_iterations), etc.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
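The difference between sharding on files and sharding on records can be illustrated with a small sketch. The round-robin assignment and the helper names below are assumptions made for illustration; the actual policy is decided inside TensorFlow and Orca and may differ.

```python
from typing import List, Sequence


def shard_by_files(files: Sequence[str], num_workers: int, worker: int) -> List[str]:
    """File-level sharding: each worker reads a disjoint subset of the files."""
    return [f for i, f in enumerate(files) if i % num_workers == worker]


def shard_by_records(records: Sequence[int], num_workers: int, worker: int) -> List[int]:
    """Record-level sharding: each worker scans all records and keeps every n-th one."""
    return [r for i, r in enumerate(records) if i % num_workers == worker]


# Hypothetical inputs: four input files, or ten records from a single source.
files = ["part-0", "part-1", "part-2", "part-3"]
records = list(range(10))

worker0_files = shard_by_files(files, num_workers=2, worker=0)
worker1_records = shard_by_records(records, num_workers=2, worker=1)
```

File-level sharding avoids having every worker read the whole dataset, which is why it tends to be preferable when the dataset is backed by several files.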
- static from_graph(*, inputs, outputs=None, labels=None, loss=None, optimizer=None, metrics=None, clip_norm=None, clip_value=None, updates=None, sess=None, model_dir=None, backend='bigdl')[source]¶
Create an Estimator for a TensorFlow graph.
- Parameters
inputs – input tensorflow tensors.
outputs – output tensorflow tensors.
labels – label tensorflow tensors.
loss – The loss tensor of the TensorFlow model, should be a scalar
optimizer – tensorflow optimization method.
clip_norm – float >= 0. Gradients will be clipped when their L2 norm exceeds this value.
clip_value – a float >= 0 or a tuple of two floats. If clip_value is a float, gradients will be clipped when their absolute value exceeds this value. If clip_value is a tuple of two floats, gradients will be clipped when their value is less than clip_value[0] or larger than clip_value[1].
metrics – metric tensor.
updates – Collection for the update ops. For example, when performing batch normalization, the moving_mean and moving_variance should be updated and the user should add tf.GraphKeys.UPDATE_OPS to updates. Default is None.
sess – the current TensorFlow Session. If you want to use a pre-trained model, you should use the Session to load the pre-trained variables and pass it to the estimator.
model_dir – location to save model checkpoint and summaries.
backend – backend for estimator. Now it only can be “bigdl”.
- Returns
an Estimator object.
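The clip_norm and clip_value semantics described above can be sketched in plain Python. This is an illustration on a single flat list of gradient values, not the estimator's implementation; the text does not say whether the norm is computed per tensor or globally, so the sketch assumes one flat gradient.

```python
import math
from typing import List, Sequence, Tuple, Union


def clip_by_value(
    grads: Sequence[float], clip_value: Union[float, Tuple[float, float]]
) -> List[float]:
    """Clamp each element to [-c, c] for a float c, or to [low, high] for a pair."""
    if isinstance(clip_value, (tuple, list)):
        low, high = clip_value
    else:
        low, high = -clip_value, clip_value
    return [min(max(g, low), high) for g in grads]


def clip_by_norm(grads: Sequence[float], clip_norm: float) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_norm or norm == 0.0:
        return list(grads)
    scale = clip_norm / norm
    return [g * scale for g in grads]
```

With a float, `clip_by_value` is symmetric around zero; with a pair, the two bounds are independent, which matches the two cases described for clip_value.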
- static from_keras(keras_model, metrics=None, model_dir=None, optimizer=None, backend='bigdl')[source]¶
Create an Estimator from a tensorflow.keras model. The model must be compiled.
- Parameters
keras_model – the tensorflow.keras model, which must be compiled.
metrics – user specified metric.
model_dir – location to save model checkpoint and summaries.
optimizer – an optional orca optimMethod that will override the optimizer in keras_model.compile
backend – backend for estimator. Now it only can be “bigdl”.
- Returns
an Estimator object.
- get_train_summary(tag=None)[source]¶
Get the scalar from the model train summary. Returns a list of summary records, each of the form [iteration_number, scalar_value, timestamp].
- Parameters
tag – The name of the scalar to retrieve.
- get_validation_summary(tag=None)[source]¶
Get the scalar from the model validation summary. Returns a list of summary records, each of the form [iteration_number, scalar_value, timestamp].
Note: The metric name and the summary tag may differ. Use the following table to find the tag to pass: the left side is the metric used during compile, and the right side is the tag to pass.
‘Accuracy’ | ‘Top1Accuracy’
‘BinaryAccuracy’ | ‘Top1Accuracy’
‘CategoricalAccuracy’ | ‘Top1Accuracy’
‘SparseCategoricalAccuracy’ | ‘Top1Accuracy’
‘AUC’ | ‘AucScore’
‘HitRatio’ | ‘HitRate@k’ (k is Top-k)
‘Loss’ | ‘Loss’
‘MAE’ | ‘MAE’
‘NDCG’ | ‘NDCG’
‘TFValidationMethod’ | ‘${name + ” ” + valMethod.toString()}’
‘Top5Accuracy’ | ‘Top5Accuracy’
‘TreeNNAccuracy’ | ‘TreeNNAccuracy()’
‘MeanAveragePrecision’ | ‘MAP@k’ (k is Top-k) (BigDL)
‘MeanAveragePrecision’ | ‘PascalMeanAveragePrecision’ (Zoo)
‘StatelessMetric’ | ‘${name}’
- Parameters
tag – The name of the scalar to retrieve.
- load(checkpoint, **kwargs)[source]¶
Load existing checkpoint
- Parameters
checkpoint – Path to the existing checkpoint.
- Returns
- static load_keras_model(path)[source]¶
Create Estimator by loading an existing keras model (with weights) from HDF5 file.
- Parameters
path – String. The path to the pre-defined model.
- Returns
Orca TF Estimator.
- load_keras_weights(filepath, by_name=False)[source]¶
Load tensorflow keras model weights into this estimator.
- Parameters
filepath – path to the saved keras model weights.
by_name – Boolean, whether to load weights by name or by topological order. Only topological loading is supported for weight files in TensorFlow format.
- load_latest_orca_checkpoint(path)[source]¶
Load latest Orca checkpoint under specified directory.
- Parameters
path – directory containing Orca checkpoint files.
- load_orca_checkpoint(path, version)[source]¶
Load specified Orca checkpoint.
- Parameters
path – checkpoint directory which contains model.* and optimMethod-TFParkTraining.* files.
version – checkpoint version, which is the suffix of the model.* file; e.g., for the file model.4, the version is 4.
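Since the version is the numeric suffix of a model.* file, the available versions can be read off a directory listing. The sketch below uses a hypothetical list of file names (in practice they might come from `os.listdir(path)`), and `latest_checkpoint_version` is an illustrative helper, not part of Orca; `load_latest_orca_checkpoint` already covers the common case of loading the newest checkpoint.

```python
import re
from typing import Iterable, Optional


def latest_checkpoint_version(file_names: Iterable[str]) -> Optional[int]:
    """Return the largest N among files named model.N, or None if there are none."""
    versions = []
    for name in file_names:
        match = re.fullmatch(r"model\.(\d+)", name)
        if match:
            versions.append(int(match.group(1)))
    return max(versions) if versions else None


# Hypothetical checkpoint directory listing.
names = [
    "model.2",
    "optimMethod-TFParkTraining.2",
    "model.4",
    "optimMethod-TFParkTraining.4",
]
version = latest_checkpoint_version(names)
```

The resulting integer is the kind of value the `version` parameter expects.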
- predict(data, batch_size=4, feature_cols=None, auto_shard_files=False)[source]¶
Predict input data
- Parameters
data – data to be predicted. It can be XShards, Spark DataFrame. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature}, where feature is a numpy array or a tuple of numpy arrays.
batch_size – batch size per thread
feature_cols – list of feature column names if input data is Spark DataFrame or XShards of Pandas DataFrame.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
- Returns
predicted result. If input data is XShards or tf.data.Dataset, the predict result is a XShards, each partition of the XShards is a dictionary of {‘prediction’: result}, where the result is a numpy array or a list of numpy arrays. If input data is Spark DataFrame, the predict result is a DataFrame which includes original columns plus ‘prediction’ column. The ‘prediction’ column can be FloatType, VectorUDT or Array of VectorUDT depending on model outputs shape.
- save(model_path)[source]¶
Save model to model_path
- Parameters
model_path – path to save the trained model.
- Returns
- save_keras_model(path, overwrite=True)[source]¶
Save tensorflow keras model in this estimator.
- Parameters
path – keras model save path.
overwrite – Whether to silently overwrite any existing file at the target location.
- save_keras_weights(filepath, overwrite=True, save_format=None)[source]¶
Save tensorflow keras model weights in this estimator.
- Parameters
filepath – keras model weights save path.
overwrite – Whether to silently overwrite any existing file at the target location.
save_format – Either ‘tf’ or ‘h5’. A filepath ending in ‘.h5’ or ‘.keras’ will default to HDF5 if save_format is None. Otherwise None defaults to ‘tf’.
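The default rule for save_format stated above can be written out as a small function. This is a sketch of the rule as documented here only; `resolve_save_format` is a hypothetical helper and does not reproduce any extra validation Keras may apply to conflicting combinations of path and format.

```python
from typing import Optional


def resolve_save_format(filepath: str, save_format: Optional[str] = None) -> str:
    """Return 'h5' or 'tf' following the documented default rule."""
    if save_format is not None:
        if save_format not in ("tf", "h5"):
            raise ValueError(f"save_format must be 'tf' or 'h5', got {save_format!r}")
        return save_format
    # With save_format=None, the file extension decides the format.
    return "h5" if filepath.endswith((".h5", ".keras")) else "tf"
```

For example, a path such as `weights.h5` resolves to HDF5 when save_format is left as None, while a path without one of those extensions resolves to the TensorFlow format.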
- save_tf_checkpoint(path)[source]¶
Save tensorflow checkpoint in this estimator.
- Parameters
path – tensorflow checkpoint path.
- class zoo.orca.learn.tf.estimator.KerasEstimator(keras_model, metrics, model_dir, optimizer)[source]¶
Bases:
zoo.orca.learn.tf.estimator.Estimator
- clear_gradient_clipping()[source]¶
Clear gradient clipping parameters. In this case, gradient clipping will not be applied. In order to take effect, it needs to be called before fit.
- Returns
- evaluate(data, batch_size=32, feature_cols=None, label_cols=None, auto_shard_files=False)[source]¶
Evaluate model.
- Parameters
data – evaluation data. It can be XShards, Spark DataFrame, tf.data.Dataset. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a tuple of numpy arrays. If data is tf.data.Dataset, each element is [feature tensor tuple, label tensor tuple]
batch_size – batch size per thread.
feature_cols – feature column names if data is a Spark DataFrame or XShards of Pandas DataFrame.
label_cols – label column names if data is a Spark DataFrame or XShards of Pandas DataFrame.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
- Returns
evaluation result as a dictionary of {‘metric name’: metric value}
- fit(data, epochs=1, batch_size=32, feature_cols=None, label_cols=None, validation_data=None, session_config=None, checkpoint_trigger=None, auto_shard_files=True)[source]¶
Train this keras model with train data.
- Parameters
data – train data. It can be XShards, Spark DataFrame, tf.data.Dataset. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a tuple of numpy arrays. If data is tf.data.Dataset, each element is [feature tensor tuple, label tensor tuple]
epochs – number of epochs to train.
batch_size – total batch size for each iteration.
feature_cols – feature column names if train data is Spark DataFrame or XShards of Pandas DataFrame.
label_cols – label column names if train data is Spark DataFrame or XShards of Pandas DataFrame.
validation_data – validation data. Validation data type should be the same as train data.
session_config – tensorflow session configuration for training. Should be object of tf.ConfigProto
checkpoint_trigger – when to trigger checkpoint during training. Should be a zoo.orca.learn.trigger, like EveryEpoch(), SeveralIteration(num_iterations), etc.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is True.
- load_keras_weights(filepath, by_name=False)[source]¶
Load tensorflow keras model weights into this estimator.
- Parameters
filepath – path to the saved keras model weights.
by_name – Boolean, whether to load weights by name or by topological order. Only topological loading is supported for weight files in TensorFlow format.
- predict(data, batch_size=4, feature_cols=None, auto_shard_files=False)[source]¶
Predict input data
- Parameters
data – data to be predicted. It can be XShards, Spark DataFrame, or tf.data.Dataset. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature}, where feature is a numpy array or a tuple of numpy arrays. If data is tf.data.Dataset, each element is feature tensor tuple
batch_size – batch size per thread
feature_cols – list of feature column names if input data is Spark DataFrame or XShards of Pandas DataFrame.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
- Returns
predicted result. If input data is XShards or tf.data.Dataset, the predict result is also a XShards, and the schema for each result is: {‘prediction’: predicted numpy array or list of predicted numpy arrays}. If input data is Spark DataFrame, the predict result is a DataFrame which includes original columns plus ‘prediction’ column. The ‘prediction’ column can be FloatType, VectorUDT or Array of VectorUDT depending on model outputs shape.
- save(model_path, overwrite=True)[source]¶
Save model to model_path
- Parameters
model_path – path to save the trained model.
overwrite – Whether to silently overwrite any existing file at the target location.
- Returns
- save_keras_model(path, overwrite=True)[source]¶
Save tensorflow keras model in this estimator.
- Parameters
path – keras model save path.
overwrite – Whether to silently overwrite any existing file at the target location.
- save_keras_weights(filepath, overwrite=True, save_format=None)[source]¶
Save tensorflow keras model weights in this estimator.
- Parameters
filepath – keras model weights save path.
overwrite – Whether to silently overwrite any existing file at the target location.
save_format – Either ‘tf’ or ‘h5’. A filepath ending in ‘.h5’ or ‘.keras’ will default to HDF5 if save_format is None. Otherwise None defaults to ‘tf’.
- class zoo.orca.learn.tf.estimator.TensorFlowEstimator(*, inputs, outputs, labels, loss, optimizer, clip_norm, clip_value, metrics, updates, sess, model_dir)[source]¶
Bases:
zoo.orca.learn.tf.estimator.Estimator
- evaluate(data, batch_size=32, feature_cols=None, label_cols=None, auto_shard_files=False)[source]¶
Evaluate model.
- Parameters
data – evaluation data. It can be XShards, Spark DataFrame, tf.data.Dataset. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a tuple of numpy arrays. If data is tf.data.Dataset, each element is a tuple of input tensors.
batch_size – batch size per thread.
feature_cols – feature column names if data is a Spark DataFrame or XShards of Pandas DataFrame.
label_cols – label column names if data is a Spark DataFrame or XShards of Pandas DataFrame.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
- Returns
evaluation result as a dictionary of {‘metric name’: metric value}
- fit(data, epochs=1, batch_size=32, feature_cols=None, label_cols=None, validation_data=None, session_config=None, checkpoint_trigger=None, auto_shard_files=False, feed_dict=None)[source]¶
Train this graph model with train data.
- Parameters
data – train data. It can be XShards, Spark DataFrame, tf.data.Dataset. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a tuple of numpy arrays. If data is tf.data.Dataset, each element is a tuple of input tensors.
epochs – number of epochs to train.
batch_size – total batch size for each iteration.
feature_cols – feature column names if train data is Spark DataFrame or XShards of Pandas Dataframe.
label_cols – label column names if train data is Spark DataFrame or XShards of Pandas DataFrame.
validation_data – validation data. Validation data type should be the same as train data.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
session_config – tensorflow session configuration for training. Should be object of tf.ConfigProto
feed_dict – a dictionary. The key is TensorFlow tensor, usually a placeholder, the value of the dictionary is a tuple of two elements. The first one of the tuple is the value to feed to the tensor in training phase and the second one is the value to feed to the tensor in validation phase.
checkpoint_trigger – when to trigger checkpoint during training. Should be a zoo.orca.learn.trigger, like EveryEpoch(), SeveralIteration(num_iterations), etc.
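The feed_dict structure described above maps each tensor to a (training value, validation value) pair. A minimal sketch of how such a dictionary splits by phase follows; plain strings stand in for TensorFlow placeholders so that the example runs without TensorFlow, and `values_for_phase` is a hypothetical helper.

```python
from typing import Any, Dict, Hashable, Tuple


def values_for_phase(
    feed_dict: Dict[Hashable, Tuple[Any, Any]], training: bool
) -> Dict[Hashable, Any]:
    """Pick the first tuple element for training and the second for validation."""
    index = 0 if training else 1
    return {tensor: pair[index] for tensor, pair in feed_dict.items()}


# Strings stand in for placeholder tensors (an assumption made for this sketch).
feed_dict = {
    "is_training": (True, False),
    "dropout_keep_prob": (0.5, 1.0),
}
train_feed = values_for_phase(feed_dict, training=True)
val_feed = values_for_phase(feed_dict, training=False)
```

A typical use is a flag or dropout rate that must take one value while training and another while validating.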
- predict(data, batch_size=4, feature_cols=None, auto_shard_files=False)[source]¶
Predict input data
- Parameters
data – data to be predicted. It can be XShards, Spark DataFrame. If data is XShards, each partition can be Pandas Dataframe or a dictionary of {‘x’: feature}, where feature is a numpy array or a tuple of numpy arrays.
batch_size – batch size per thread
feature_cols – list of feature column names if input data is Spark DataFrame or XShards of Pandas DataFrame.
auto_shard_files – whether to automatically detect if the dataset is file-based and apply sharding on files, otherwise sharding on records. Default is False.
- Returns
predicted result. If input data is XShards or tf.data.Dataset, the predict result is a XShards, each partition of the XShards is a dictionary of {‘prediction’: result}, where the result is a numpy array or a list of numpy arrays. If input data is Spark DataFrame, the predict result is a DataFrame which includes original columns plus ‘prediction’ column. The ‘prediction’ column can be FloatType, VectorUDT or Array of VectorUDT depending on model outputs shape.
- save(model_path)[source]¶
Save model to model_path
- Parameters
model_path – path to save the trained model.
- Returns
- save_tf_checkpoint(path)[source]¶
Save tensorflow checkpoint in this estimator.
- Parameters
path – tensorflow checkpoint path.
orca.learn.tf2.estimator¶
orca.learn.pytorch.estimator¶
- class zoo.orca.learn.pytorch.estimator.Estimator[source]¶
Bases:
object
- static from_torch(*, model, optimizer, loss=None, metrics=None, scheduler_creator=None, training_operator_cls=<class 'zoo.orca.learn.pytorch.training_operator.TrainingOperator'>, initialization_hook=None, config=None, scheduler_step_freq='batch', use_tqdm=False, workers_per_node=1, model_dir=None, backend='bigdl')[source]¶
Create an Estimator for torch.
- Parameters
model – PyTorch model if backend=”bigdl”, PyTorch model creator function if backend=”horovod” or “torch_distributed”
optimizer – Orca or PyTorch optimizer if backend=”bigdl”, PyTorch optimizer creator function if backend=”horovod” or “torch_distributed”
loss – PyTorch loss if backend=”bigdl”, PyTorch loss creator function if backend=”horovod” or “torch_distributed”
metrics – Orca validation methods for evaluate.
scheduler_creator – parameter for horovod and torch_distributed backends. A learning rate scheduler wrapping the optimizer. You will need to set scheduler_step_freq="epoch" for the scheduler to be incremented correctly.
config – parameter for horovod and torch_distributed backends. Config dict to create model, optimizer, loss and data.
scheduler_step_freq – parameter for horovod and torch_distributed backends. “batch”, “epoch” or None. This will determine when scheduler.step is called. If “batch”, step will be called after every optimizer step. If “epoch”, step will be called after one pass of the DataLoader. If a scheduler is passed in, this value is expected to not be None.
use_tqdm – parameter for horovod and torch_distributed backends. You can monitor training progress if use_tqdm=True.
workers_per_node – parameter for horovod and torch_distributed backends. worker number on each node. default: 1.
model_dir – parameter for bigdl backend. The path to save model. During the training, if checkpoint_trigger is defined and triggered, the model will be saved to model_dir.
backend – You can choose “horovod”, “torch_distributed” or “bigdl” as backend. Default: bigdl.
- Returns
an Estimator object.
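The effect of scheduler_step_freq can be made concrete by counting how often scheduler.step would run. This is a sketch under two assumptions: one optimizer step per batch, and no scheduler stepping when the value is None. `count_scheduler_steps` is a hypothetical helper, not an Orca function.

```python
from typing import Optional


def count_scheduler_steps(
    epochs: int, batches_per_epoch: int, scheduler_step_freq: Optional[str]
) -> int:
    """Count how many times scheduler.step would be called over a training run."""
    if scheduler_step_freq == "batch":
        # Called after every optimizer step (assumed: one step per batch).
        return epochs * batches_per_epoch
    if scheduler_step_freq == "epoch":
        # Called once after each pass over the DataLoader.
        return epochs
    if scheduler_step_freq is None:
        # Assumption: the scheduler is never stepped.
        return 0
    raise ValueError(f"Unsupported scheduler_step_freq: {scheduler_step_freq!r}")
```

The difference matters when choosing scheduler hyperparameters: a decay interval expressed in epochs only behaves as intended if the scheduler is stepped per epoch.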
- class zoo.orca.learn.pytorch.estimator.PyTorchRayEstimator(*, model_creator, optimizer_creator, loss_creator=None, metrics=None, scheduler_creator=None, training_operator_cls=<class 'zoo.orca.learn.pytorch.training_operator.TrainingOperator'>, initialization_hook=None, config=None, scheduler_step_freq='batch', use_tqdm=False, backend='torch_distributed', workers_per_node=1)[source]¶
Bases:
zoo.orca.learn.ray_estimator.Estimator
- evaluate(data, batch_size=32, num_steps=None, profile=False, info=None, feature_cols=None, label_cols=None)[source]¶
Evaluates a PyTorch model given validation data. Note that only accuracy for classification with zero-based label is supported by default. You can override validate_batch in TrainingOperator for other metrics.
Calls TrainingOperator.validate() on N parallel workers simultaneously under the hood.
- Parameters
data – An instance of SparkXShards, a Spark DataFrame or a function that takes config and batch_size as argument and returns a PyTorch DataLoader for validation.
batch_size – The number of samples per batch for each worker. Default is 32. The total batch size would be batch_size*workers_per_node*num_nodes. If your validation data is a function, you can set batch_size to be the input batch_size of the function for the PyTorch DataLoader.
num_steps – The number of batches to compute the validation results on. This corresponds to the number of times TrainingOperator.validate_batch is called.
profile – Boolean. Whether to return time stats for the training procedure. Default is False.
info – An optional dictionary that can be passed to the TrainingOperator for validate.
feature_cols – feature column names if data is a Spark DataFrame.
label_cols – label column names if data is a Spark DataFrame.
- Returns
A dictionary of metrics for the given data, including validation accuracy and loss. You can also provide custom metrics by passing in a custom training_operator_cls when creating the Estimator.
- fit(data, epochs=1, batch_size=32, profile=False, reduce_results=True, info=None, feature_cols=None, label_cols=None)[source]¶
Trains a PyTorch model given training data for several epochs.
Calls TrainingOperator.train_epoch() on N parallel workers simultaneously under the hood.
- Parameters
data – An instance of SparkXShards, a Spark DataFrame or a function that takes config and batch_size as argument and returns a PyTorch DataLoader for training.
epochs – The number of epochs to train the model. Default is 1.
batch_size – The number of samples per batch for each worker. Default is 32. The total batch size would be batch_size*workers_per_node*num_nodes. If your training data is a function, you can set batch_size to be the input batch_size of the function for the PyTorch DataLoader.
profile – Boolean. Whether to return time stats for the training procedure. Default is False.
reduce_results – Boolean. Whether to average all metrics across all workers into one dict. If a metric is a non-numerical value (or nested dictionaries), one value will be randomly selected among the workers. If False, returns a list of dicts for all workers. Default is True.
info – An optional dictionary that can be passed to the TrainingOperator for train_epoch and train_batch.
feature_cols – Feature column names. Only used when data is a Spark DataFrame.
label_cols – Label column names. Only used when data is a Spark DataFrame.
- Returns
A list of metric dictionaries, one for every training epoch. If reduce_results is False, this will return a nested list of metric dictionaries whose length will be equal to the total number of workers. You can also provide custom metrics by passing in a custom training_operator_cls when creating the Estimator.
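The reduce_results behaviour described above (average numerical metrics across workers, pick one worker's value at random otherwise) can be illustrated with a small pure-Python sketch. This is not the Orca implementation: reduce_worker_stats and the metric names in the sample data are assumptions made up for the example.

```python
import random
from numbers import Number


def reduce_worker_stats(worker_stats, rng=random):
    """Average numeric metrics across workers; pick one value otherwise.

    `worker_stats` holds one metrics dict per worker. Hypothetical helper
    that mirrors the documented reduce_results=True semantics.
    """
    reduced = {}
    for key in worker_stats[0]:
        values = [stats[key] for stats in worker_stats]
        numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if numeric:
            reduced[key] = sum(values) / len(values)
        else:
            # Non-numerical (or nested) values cannot be averaged, so keep
            # one worker's value chosen at random.
            reduced[key] = rng.choice(values)
    return reduced


# Made-up per-worker results for a single epoch.
per_worker = [
    {"train_loss": 0.50, "num_samples": 128, "device": "cpu"},
    {"train_loss": 0.30, "num_samples": 128, "device": "cpu"},
]
print(reduce_worker_stats(per_worker))
```

With reduce_results=False no such merging takes place and the per-worker dictionaries are returned as they are, which is useful when one worker behaves differently from the rest.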
- load(checkpoint)[source]¶
Loads the Estimator state (including model and optimizer) from the provided checkpoint.
- Parameters
checkpoint – (str) Path to target checkpoint file.
- predict(data, batch_size=32, feature_cols=None, profile=False)[source]¶
Use this PyTorch model to make predictions on the data.
- Parameters
data – An instance of SparkXShards or a Spark DataFrame
batch_size – The number of samples per batch for each worker. Default is 32.
profile – Boolean. Whether to return time stats for the prediction procedure. Default is False.
feature_cols – feature column names if data is a Spark DataFrame.
- Returns
A SparkXShards that contains the predictions with key “prediction” in each shard.
- class zoo.orca.learn.pytorch.estimator.PyTorchSparkEstimator(model, loss, optimizer, metrics=None, model_dir=None, bigdl_type='float')[source]¶
Bases: zoo.orca.learn.spark_estimator.Estimator
- clear_gradient_clipping()[source]¶
Clear gradient clipping parameters. In this case, gradient clipping will not be applied. In order to take effect, it needs to be called before fit.
- Returns
- evaluate(data, batch_size=32, feature_cols=None, label_cols=None, validation_metrics=None)[source]¶
Evaluate model.
- Parameters
data – Evaluation data. It can be an XShards, a Spark DataFrame, a PyTorch DataLoader or a PyTorch DataLoader creator function. If data is an XShards, each partition is a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a list of numpy arrays.
batch_size – Batch size used for evaluation. Only used when data is an XShards. Default: 32.
feature_cols – Feature column name(s) of data. Only used when data is a Spark DataFrame. Default: None.
label_cols – Label column name(s) of data. Only used when data is a Spark DataFrame. Default: None.
validation_metrics – Orca validation metrics to be computed on data.
- Returns
validation results.
- fit(data, epochs=1, batch_size=32, feature_cols=None, label_cols=None, validation_data=None, checkpoint_trigger=None)[source]¶
Train this torch model with train data.
- Parameters
data – Training data. It can be an XShards, a Spark DataFrame, a PyTorch DataLoader or a PyTorch DataLoader creator function. If data is an XShards, each partition is a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a list of numpy arrays.
epochs – Number of epochs to train the model. Default: 1.
batch_size – Batch size used for training. Only used when data is an XShards. Default: 32.
feature_cols – Feature column name(s) of data. Only used when data is a Spark DataFrame. Default: None.
label_cols – Label column name(s) of data. Only used when data is a Spark DataFrame. Default: None.
validation_data – Validation data. XShards, PyTorch DataLoader and PyTorch DataLoader creator function are supported. If validation_data is an XShards, each partition is a dictionary of {‘x’: feature, ‘y’: label}, where feature(label) is a numpy array or a list of numpy arrays.
checkpoint_trigger – Orca Trigger to set a checkpoint.
- Returns
The trained estimator object.
- get_train_summary(tag=None)[source]¶
Get the scalar from the model train summary. Returns a list of summary data of the form [iteration_number, scalar_value, timestamp].
- Parameters
tag – String. The name of the scalar to retrieve.
- get_validation_summary(tag=None)[source]¶
Get the scalar from the model validation summary. Returns a list of summary data of the form [iteration_number, scalar_value, timestamp].
Note: The metric name and the tag may not be the same. Look up the tag to pass in the following table. The left side is the metric used during compile; the right side is the tag you should pass.
‘Accuracy’ | ‘Top1Accuracy’
‘BinaryAccuracy’ | ‘Top1Accuracy’
‘CategoricalAccuracy’ | ‘Top1Accuracy’
‘SparseCategoricalAccuracy’ | ‘Top1Accuracy’
‘AUC’ | ‘AucScore’
‘HitRatio’ | ‘HitRate@k’ (k is Top-k)
‘Loss’ | ‘Loss’
‘MAE’ | ‘MAE’
‘NDCG’ | ‘NDCG’
‘TFValidationMethod’ | ‘${name + " " + valMethod.toString()}’
‘Top5Accuracy’ | ‘Top5Accuracy’
‘TreeNNAccuracy’ | ‘TreeNNAccuracy()’
‘MeanAveragePrecision’ | ‘MAP@k’ (k is Top-k) (BigDL)
‘MeanAveragePrecision’ | ‘PascalMeanAveragePrecision’ (Zoo)
‘StatelessMetric’ | ‘${name}’
- Parameters
tag – String. The name of the scalar to retrieve.
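Because the tag differs from the metric name for several metrics, it can help to keep the metric/tag pairs listed above in a lookup table. The sketch below is only an illustration: METRIC_TO_TAG, summary_tag and split_summary are hypothetical names, the sample records are made up, and entries whose tag depends on k or on a runtime name are left out.

```python
# Metric name used at compile time -> tag to pass to get_validation_summary,
# copied from the metric/tag pairs listed above. Entries whose tag depends
# on k or on a runtime name are omitted from this sketch.
METRIC_TO_TAG = {
    "Accuracy": "Top1Accuracy",
    "BinaryAccuracy": "Top1Accuracy",
    "CategoricalAccuracy": "Top1Accuracy",
    "SparseCategoricalAccuracy": "Top1Accuracy",
    "AUC": "AucScore",
    "Loss": "Loss",
    "MAE": "MAE",
    "NDCG": "NDCG",
    "Top5Accuracy": "Top5Accuracy",
    "TreeNNAccuracy": "TreeNNAccuracy()",
}


def summary_tag(metric_name):
    """Return the summary tag for a metric, or raise if it is not tabulated."""
    try:
        return METRIC_TO_TAG[metric_name]
    except KeyError:
        raise ValueError(f"no fixed tag known for metric {metric_name!r}") from None


def split_summary(records):
    """Split [iteration_number, scalar_value, timestamp] rows into columns."""
    iterations = [row[0] for row in records]
    values = [row[1] for row in records]
    timestamps = [row[2] for row in records]
    return iterations, values, timestamps


# Made-up records in the documented row layout.
records = [[100, 0.81, 1600000000.0], [200, 0.86, 1600000060.0]]
iterations, values, _ = split_summary(records)
print(summary_tag("Accuracy"), iterations, values)
```

The string returned by summary_tag is what would be passed as the tag argument, and the separated columns are convenient for plotting a metric against the iteration number.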
- load(checkpoint, loss=None)[source]¶
Load existing model or checkpoint
- Parameters
checkpoint – Path to the existing model or checkpoint.
loss – PyTorch loss function.
- Returns
- load_latest_orca_checkpoint(path)[source]¶
Load latest Orca checkpoint under specified directory.
- Parameters
path – directory containing Orca checkpoint files.
- load_orca_checkpoint(path, version, prefix=None)[source]¶
Load existing checkpoint
- Parameters
path – Path to the existing checkpoint.
version – Checkpoint version, which is the suffix of the model.* file. For example, for the file model.4, the version is 4.
prefix – optimMethod prefix, for example ‘optimMethod-TorchModelf53bddcc’
- Returns
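The version argument is the numeric suffix of a model.* file, so the versions available in a checkpoint directory can be read off the file names. The following is a minimal sketch under that assumption; checkpoint_versions is a hypothetical helper, the directory listing is made up, and this is not necessarily how load_latest_orca_checkpoint selects a checkpoint internally.

```python
import re

# Matches file names such as "model.4" and captures the version suffix.
_MODEL_FILE = re.compile(r"^model\.(\d+)$")


def checkpoint_versions(filenames):
    """Return the sorted versions found among `model.<version>` file names."""
    versions = []
    for name in filenames:
        match = _MODEL_FILE.match(name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


# Hypothetical directory listing; the optimMethod file name is invented to
# follow the prefix example given for the `prefix` parameter.
listing = ["model.4", "model.12", "optimMethod-TorchModelf53bddcc.12", "notes.txt"]
versions = checkpoint_versions(listing)
print(versions, "latest:", versions[-1])
```

Parsing the suffix as an integer matters here: sorting the names as strings would place model.12 before model.4.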
- predict(data, batch_size=4, feature_cols=None)[source]¶
Predict input data.
- Parameters
data – Data to be predicted. It can be an XShards or a Spark DataFrame. If it is an XShards, each partition is a dictionary of {‘x’: feature}, where feature is a numpy array or a list of numpy arrays.
batch_size – Batch size used for inference. Default: 4.
feature_cols – Feature column name(s) of data. Only used when data is a Spark DataFrame. Default: None.
- Returns
Predicted result. The result is an XShards; each partition of the XShards is a dictionary of {‘prediction’: result}, where result is a numpy array or a list of numpy arrays.
- save(model_path)[source]¶
Save is not supported in PyTorchSparkEstimator.
- Parameters
model_path – path to save the trained model.
- Returns
orca.learn.openvino.estimator¶
- class zoo.orca.learn.openvino.estimator.OpenvinoEstimator(*, model_path, batch_size=0)[source]¶
Bases: zoo.orca.learn.spark_estimator.Estimator
- evaluate(data, batch_size=32, feature_cols=None, label_cols=None)[source]¶
evaluate is not supported in OpenvinoEstimator.
- fit(data, epochs, batch_size=32, feature_cols=None, label_cols=None, validation_data=None, checkpoint_trigger=None)[source]¶
fit is not supported in OpenvinoEstimator.
- get_validation_summary(tag=None)[source]¶
get_validation_summary is not supported in OpenvinoEstimator.
- load(model_path, batch_size=0)[source]¶
Load an OpenVINO model.
- Parameters
model_path – String. The file path to the OpenVINO IR xml file.
batch_size – Int. The batch size to set. Default is 0 (use the default batch size).
- Returns
- load_latest_orca_checkpoint(path)[source]¶
load_latest_orca_checkpoint is not supported in OpenvinoEstimator.
- load_orca_checkpoint(path, version)[source]¶
load_orca_checkpoint is not supported in OpenvinoEstimator.
- predict(data)[source]¶
Predict input data.
- Parameters
data – Data to be predicted. XShards, numpy array and list of numpy arrays are supported. If data is an XShards, each partition is a dictionary of {‘x’: feature}, where feature is a numpy array or a list of numpy arrays.
- Returns
Predicted result. If the input data is an XShards, the result is an XShards in which each partition is a dictionary of {‘prediction’: result}, where result is a numpy array or a list of numpy arrays. If the input data is a numpy array or a list of numpy arrays, the result is a numpy array or a list of numpy arrays.
- set_constant_gradient_clipping(min, max)[source]¶
set_constant_gradient_clipping is not supported in OpenvinoEstimator.