This chapter focuses on considerations related to data processing in network migration. For basic data processing, please refer to:
Optimizing the Data Processing
The whole basic process of data construction consists of two main aspects: dataset loading and data augmentation.
MindSpore provides interfaces for loading many common datasets. The most used ones are as follows:
Data Interfaces | Introduction |
---|---|
Cifar10Dataset | Cifar10 dataset read interface (you need to download the original bin file of Cifar10) |
MNIST | Minist handwritten digit recognition dataset (you need to download the original file) |
ImageFolderDataset | Dataset reading method of using file directories as the data organization format for classification (common for ImageNet) |
MindDataset | MindRecord data read interface |
GeneratorDataset | customized data interface |
FakeImageDataset | Constructing a fake image dataset |
For common dataset interfaces in different fields, refer to Loading interfaces to common datasets.
The basic process for constructing a custom Dataset object is as follows:
First create an iterator class and define three methods __init__
, __getitem__
and __len__
in that class.
import numpy as np
class MyDataset():
"""Self Defined dataset."""
def __init__(self, n):
self.data = []
self.label = []
for _ in range(n):
self.data.append(np.zeros((3, 4, 5)))
self.label.append(np.ones((1)))
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
data = self.data[idx]
label = self.label[idx]
return data, label
It is better not to use MindSpore operators in the iterator class, because it usually adds multiple threads in the data processing phase, which may cause problems.
The output of the iterator needs to be a numpy array.
The iterator must set the
__len__
method, and the returned result must be the real dataset size. Setting it larger will cause problems when gettingitem takes values.
Then use GeneratorDataset to encapsulate the iterator class:
import mindspore.dataset as ds
my_dataset = MyDataset(10)
# corresponding to torch.utils.data.DataLoader(my_dataset)
dataset = ds.GeneratorDataset(my_dataset, column_names=["data", "label"])
When customizing the dataset, you need to set a name for each output, such as column_names=["data", "label"]
above, indicating that the first output column of the iterator is data
and the second is label
. In the subsequent data augmentation and data iteration obtaining phases, the different columns can be processed separately by name.
All MindSpore data interfaces have some general attribute controls, and some of the common ones are described here:
Attributes | Introduction |
---|---|
num_samples(int) | Specify the total number of data samples |
shuffle(bool) | Whether to do random disruptions to the data |
sampler(Sampler) | Data sampler, customizing data disruption, allocation. sampler setting and num_shards , shard_id mutually exclusive |
num_shards(int) | Used in distributed scenarios to divide data into several parts, used in conjunction with shard_id
|
shard_id(int) | For distributed scenarios, taking nth data (n ranges from 0 to n-1, and n is the set num_shards ), used in conjunction with num_shards
|
num_parallel_workers(int) | Number of threads in parallel configuration |
For example:
import mindspore.dataset as ds
dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0)
print(dataset.get_dataset_size())
# 1000
dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0, num_samples=3)
print(dataset.get_dataset_size())
# 3
dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0,
num_shards=8, shard_id=0)
print(dataset.get_dataset_size())
# 1000 / 8 = 125
1000
3
125
MindSpore dataset object uses the map interface for data augmentation. See map Interface
map(operations, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, **kwargs)
Given a set of data augmentation lists, data augmentations are applied to the dataset objects in order.
Each data augmentation operation takes one or more data columns in the dataset object as input and outputs the result of the data augmentation as one or more data columns. The first data augmentation operation takes the specified columns in input_columns as input. If there are multiple data augmentation operations in the data augmentation list, the output columns of the previous data augmentation will be used as input columns for the next data augmentation.
The column name of output column in the last data augmentation is specified by output_columns
. If output_columns
is not specified, the output column name is the same as input_columns
.
The above introduction may be tedious, but in short, map
is to do the operations specified in operations
on some columns of the dataset. Here operations
can be the data augmentation provided by MindSpore.
audio, text, vision, and transforms. For more details, refer to Data Transforms, which is also a method of python. You can use opencv, PIL, pandas and some other third party methods, like loading dataset. Don't use MindSpore operators.
MindSpore also provides some common random augmentation methods: Auto augmentation. When using data augmentation specifically, it is best to read Optimizing the Data Processing in the recommended order.
At the end of data augmentation, you can use the batch operation to merge batch_size
pieces of consecutive data in the dataset into a single batch data. For details, please refer to batch. Note that the parameter drop_remainder
needs to be set to True during training and False during inference.
import mindspore.dataset as ds
dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0)\
.batch(32, drop_remainder=True)
print(dataset.get_dataset_size())
# 1000 // 32 = 31
dataset = ds.FakeImageDataset(num_images=1000, image_size=(32, 32, 3), num_classes=10, base_seed=0)\
.batch(32, drop_remainder=False)
print(dataset.get_dataset_size())
# ceil(1000 / 32) = 32
31
32
The batch operation can also use some augmentation operations within batch. For details, see YOLOv3.
MindSpore data objects are obtained iteratively in the following ways.
Creates an iterator based on the dataset object, and the output data is of dictionary type.
import mindspore.dataset as ds
dataset = ds.FakeImageDataset(num_images=20, image_size=(32, 32, 3), num_classes=10, base_seed=0)
dataset = dataset.batch(10, drop_remainder=True)
iterator = dataset.create_dict_iterator()
for data_dict in iterator:
for name in data_dict.keys():
print(name, data_dict[name].shape)
print("="*20)
image (10, 32, 32, 3)
label (10,)
====================
image (10, 32, 32, 3)
label (10,)
====================
Create an iterator based on the dataset object, and output data is a list of numpy.ndarray
data.
You can specify all column names and the order of the columns in the output by the parameter columns
. If columns is not specified, the order of the columns will remain the same.
import mindspore.dataset as ds
dataset = ds.FakeImageDataset(num_images=20, image_size=(32, 32, 3), num_classes=10, base_seed=0)
dataset = dataset.batch(10, drop_remainder=True)
iterator = dataset.create_tuple_iterator()
for data_tuple in iterator:
for data in data_tuple:
print(data.shape)
print("="*20)
(10, 32, 32, 3)
(10,)
====================
(10, 32, 32, 3)
(10,)
====================
Note that this writing method does not
shuffle
after traversing an epoch, so it may affect the precision when used in training. The above two methods are recommended when direct data iterations are needed during training.
import mindspore.dataset as ds
dataset = ds.FakeImageDataset(num_images=20, image_size=(32, 32, 3), num_classes=10, base_seed=0)
dataset = dataset.batch(10, drop_remainder=True)
for data in dataset:
for data in data:
print(data.shape)
print("="*20)
(10, 32, 32, 3)
(10,)
====================
(10, 32, 32, 3)
(10,)
====================
The latter two of these can be used directly when the order of data read is the same as the order required by the network.
for data in dataset:
loss = net(*data)
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。