In reinforcement learning, a ReplayBuffer is a common basic data storage structure whose function is to store the data obtained from an agent's interaction with its environment.
Typically, people use native Python data structures or NumPy data structures to construct a ReplayBuffer, and general reinforcement learning frameworks also provide standard API encapsulations. In contrast, MindSpore implements the ReplayBuffer structure on the device side. On the one hand, this structure reduces frequent data copying between Host and Device when using GPU hardware; on the other hand, expressing the ReplayBuffer in the form of MindSpore operators makes it possible to build a complete IR graph and enable the various graph optimizations of MindSpore GRAPH_MODE, improving overall performance.
In MindSpore, two kinds of ReplayBuffer are provided: UniformReplayBuffer and PriorityReplayBuffer, used for common FIFO storage and priority-based storage, respectively. The following is an example of the UniformReplayBuffer implementation and usage.
The ReplayBuffer is represented as a list of Tensors, where each Tensor represents a set of data stored by column (for example, a set of [state, action, reward]). Data newly put into the UniformReplayBuffer is updated with a FIFO mechanism, and the buffer provides insert, search, and sample operations.
Create a UniformReplayBuffer with the initialization parameters batch_size, capacity, shapes, and types.
The insert method takes a set of data as input and requires that the shape and type of the data match the parameters with which the UniformReplayBuffer was created. It has no output. To simulate the FIFO behavior of a circular queue, we use two cursors to track the head of the queue and its effective length count. The following figure shows the process of several insertion operations.
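The cursor bookkeeping described above can be sketched in plain Python. This is an illustrative toy, not the MindSpore operator implementation; the class and names here are hypothetical:

```python
class FifoBuffer:
    """Illustrative circular FIFO buffer driven by a head cursor and a count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.head = 0    # position where the next element will be written
        self.count = 0   # number of valid elements currently stored

    def insert(self, item):
        self.data[self.head] = item
        self.head = (self.head + 1) % self.capacity       # wrap around at capacity
        self.count = min(self.count + 1, self.capacity)   # count saturates at capacity

buf = FifoBuffer(3)
for i in range(5):
    buf.insert(i)
# After 5 insertions into a capacity-3 buffer, the oldest elements
# (0 and 1) have been overwritten by 3 and 4.
```

Once the buffer is full, each new insertion silently evicts the oldest element, which is exactly the FIFO behavior the two cursors provide without ever moving existing data.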
The search method accepts an index as input, indicating the specific location of the data to be found. The output is a set of Tensors, as shown in the following figure:
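Because the buffer stores data by column, looking up one transition means gathering the same row from every column. A minimal NumPy sketch (illustrative only; `get_row` and the column shapes are hypothetical):

```python
import numpy as np

# Illustrative columns of a capacity-4 buffer: state, action, reward.
state = np.zeros((4, 2), dtype=np.float32)
action = np.zeros((4, 1), dtype=np.int32)
reward = np.zeros((4, 1), dtype=np.float32)
columns = [state, action, reward]

def get_row(columns, index):
    """Gather the transition stored at `index`: one row from each column."""
    return [col[index] for col in columns]

# Store a transition at index 1, then look it up.
state[1] = [0.5, 0.6]
action[1] = 2
reward[1] = 1.0
item = get_row(columns, 1)
```

The output is one array per column, which matches the "set of Tensors" returned by the search operation.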
The sample method has no input, and its output is a set of Tensors whose size is the batch_size given when the UniformReplayBuffer was created, as shown in the following figure. Assuming batch_size is 3, a random set of indices is generated inside the operator, and this random set of indices falls into two cases: the indices can be used in their raw random order, or sorted first so that samples are gathered in storage order (order preserving). Both approaches have a slight impact on randomness; the default is no order preserving, which gives the best performance.
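Both variants can be sketched with NumPy. This is illustrative only; the real sampling runs inside the MindSpore operator, and the names here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
count = 10       # number of valid elements currently in the buffer
batch_size = 3

# Case 1 (default): use the raw random indices directly.
indices = rng.integers(0, count, size=batch_size)

# Case 2 (order preserving): sort the indices so rows are gathered
# in storage order.
ordered = np.sort(indices)

# Gather the sampled rows from an illustrative state column.
state = np.arange(count, dtype=np.float32).reshape(count, 1)
batch = state[indices]
```

Sorting makes the gather access memory sequentially, which is why the unsorted default trades a touch of that locality for better end-to-end performance.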
MindSpore Reinforcement Learning provides a standard ReplayBuffer API. Users can have the framework create the ReplayBuffer through a configuration file, similar to the configuration file of DQN:
'replay_buffer': {'number': 1,
                  'type': UniformReplayBuffer,
                  'capacity': 100000,
                  'data_shape': [(4,), (1,), (1,), (4,)],
                  'data_type': [ms.float32, ms.int32, ms.float32, ms.float32],
                  'sample_size': 64}
Alternatively, users can use the interfaces directly to create the required data structures:
from mindspore_rl.core.uniform_replay_buffer import UniformReplayBuffer
import mindspore as ms
sample_size = 2
capacity = 100000
shapes = [(4,), (1,), (1,), (4,)]
types = [ms.float32, ms.int32, ms.float32, ms.float32]
replaybuffer = UniformReplayBuffer(sample_size, capacity, shapes, types)
Taking the UniformReplayBuffer created through the API above as an example, the data manipulation is as follows:
state = ms.Tensor([0.1, 0.2, 0.3, 0.4], ms.float32)
action = ms.Tensor([1], ms.int32)
reward = ms.Tensor([1], ms.float32)
new_state = ms.Tensor([0.4, 0.3, 0.2, 0.1], ms.float32)
replaybuffer.insert([state, action, reward, new_state])
replaybuffer.insert([state, action, reward, new_state])
exp = replaybuffer.get_item(0)    # search: fetch the transition at index 0
samples = replaybuffer.sample()   # sample a batch of size sample_size
replaybuffer.reset()              # clear the buffer
size = replaybuffer.size()        # number of valid elements currently stored
if replaybuffer.full():           # whether the buffer has reached capacity
    print("Full use of this buffer.")