Linux, Ascend, GPU, CPU, Data Preparation, Beginner, Intermediate, Expert
The mindspore.dataset module provided by MindSpore lets users customize how data is fetched from disk. Data processing and tokenization operators are applied to the data as it is loaded, and the pipelined processing supplies a continuous flow of data to the training network, which improves overall performance.
In addition, MindSpore supports data loading in distributed scenarios. Users can define the number of shards while loading. For more details, see Loading the Dataset in Data Parallel Mode.
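To make the idea of shards concrete, here is a minimal pure-Python sketch of splitting rows across workers. It does not call MindSpore. The helper name `take_shard` and the round-robin rule are assumptions made for illustration, and MindSpore may assign rows to shards differently.

```python
def take_shard(rows, num_shards, shard_id):
    """Return the rows that a hypothetical worker `shard_id` would read.

    Round-robin assignment is an assumption made for this sketch only.
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return rows[shard_id::num_shards]


rows = ["Welcome to Beijing!", "北京欢迎您!", "我喜欢English!"]
shards = [take_shard(rows, num_shards=2, shard_id=i) for i in range(2)]
for shard_id, shard in enumerate(shards):
    print(shard_id, shard)
```

In MindSpore itself, the dataset classes used in this tutorial accept `num_shards` and `shard_id` arguments for this purpose.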
This tutorial briefly demonstrates how to load and process text data using MindSpore.
Prepare the following text data.
Welcome to Beijing!
北京欢迎您!
我喜欢English!
Create the tokenizer.txt file, copy the text data into it, and save it under the ./test directory. The directory structure is as follows.
└─test
    └─tokenizer.txt
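If it is more convenient to create the file from a script than by hand, the step above could be sketched as follows. It uses only the Python standard library. The variable names are illustrative, and UTF-8 is assumed as the file encoding.

```python
from pathlib import Path

# The three sample sentences from this tutorial, one per line.
SAMPLE_LINES = ["Welcome to Beijing!", "北京欢迎您!", "我喜欢English!"]

data_file = Path("./test/tokenizer.txt")
data_file.parent.mkdir(parents=True, exist_ok=True)  # create ./test if missing
# Write as UTF-8 so the Chinese characters are stored correctly.
data_file.write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")

print(data_file.read_text(encoding="utf-8"))
```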
Import the mindspore.dataset and mindspore.dataset.text modules.
import mindspore.dataset as ds
import mindspore.dataset.text as text
MindSpore supports loading common text-processing datasets stored in a variety of on-disk formats. Users can also implement a custom dataset class to load their own data.
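As a rough illustration of what a custom dataset class can look like, the sketch below defines a random-access source in plain Python. The class name `LineDataset` and the decision to skip blank lines are assumptions for this example, not part of MindSpore.

```python
class LineDataset:
    """Random-access source exposing one text line per index (hypothetical helper)."""

    def __init__(self, lines):
        # Drop blank lines so every index maps to real text.
        self._lines = [line for line in lines if line.strip()]

    def __getitem__(self, index):
        # Return a one-element tuple: one value per output column.
        return (self._lines[index],)

    def __len__(self):
        return len(self._lines)


source = LineDataset(["Welcome to Beijing!", "", "北京欢迎您!", "我喜欢English!"])
print(len(source), source[1])
```

An object of this shape can be passed as the source of ds.GeneratorDataset together with column_names=["text"]. Depending on the MindSpore version, string values may first need to be converted to NumPy arrays.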
The following example loads a dataset with TextFileDataset from the mindspore.dataset module.
Configure the path of the dataset file as follows and create a dataset object.
DATA_FILE = "./test/tokenizer.txt"
dataset = ds.TextFileDataset(DATA_FILE, shuffle=False)
Create an iterator, then obtain data through the iterator.
for data in dataset.create_dict_iterator(output_numpy=True):
    print(text.to_str(data['text']))
The output without tokenization:
Welcome to Beijing!
北京欢迎您!
我喜欢English!
The following sections demonstrate data processing operations, such as SlidingWindow and shuffle, that can be applied after a dataset is created.
SlidingWindow
The following example demonstrates how to use SlidingWindow to slice text data.
Load the text dataset.
inputs = [["大", "家", "早", "上", "好"]]
dataset = ds.NumpySlicesDataset(inputs, column_names=["text"], shuffle=False)
Print the results without any data processing.
for data in dataset.create_dict_iterator(output_numpy=True):
    print(text.to_str(data['text']).tolist())
The output is as follows:
['大', '家', '早', '上', '好']
Perform the data processing operation.
dataset = dataset.map(operations=text.SlidingWindow(2, 0), input_columns=["text"])
Print the results after data processing.
for data in dataset.create_dict_iterator(output_numpy=True):
    print(text.to_str(data['text']).tolist())
The output is as follows:
[['大', '家'],
['家', '早'],
['早', '上'],
['上', '好']]
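The windowing shown above can be reproduced in plain Python, which may help when checking expected results by hand. This is a sketch of the behaviour for a one-dimensional list with a window width of 2, not MindSpore's implementation. The function name `sliding_window` is illustrative.

```python
def sliding_window(tokens, width):
    """Return every run of `width` consecutive tokens, advancing one token at a time."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]


windows = sliding_window(["大", "家", "早", "上", "好"], width=2)
for window in windows:
    print(window)
```

A list of n tokens yields n - width + 1 windows, so the five characters above give four pairs.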
shuffle
The following example demonstrates how to shuffle text data while loading a dataset.
Load and shuffle the text dataset.
inputs = ["a", "b", "c", "d"]
dataset = ds.NumpySlicesDataset(inputs, column_names=["text"], shuffle=True)
Print the results after performing shuffle.
for data in dataset.create_dict_iterator(output_numpy=True):
    print(text.to_str(data['text']).tolist())
The output is as follows (the order is random and may differ from run to run):
c
a
d
b
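Because shuffling is random, the order above changes between runs unless a seed is fixed. The following standard-library sketch illustrates that idea only. The helper `shuffled_copy` is hypothetical and does not reflect how MindSpore shuffles internally.

```python
import random


def shuffled_copy(rows, seed):
    """Return a shuffled copy of `rows`; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global state untouched
    result = list(rows)
    rng.shuffle(result)
    return result


rows = ["a", "b", "c", "d"]
first = shuffled_copy(rows, seed=0)
second = shuffled_copy(rows, seed=0)
print(first, first == second)
```

In MindSpore, the seed used by dataset operations can be fixed with ds.config.set_seed before the dataset is created.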
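The following example demonstrates how to use WhitespaceTokenizer to split text into words at whitespace. The dataset object here is the TextFileDataset loaded from ./test/tokenizer.txt earlier. If the dataset variable has been reassigned since then (as in the SlidingWindow and shuffle examples), create it again before applying the tokenizer.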
Create a tokenizer.
tokenizer = text.WhitespaceTokenizer()
Apply the tokenizer.
dataset = dataset.map(operations=tokenizer)
Create an iterator and obtain data through the iterator.
for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    token = text.to_str(i['text']).tolist()
    print(token)
The output after tokenization is as follows:
['Welcome', 'to', 'Beijing!']
['北京欢迎您!']
['我喜欢English!']
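The result above shows that text without spaces, such as the two Chinese sentences, stays a single token. A plain-Python approximation using str.split gives the same tokens for these three lines. It is a sketch for comparison, not MindSpore's tokenizer, and the function name `whitespace_tokenize` is illustrative.

```python
def whitespace_tokenize(line):
    """Split a line on runs of whitespace; text without spaces stays one token."""
    return line.split()


lines = ["Welcome to Beijing!", "北京欢迎您!", "我喜欢English!"]
tokens = [whitespace_tokenize(line) for line in lines]
for row in tokens:
    print(row)
```

Splitting Chinese text into words requires a tokenizer that does not rely on spaces.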