
TensorFlow Exercises 004

End-of-chapter exercises from Chapter 13 of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition
For a Chinese translation, please refer to the first edition of this book

Chapter 13 Exercises

  1. Why would you want to use the Data API?

  2. What are the benefits of splitting a large dataset into multiple files?

  3. During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?

  4. Can you save any binary data to a TFRecord file, or only serialized protocol buffers?

  5. Why would you go through the hassle of converting all your data to the Example protobuf format? Why not use your own protobuf definition?

  6. When using TFRecords, when would you want to activate compression? Why not do it systematically?

  7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?

  8. Name a few common techniques you can use to encode categorical features. What about text?

  9. Load the Fashion MNIST dataset (introduced in Chapter 10); split it into a training set, a validation set, and a test set; shuffle the training set; and save each dataset to multiple TFRecord files. Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image), and the label. Then use tf.data to create an efficient dataset for each set. Finally, train a Keras model on these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data. (A sketch of such a TFRecord pipeline follows this exercise list.)

  10. In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

    • Download the Large Movie Review Dataset, which contains 50,000 movie reviews from the Internet Movie Database. The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise.
    • Split the test set into a validation set (15,000) and a test set (10,000).
    • Use tf.data to create an efficient dataset for each set (a sketch of loading the review files follows this list).
    • Create a binary classification model, using a TextVectorization layer to preprocess each review. If the TextVectorization layer is not yet available (or if you like a challenge), try to create your own custom preprocessing layer: you can use the functions in the tf.strings package, for example lower() to make everything lowercase, regex_replace() to replace punctuation with spaces, and split() to split words on spaces. You should use a lookup table to output word indices, which must be prepared in the adapt() method.
    • Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model (a sketch of such a model follows this list).
    • Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.
    • Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews").
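For exercise 9, here is a minimal sketch of writing Fashion MNIST to sharded TFRecord files as serialized Example protobufs and reading them back with an efficient tf.data pipeline. The helper names, shard count, batch size, and shuffle buffer are illustrative choices, not the book's solution, and tf.data.AUTOTUNE assumes a recent TF 2.x:

```python
import tensorflow as tf

def serialize_example(image, label):
    # tf.io.serialize_tensor turns the 28x28 uint8 image into one byte string
    image_bytes = tf.io.serialize_tensor(image).numpy()
    features = tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    })
    return tf.train.Example(features=features).SerializeToString()

def write_tfrecords(prefix, dataset, n_shards=10):
    # Spreading the records over several files lets tf.data read them in parallel
    paths = ["{}-{:03d}.tfrecord".format(prefix, shard) for shard in range(n_shards)]
    writers = [tf.io.TFRecordWriter(path) for path in paths]
    for index, (image, label) in enumerate(dataset):
        writers[index % n_shards].write(serialize_example(image, label))
    for writer in writers:
        writer.close()
    return paths

feature_description = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_description)
    image = tf.reshape(tf.io.parse_tensor(example["image"], out_type=tf.uint8), [28, 28])
    return image, example["label"]

def make_dataset(paths, batch_size=32, shuffle_buffer=10_000):
    dataset = tf.data.TFRecordDataset(paths, num_parallel_reads=tf.data.AUTOTUNE)
    dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Usage: write the training set to 10 TFRecord files, then read it back.
(X_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
train_files = write_tfrecords("fashion_mnist_train",
                              tf.data.Dataset.from_tensor_slices((X_train, y_train)))
train_set = make_dataset(train_files)
```

The same two helpers can be reused for the validation and test sets; the preprocessing (standardization) layer and TensorBoard profiling asked for in the exercise are left out of this sketch.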
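For the first bullets of exercise 10, a sketch of loading the review files into a tf.data pipeline. It assumes the archive was extracted to aclImdb/ and that each review file is a single line of text (as in this dataset); the helper names and the read/shuffle/batch parameters are illustrative:

```python
import tensorflow as tf
from pathlib import Path

def review_paths(dirpath):
    # One review per *.txt file, e.g. aclImdb/train/pos/0_9.txt
    return [str(path) for path in Path(dirpath).glob("*.txt")]

def imdb_dataset(pos_paths, neg_paths, n_read_threads=5):
    # TextLineDataset yields one review per file; parallel reads keep the disk busy
    ds_pos = tf.data.TextLineDataset(pos_paths, num_parallel_reads=n_read_threads)
    ds_pos = ds_pos.map(lambda review: (review, 1))
    ds_neg = tf.data.TextLineDataset(neg_paths, num_parallel_reads=n_read_threads)
    ds_neg = ds_neg.map(lambda review: (review, 0))
    return ds_pos.concatenate(ds_neg)

train_set = imdb_dataset(review_paths("aclImdb/train/pos"),
                         review_paths("aclImdb/train/neg"))
train_set = train_set.shuffle(25_000).batch(32).prefetch(tf.data.AUTOTUNE)
```

The validation and test sets can be built the same way after splitting the test files 15,000/10,000 as the exercise describes.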
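For the model itself, a sketch of a binary classifier using a TextVectorization layer and a rescaled mean embedding. The vocabulary size, embedding size, layer widths, and the RescaledMeanEmbedding layer name are arbitrary choices for illustration; tf.keras.layers.TextVectorization assumes TF 2.6+ (in earlier versions it lives under tf.keras.layers.experimental.preprocessing):

```python
import tensorflow as tf

max_tokens = 10_000   # assumed vocabulary size
embed_dim = 128       # assumed embedding size

text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens, output_mode="int")
# The vocabulary must be built from the raw training texts before training:
# text_vectorization.adapt(train_set.map(lambda review, label: review))

class RescaledMeanEmbedding(tf.keras.layers.Layer):
    """Mean of the word embeddings, multiplied by sqrt(number of real words)."""
    def call(self, inputs):
        embeddings, token_ids = inputs
        not_padding = tf.cast(token_ids != 0, tf.float32)   # 0 is the padding id
        n_words = tf.reduce_sum(not_padding, axis=1, keepdims=True)
        summed = tf.reduce_sum(embeddings * tf.expand_dims(not_padding, -1), axis=1)
        mean = summed / tf.maximum(n_words, 1.0)
        return mean * tf.sqrt(n_words)

inputs = tf.keras.Input(shape=(), dtype=tf.string)  # one raw review string per example
token_ids = text_vectorization(inputs)
embeddings = tf.keras.layers.Embedding(max_tokens, embed_dim)(token_ids)
review_embedding = RescaledMeanEmbedding()([embeddings, token_ids])
hidden = tf.keras.layers.Dense(64, activation="relu")(review_embedding)
output = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
model = tf.keras.Model(inputs, output)
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
# model.fit(train_set, epochs=5, validation_data=valid_set)
```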

Check-in template

ID:

Link: