
# 2.4 Training and testing on different distributions

## Cat recognition

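Suppose you have collected only 10,000 photos uploaded by users of the mobile app, plus more than 200,000 high-resolution cat images downloaded from the web: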

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/9a6cbca750b289408a25789e224aeefc.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/9a6cbca750b289408a25789e224aeefc.png)

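**Option 1**: Merge the two datasets and randomly assign the resulting 210,000 images to the training, dev, and test sets. Suppose the dev and test sets have been fixed at 2,500 examples each, which leaves 205,000 examples for the training set.

* Advantage: the training, dev, and test sets all come from the same distribution
* Disadvantage: many of the 2,500 dev examples are web-downloaded images, which is not the distribution you actually care about, because the images the system must handle come from mobile phones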


[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/57c4cad6f0df4dc06ecf90c4f2d81a68.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/57c4cad6f0df4dc06ecf90c4f2d81a68.png)

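Of the 2,500 dev examples, about $$2500\times \frac{200k}{210k} \approx 2381$$ images come from web downloads, and on average only 119 come from mobile uploads. The purpose of the dev set is to tell the team which target to aim at, yet with this split most of the effort spent aiming at that target goes into optimizing for web-downloaded images.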
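The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the web count is taken as exactly 200,000, as in the formula.

```python
# Expected make-up of a 2,500-image dev set drawn at random from the pooled data.
n_web = 200_000   # images downloaded from the web (taken as exactly 200k)
n_app = 10_000    # images uploaded from the mobile app
dev_size = 2_500

web_share = n_web / (n_web + n_app)          # fraction of the pool that is web data
expected_web = round(dev_size * web_share)   # expected web images in the dev set
expected_app = dev_size - expected_web       # expected mobile images in the dev set

print(expected_web, expected_app)  # 2381 119
```

Under 5% of the dev set would reflect the distribution the product actually serves.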

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/eb0178687dedc450e1c184b958adeef3.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/eb0178687dedc450e1c184b958adeef3.png)

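Recommendation: make the dev and test sets 2,500 app images each, and build the training set from the 200,000 web images plus the remaining 5,000 app images. The target you are aiming at is now the one you actually want to hit: the image distribution you really care about.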
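A minimal sketch of this split is shown below. The helper `split_by_source` and the `("web", i)` / `("app", i)` placeholder records are hypothetical stand-ins for real image data, and the fixed seed is only there to make the shuffle reproducible.

```python
import random

def split_by_source(web_items, app_items, n_app_train=5_000, seed=0):
    """Train on all web data plus part of the app data; evaluate on app data only."""
    rng = random.Random(seed)
    app = list(app_items)
    rng.shuffle(app)  # pick the app training portion at random

    held_out = app[n_app_train:]      # app data reserved for evaluation
    half = len(held_out) // 2

    train = list(web_items) + app[:n_app_train]
    rng.shuffle(train)                # mix web and app examples in the training set
    dev = held_out[:half]
    test = held_out[half:]
    return train, dev, test

# Placeholder records standing in for images (hypothetical data).
web = [("web", i) for i in range(200_000)]
app = [("app", i) for i in range(10_000)]

train, dev, test = split_by_source(web, app)
print(len(train), len(dev), len(test))  # 205000 2500 2500
```

The dev and test sets contain only app images, so the metric tracks the target distribution, while the training distribution now differs from it.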

## Speech-activated rearview mirror

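Suppose you have a large amount of speech data that does not come from a speech-activated rearview mirror.

Split:

* Training set: 500k utterances; dev and test sets: 10k utterances each, collected from the actual speech-activated rearview mirror
* Alternatively, move half of the rearview-mirror data into the training set: 510k training utterances, with 5,000 each for the dev and test sets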
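The two allocations can be reproduced with a small helper. This is a sketch under one assumption: the total of 20,000 rearview-mirror utterances is inferred from the 10k + 10k dev/test figures rather than stated directly, and `split_sizes` is a hypothetical name.

```python
def split_sizes(n_other, n_target, target_in_train=0):
    """Return (train, dev, test) sizes when dev and test use only target-distribution data."""
    remaining = n_target - target_in_train   # target data left for evaluation
    dev = remaining // 2
    test = remaining - dev
    return n_other + target_in_train, dev, test

# Assumed totals: 500k utterances from other sources, 20k from the mirror (inferred).
print(split_sizes(500_000, 20_000))          # (500000, 10000, 10000)
print(split_sizes(500_000, 20_000, 10_000))  # (510000, 5000, 5000)
```

In both cases the dev and test sets draw only from the rearview-mirror data. The second allocation trades smaller evaluation sets for some in-distribution training data.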

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/ca34742f5f0b19239de5779dc80ad4d9.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/ca34742f5f0b19239de5779dc80ad4d9.png)

![](/files/-Le0cb6xlvatYCsmSxO0)

![](/files/-Le0cb6z6XJ3abAuJvEW)