# 2.4 Training and testing on different distributions

## Cat recognition

Suppose you have collected only 10,000 photos uploaded by users, plus more than 200,000 high-resolution cat images downloaded from the web:

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/9a6cbca750b289408a25789e224aeefc.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/9a6cbca750b289408a25789e224aeefc.png)

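**Option 1**: Merge the two datasets and randomly assign the 210,000 images to the training, dev, and test sets. Suppose the dev and test sets have been fixed at 2,500 examples each, leaving 205,000 examples for the training set.

* Advantage: the training, dev, and test sets all come from the same distribution
* Drawback: many of the 2,500 dev-set examples come from web downloads, which is not the distribution you actually care about, because the images you really need to handle come from phones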


[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/57c4cad6f0df4dc06ecf90c4f2d81a68.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/57c4cad6f0df4dc06ecf90c4f2d81a68.png)

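Of the 2,500 dev-set examples, an expected $$2500\times \frac{200k}{210k} \approx 2381$$ images come from web downloads, and on average only 119 come from phone uploads. The purpose of a dev set is to tell the team which target to aim at, yet with this split most of the effort goes into optimizing for web-downloaded images.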
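
The arithmetic above can be checked with a short Python sketch. The function name `expected_dev_composition` and its parameters are illustrative and not part of the original notes; it assumes the dev set is drawn uniformly at random from the merged pool.

```python
def expected_dev_composition(n_web, n_app, dev_size):
    """Expected number of web and app images in a dev set drawn
    uniformly at random from the merged pool (illustrative helper)."""
    total = n_web + n_app
    expected_web = dev_size * n_web / total
    expected_app = dev_size * n_app / total
    return expected_web, expected_app


# Numbers from the cat example: 200k web images, 10k app images, 2,500 dev examples.
web, app = expected_dev_composition(200_000, 10_000, 2_500)
print(round(web), round(app))  # 2381 119
```

Fewer than 5% of the dev examples would come from the distribution the application actually serves.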

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/eb0178687dedc450e1c184b958adeef3.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/eb0178687dedc450e1c184b958adeef3.png)

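Recommendation: build the dev and test sets from 2,500 app images each, and let the training set contain the 200,000 web images plus the remaining 5,000 app images. The target you aim at is now the target you actually want to hit, namely the image distribution you really care about.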
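
A minimal sketch of this recommended split, using only the Python standard library. The function `split_by_source`, the `("web", i)` / `("app", i)` placeholder records, and the fixed seed are assumptions made for illustration; a real pipeline would operate on file paths or dataset indices.

```python
import random


def split_by_source(web_items, app_items, n_dev, n_test, seed=0):
    """Dev and test sets come only from app data; the training set gets
    all web data plus the leftover app data (illustrative helper)."""
    rng = random.Random(seed)
    app_shuffled = list(app_items)
    rng.shuffle(app_shuffled)

    dev = app_shuffled[:n_dev]
    test = app_shuffled[n_dev:n_dev + n_test]
    train = list(web_items) + app_shuffled[n_dev + n_test:]
    rng.shuffle(train)  # mix web and app examples within the training set
    return train, dev, test


# Placeholder records standing in for the real images.
web_images = [("web", i) for i in range(200_000)]
app_images = [("app", i) for i in range(10_000)]

train, dev, test = split_by_source(web_images, app_images, 2_500, 2_500)
print(len(train), len(dev), len(test))  # 205000 2500 2500
```

The training set now follows a different distribution from the dev and test sets, but the dev and test sets measure performance on exactly the data the application will see.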

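## Speech-activated rearview mirror

Suppose you have a large amount of speech data that does not come from a speech-activated rearview mirror.

Two ways to allocate the data:

* Training set of 500k utterances; dev and test sets of 10k utterances each (collected from actual speech-activated rearview mirrors)
* Alternatively, move half of the rearview-mirror data into the training set: a training set of 510k utterances, with dev and test sets of 5,000 utterances each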
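
Both allocations follow from the same bookkeeping, sketched below. The helper `split_sizes` is hypothetical, and the total of 20k rearview-mirror utterances is an assumption inferred from the split sizes listed above (10k + 10k) rather than stated explicitly in these notes.

```python
def split_sizes(n_other, n_target, n_target_in_train):
    """Return (train, dev, test) sizes when some target-distribution data
    is moved into the training set and the rest is split evenly between
    dev and test (illustrative helper)."""
    if not 0 <= n_target_in_train <= n_target:
        raise ValueError("n_target_in_train must be between 0 and n_target")
    held_out = n_target - n_target_in_train
    dev = held_out // 2
    test = held_out - dev
    train = n_other + n_target_in_train
    return train, dev, test


# Assumed totals: 500k utterances from other sources, 20k from the mirror.
print(split_sizes(500_000, 20_000, 0))       # (500000, 10000, 10000)
print(split_sizes(500_000, 20_000, 10_000))  # (510000, 5000, 5000)
```

In either case the dev and test sets contain only rearview-mirror data, so they still reflect the target distribution.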

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/ca34742f5f0b19239de5779dc80ad4d9.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/ca34742f5f0b19239de5779dc80ad4d9.png)

![](/files/-Le0cb6xlvatYCsmSxO0)

![](/files/-Le0cb6z6XJ3abAuJvEW)

