# 2.5 Bias and Variance with mismatched data distributions

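When the training set comes from a different distribution than the dev and test sets, bias and variance have to be analyzed differently: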

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/5cbede5222b199f84dc491e0550435b6.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/5cbede5222b199f84dc491e0550435b6.png)

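The difficulty with this analysis is that two things change at once between the training error and the dev error, so it is hard to tell how much of the 9% increase in error comes from each:

* The algorithm has seen the training data but not the dev data (variance)
* The dev set data comes from a different distribution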

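To work out which factor matters more, define a new subset of data called the **training-dev set**. It is carved out of the training set's distribution, but it is not used to train the network.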

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/66fcfec7152b504adb2e6124291f4a68.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/66fcfec7152b504adb2e6124291f4a68.png)

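Randomly shuffle the training set and set aside part of it as the training-dev set. The training set and the training-dev set then come from the same distribution.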

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/6a3c48f8a71b678c2769165f38523635.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/6a3c48f8a71b678c2769165f38523635.png)

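Train the neural network on the training set only, and do not run backpropagation on the training-dev set. For error analysis, look at the classifier's error on the training set, on the training-dev set, and on the dev set.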
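The shuffle-and-hold-out step can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the function name `split_train_and_training_dev`, the `training_dev_fraction` parameter, and its default of 5% are assumptions made for the example.

```python
import random


def split_train_and_training_dev(examples, training_dev_fraction=0.05, seed=0):
    """Shuffle the training examples and hold out a fraction as training-dev.

    The held-out part comes from the same distribution as the training set
    but must never be used for gradient updates.
    """
    if not 0.0 < training_dev_fraction < 1.0:
        raise ValueError("training_dev_fraction must be strictly between 0 and 1")

    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    # Hold out at least one example, but never the whole set.
    n_holdout = max(1, int(round(len(shuffled) * training_dev_fraction)))
    n_holdout = min(n_holdout, len(shuffled) - 1)

    training_dev = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    return train, training_dev


train, training_dev = split_train_and_training_dev(range(1000), 0.25)
print(len(train), len(training_dev))
```

Only `train` would then be passed to the training loop; `training_dev` is used purely for evaluation, alongside the dev set.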

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/c5d2293143857294c49859eb875272f5.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/c5d2293143857294c49859eb875272f5.png)

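* Suppose the training error is 1%, the training-dev error is 9%, and the dev error is 10%. This is a variance problem: the training-dev error is measured on data from the same distribution as the training set, so although the network does well on the training set, it fails to generalize to unseen data from that same distribution.
* Suppose the training error is 1%, the training-dev error is 1.5%, and the dev error is 10%. Variance is small, and the error rises sharply only when moving to the dev set, so this is a **data mismatch** problem.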

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/b997fa8695062ca7332b18d51767b7df.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/b997fa8695062ca7332b18d51767b7df.png)

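* If the training error is 10%, the training-dev error is 11%, and the dev error is 12%, with human-level performance putting the estimate of Bayes error at roughly 0%, there is an avoidable bias problem.
* If the training error is 10%, the training-dev error is 11%, and the dev error is 20%, there are two problems:
  * Avoidable bias
  * Data mismatch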

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/5bbfa44bc294dd33f01346b1aa87d930.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/5bbfa44bc294dd33f01346b1aa87d930.png)

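If the test error is added as well and there is a large gap between dev set performance and test set performance, the model has probably overfit the dev set, and a larger dev set is needed.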

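Suppose human-level error is 4%, the training error is 7%, the training-dev error is 10%, and the dev error is 6%. The dev/test distribution may simply be much easier than the data the model was trained on, so the error can go down when moving to the dev set.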

[![](https://github.com/fengdu78/deeplearning_ai_books/raw/master/images/347df851fe3809b308850a9e14cfdbb0.png)](https://github.com/fengdu78/deeplearning_ai_books/blob/master/images/347df851fe3809b308850a9e14cfdbb0.png)

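The gap between human level (4%) and training error (7%) measures avoidable bias. The gap between training error (7%) and training-dev error (10%) measures variance. The gap between training-dev error (10%) and dev/test error (6%) measures data mismatch.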
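Worked out with the numbers from this example, the three gaps in percentage points are:

$$
\begin{aligned}
\text{avoidable bias} &= 7\% - 4\% = 3\% \\
\text{variance} &= 10\% - 7\% = 3\% \\
\text{data mismatch} &= 6\% - 10\% = -4\%
\end{aligned}
$$

A negative data mismatch gap is consistent with the dev/test data being easier than the training data.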

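Rearview mirror speech data 6% and Error on examples trained on 6%: to get these numbers, have people label rearview mirror speech data and measure how well humans do on the task. Then collect some rearview mirror speech data, put it in the training set so the network learns from it, and measure the error on that subset. If both the human-level figure and the error on examples trained on are 6%, the network has reached human-level performance on rearview mirror speech data.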

General speech recognition human level 4% versus rearview mirror speech data 6%: rearview mirror speech is harder than general speech recognition, because humans make 6% errors on it rather than 4%.

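**Summary**:

When the training set comes from a different distribution than the dev and test sets:

* This setup lets you use more training data, which can help the learning algorithm's performance
* The potential problems are no longer just bias and variance; data mismatch is a third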

![](/files/-Le0ccTcnrUyOgk1QDcZ)![](/files/-Le0ccTeI902ugx21jul)

![](/files/-Le0ccTgYezF24iAjWNU)

![](/files/-Le0ccTibAEdUUuppZ_p)

