# 1.5 Train/dev/test distributions

Some rules and recommendations for setting up the training, development (dev), and test sets:

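* How the training, dev, and test sets are set up has a very large impact on the final product;
* The **dev set** and the **test set** should come from the same distribution, drawn at random from all of the available data;
* The data in the dev and test sets should resemble the data you expect, or want, to get in the future; in other words, the data used to evaluate the model should be similar to future data;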
* The test set only needs to be large enough to give high confidence in the overall performance of the system; around 10,000 examples may be enough;
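* The dev set only needs to be large enough to detect which of the algorithms or models being compared is better; with a dataset of a million examples, 1% is enough.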
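
The rules above can be sketched as a small helper. This is only an illustration under stated assumptions: the name `split_dev_test`, the fixed seed, and the choice to use the leftover examples as the training set are all hypothetical and not part of the notes. The point it shows is that shuffling one pooled dataset before slicing gives dev and test sets drawn from the same distribution.

```python
import random


def split_dev_test(examples, dev_size, test_size, seed=0):
    """Shuffle one pooled dataset, then slice dev and test sets out of it.

    Because both sets are cut from the same shuffled pool, they follow
    the same distribution. Using the remainder as the training set is an
    assumption made for this sketch.
    """
    if dev_size + test_size > len(examples):
        raise ValueError("dev_size + test_size exceeds the number of examples")
    pool = list(examples)                # copy so the caller's data is untouched
    random.Random(seed).shuffle(pool)    # seeded only to make the sketch repeatable
    dev = pool[:dev_size]
    test = pool[dev_size:dev_size + test_size]
    train = pool[dev_size + test_size:]
    return train, dev, test


# Hypothetical pooled dataset of one million example ids.
examples = list(range(1_000_000))

# 1% of a million for dev, and roughly 10,000 for test, as in the notes.
train, dev, test = split_dev_test(examples, dev_size=10_000, test_size=10_000)
print(len(train), len(dev), len(test))
```

If the data came from several sources (for example, different regions), the same idea applies: pool all sources first, then shuffle, so that no source ends up only in the dev set or only in the test set.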

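Try to make sure the dev set and the test set come from the same distribution and that both reflect the data seen in practice. If they do not come from the same distribution, the "best" model selected on the dev set often fails to perform well on the test set. It is like finding the arrow that lands closest to the bullseye of the dev-set target, only to discover that the test set's bullseye sits far away from it: the shot is bound to miss the bullseye on the test set.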

![](/files/-Le0cdpL8YXykT2gyFy7)

![](/files/-Le0cdpN7YugNKttHWLb)
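
The moving-target effect can be reproduced with a toy experiment. Everything here is a made-up assumption for illustration: the 1-D data, the threshold "models", and the names `make_set` and `accuracy` do not come from the notes. The sketch picks the best threshold on a dev set and then scores it on a test set from the same distribution and on one whose decision boundary has moved.

```python
def make_set(boundary, n=100):
    """Toy 1-D examples: x in [0, 1), labelled 1 when x >= boundary."""
    return [(i / n, int(i / n >= boundary)) for i in range(n)]


def accuracy(threshold, examples):
    """Fraction of examples a threshold classifier labels correctly."""
    correct = sum(int(x >= threshold) == y for x, y in examples)
    return correct / len(examples)


# Candidate "models": classifiers that predict 1 when x >= threshold.
candidates = [0.2, 0.5, 0.8]

dev_set = make_set(boundary=0.5)
test_same = make_set(boundary=0.5)      # same distribution as the dev set
test_shifted = make_set(boundary=0.8)   # the bullseye has moved

# Model selection is done on the dev set only.
best = max(candidates, key=lambda t: accuracy(t, dev_set))

print("selected threshold:", best)
print("accuracy, test set from dev distribution:", accuracy(best, test_same))
print("accuracy, test set from shifted distribution:", accuracy(best, test_shifted))
```

In this toy setup the threshold chosen on the dev set is perfect on the matching test set but misclassifies every example that lies between the old and the new boundary on the shifted one, even though another candidate would have scored perfectly there.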


