What exactly does KFold in Python do?

Z时代
2024-01-10
Category: Q&A

I'm working through this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle

I've reached part 9, making predictions. There is some data in a DataFrame called titanic, which is then split into folds using:

# Generate cross validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

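I'm not sure what exactly this does or what kind of object kf is. I tried reading the documentation, but it didn't help much. Also, there are three folds (n_folds=3), so why does this later line only access train and test (and how do I know they are called train and test)?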

for train, test in kf:

Answer:

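KFold provides train/test indices for splitting data into training and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as the validation set, while the remaining k - 1 folds form the training set (source).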
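To make the mechanism concrete, here is a minimal pure-Python sketch of consecutive, unshuffled folds. The helper name `kfold_indices` is hypothetical and this is not scikit-learn's actual implementation; it only mimics the behaviour described above, assuming the remainder rows go to the first folds.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


splits = list(kfold_indices(6, 3))
for fit_rows, held_out_rows in splits:  # the loop variable names are arbitrary
    print(fit_rows, held_out_rows)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each iteration yields one pair, so three folds give three passes through the loop. The names train and test in `for train, test in kf:` are just the variables the pair is unpacked into; any other names would work the same way.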

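Suppose you have some data with indices from 1 to 10. If you use n_folds=k, then in the i-th iteration (i <= k) you get the i-th fold as the test indices, and the remaining (k - 1) folds (everything except the i-th fold) together as the train indices.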
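The snippets in this thread use the old `sklearn.cross_validation` module, which was deprecated in scikit-learn 0.18 and removed in 0.20. Assuming a newer release is installed, the ten-sample case can be sketched with `sklearn.model_selection.KFold` instead. The variable names `data` and `folds` are only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(1, 11)  # ten samples holding the values 1..10
kf = KFold(n_splits=3)   # consecutive folds, no shuffling by default

folds = []
# split() yields one (train_index, test_index) pair per fold.
for i, (train_index, test_index) in enumerate(kf.split(data), start=1):
    folds.append((train_index.tolist(), test_index.tolist()))
    print("iteration", i, "test:", test_index, "train:", train_index)
```

Note that the indices returned are 0-based row positions (0 to 9), not the values stored in `data`. With ten samples and three folds the first fold gets four rows and the other two get three each.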

An example

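import numpy as np
# Note: sklearn.cross_validation only exists in scikit-learn versions before 0.20.
from sklearn.cross_validation import KFold
x = [1,2,3,4,5,6,7,8,9,10,11,12]
# 12 samples split into 3 consecutive folds of 4 indices each.
kf = KFold(12, n_folds=3)
# Each iteration yields one (train indices, test indices) pair.
for train_index, test_index in kf:
    print(train_index, test_index)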

Output