House Price Prediction | PyTorch
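I. Data Preprocessing
Z-score normalization, also called standardization, is a common preprocessing technique that rescales each feature to mean 0 and standard deviation 1: z = (x - mean) / std. It changes only the location and scale of a feature, not the shape of its distribution, so a skewed feature is still skewed after scaling.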
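The scaling can be illustrated with a minimal sketch in plain Python. The function name `z_score` and the sample values are hypothetical, and the sample standard deviation (dividing by n - 1) is used because that is the default of pandas' `std()`.

```python
import statistics

def z_score(values):
    """Rescale a list of numbers to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (n - 1), like pandas' default
    return [(v - mean) / std for v in values]

areas = [50.0, 80.0, 110.0, 140.0]  # hypothetical living areas
scaled = z_score(areas)
print(scaled)
print(statistics.fmean(scaled), statistics.stdev(scaled))
```

After scaling, features measured in very different units (square feet, years, counts) share a comparable range, which tends to make gradient-based training better conditioned.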
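One-hot encoding is a common technique for categorical features. It converts a categorical variable with n distinct values into n binary indicator columns, exactly one of which is 1 for each sample, so that a machine learning model can consume non-numeric data without assuming any ordering among the categories.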
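As a rough sketch of the idea (not how pandas implements it), the encoding can be written in a few lines of plain Python. The helper `one_hot` and the sample values are hypothetical; `None` stands for a missing value, and the `dummy_na` flag mimics the option of giving missing values their own indicator column.

```python
def one_hot(values, dummy_na=False):
    """Return (columns, rows) for a one-hot encoding of one categorical column.

    None represents a missing value. With dummy_na=True, missing values get
    their own indicator column instead of an all-zero row.
    """
    columns = sorted({v for v in values if v is not None})
    if dummy_na:
        columns.append(None)
    rows = [[1 if v == c else 0 for c in columns] for v in values]
    return columns, rows

zoning = ["RL", "RM", None, "RL"]  # hypothetical categorical column
columns, rows = one_hot(zoning, dummy_na=True)
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, so the encoding carries the category identity without implying that one category is "larger" than another.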
II. Neural Network
This project uses a fully connected network, in which every neuron in one layer is connected to every neuron in the next layer.
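1. Input layer:
Receives the raw data and passes it on without performing any computation.
Number of neurons = dimension of the input data (the number of features in the dataset).
2. Hidden layer:
Sits between the input layer and the output layer and extracts features from the data.
This article uses a single hidden layer with 100 neurons.
3. Output layer:
Produces the model's final result (here, the predicted value of the regression task).
Number of neurons = dimension of the target (1 for this regression task).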
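To get a feel for the size of such a network, the number of trainable parameters can be counted directly: a fully connected layer with n_in inputs and n_out outputs has n_in * n_out weights plus n_out biases. The sketch below assumes, purely for illustration, 330 input features; the real count depends on how many columns the preprocessing produces.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(n_features, n_hidden=100, n_out=1):
    """Parameters of an input -> hidden -> output fully connected network."""
    return linear_params(n_features, n_hidden) + linear_params(n_hidden, n_out)

n_features = 330  # hypothetical feature count after preprocessing
print(mlp_params(n_features))
```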
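4. Activation function
Activation functions introduce a nonlinear transformation. Without one, a stack of linear layers collapses into a single linear map, so the network could not fit complex patterns.
This project uses the ReLU function:
$$\mathrm{ReLU}(x) = \max(0, x)$$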
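The effect of the nonlinearity can be seen in a tiny forward pass written in plain Python. The weights below are hand-picked for illustration (not trained, and not taken from this project): with ReLU between the two layers, a two-neuron hidden layer computes |x1 - x2|, which no purely linear model can represent.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def linear(inputs, weights, biases):
    """Fully connected layer; weights[j][i] connects input i to output j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, w1, b1, w2, b2):
    """Input -> hidden (ReLU) -> single output."""
    hidden = [relu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)[0]

# Hand-picked illustrative weights: hidden = [relu(x1 - x2), relu(x2 - x1)]
w1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0]], [0.0]

print(forward([3.0, 1.0], w1, b1, w2, b2))
print(forward([1.0, 3.0], w1, b1, w2, b2))
```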
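5. Loss calculation
A loss function measures the gap between the predictions and the true labels.
This article trains the network with mean squared error (MSE) and evaluates it with the root mean squared error between the logarithms of the predicted and true prices (log RMSE):
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log \hat{y}_i - \log y_i\right)^2}$$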
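A plain-Python sketch of the log RMSE metric follows. It takes lists of numbers rather than a network and tensors, so it is only an illustration of the formula. Predictions are clipped to at least 1 before the logarithm, as the project code does. Because the metric compares logarithms, it measures relative error: being 10% too high costs the same for a cheap house as for an expensive one.

```python
import math

def log_rmse(preds, labels):
    """Root mean squared error between log-predictions and log-labels."""
    clipped = [max(p, 1.0) for p in preds]  # avoid log of zero or negatives
    squared = [(math.log(p) - math.log(y)) ** 2
               for p, y in zip(clipped, labels)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices: both predictions are 10% too high
cheap = log_rmse([110_000.0], [100_000.0])
expensive = log_rmse([1_100_000.0], [1_000_000.0])
print(cheap, expensive)
```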
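6. Regularization
L2 regularization (the penalty used in ridge regression) pushes parameter values toward zero.
It adds the sum of squared parameters to the loss function as a penalty term:
$$L = L_0 + \lambda \sum_i w_i^2$$
where $L_0$ is the original loss and $\lambda \ge 0$ controls the strength of the penalty (some texts write the coefficient as $\lambda/2$).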
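A minimal numeric sketch of the penalty is shown below. The function names and the values are hypothetical; `lam` plays the role of the penalty coefficient. Note that PyTorch optimizers apply the `weight_decay` argument differently in form: they add `weight_decay * w` to each gradient, which corresponds to a penalty of `(weight_decay / 2)` times the sum of squares.

```python
def l2_penalty(weights):
    """Sum of squared parameters."""
    return sum(w * w for w in weights)

def regularized_loss(base_loss, weights, lam):
    """Original loss plus the L2 penalty scaled by lam."""
    return base_loss + lam * l2_penalty(weights)

weights = [3.0, -4.0]  # hypothetical parameters, sum of squares = 25
print(regularized_loss(0.5, weights, lam=0.0))
print(regularized_loss(0.5, weights, lam=0.01))
```

Larger weights are penalized quadratically, so the optimizer is pushed toward solutions with small parameter values unless the data strongly supports larger ones.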
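7. Optimizer
Adam (Adaptive Moment Estimation) is one of the most popular optimization algorithms in deep learning. It combines the advantages of AdaGrad and RMSProp and adapts the learning rate of each parameter individually.
The raw gradient $g_t$ is transformed as follows:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
($t$ is the iteration number; $\beta_1$ and $\beta_2$ are the decay rates of the two moving averages.)
The parameters are then updated by:
$$\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $\eta$ is the learning rate and $\epsilon$ is a small constant that prevents division by zero.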
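The update rule can be sketched for a single scalar parameter in plain Python. This is an illustration of the standard Adam equations, not PyTorch's implementation; the function name `adam_step` and the toy objective f(theta) = (theta - 3)^2 are assumptions made for the example, and the default values of `beta1`, `beta2`, and `eps` are the commonly used ones.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment moving average
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment moving average
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (hypothetical): f(theta) = (theta - 3)^2, gradient 2 * (theta - 3)
theta, m, v = 0.0, 0.0, 0.0
first_theta = None
for t in range(1, 2001):
    grad = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
    if t == 1:
        first_theta = theta
print(first_theta, theta)
```

Thanks to the bias correction, the very first step has a magnitude of about the learning rate regardless of how large the gradient is, which is one reason Adam is fairly insensitive to the scale of the loss.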
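8. Cross-validation
- Assess model stability: a single train/test split can make the evaluation result fluctuate considerably.
- Make full use of the data: when data is limited, no samples are wasted on a fixed held-out set.
- Detect overfitting: the model's performance is checked more thoroughly across different subsets of the data.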
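K-fold cross-validation (K-fold CV)
- Steps:
  - Split the dataset evenly into K subsets (folds).
  - In turn, use one subset as the validation set and the remaining K-1 subsets as the training set.
  - Repeat K times and average the validation scores.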
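The steps above can be sketched with index lists in plain Python. The helper `k_fold_indices` is hypothetical; like the project's `get_k_fold_data`, it uses contiguous folds of size `n_samples // k`, so any leftover samples at the end are not used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for each of the k folds."""
    assert k > 1
    fold_size = n_samples // k
    used = k * fold_size  # samples beyond this index are dropped
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        valid = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, used))
        yield train, valid

folds = list(k_fold_indices(n_samples=10, k=3))
for train, valid in folds:
    print("train:", train, "valid:", valid)
```

Each sample inside the used range appears in exactly one validation fold, so averaging the K validation scores evaluates the model on all of that data without ever validating on samples it was trained on in the same round.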
III. Code
Imports
import pandas as pd
import torch
from d2l import torch as d2l
from torch import nn
from torch.utils.tensorboard import SummaryWriter
Load the data
train_data = pd.read_csv('data/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('data/house-prices-advanced-regression-techniques/test.csv')
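Inspect the dataset: print the shape of each set, then the first four rows of the first four and last three columns
print(train_data.shape)
print(test_data.shape)
print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])
print(test_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])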
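Data preprocessing
The 'Id' column carries no information useful for training, so it is dropped. The label column (SalePrice, the last column of the training set) is separated from the features, and the training and test features are then concatenated so that both are preprocessed in the same way.
# Drop 'Id' (first column) from both sets and the label (last column) from the training set
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))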
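Select the numeric feature columns, standardize them, and fill missing values with 0 (which is the mean after standardization)
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
# Z-score normalization: subtract the column mean and divide by the column std
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization every column has mean 0, so filling with 0 is mean imputation
all_features[numeric_features] = all_features[numeric_features].fillna(0)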
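One-hot encode the non-numeric features; dummy_na=True treats a missing value as a valid category with its own indicator column
# dtype=float keeps the indicator columns numeric. Recent pandas versions return
# bool columns by default, which torch.tensor cannot convert together with floats.
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)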
Split the combined features back into the training and test sets and extract the training labels, ready for training
n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
# Labels are reshaped to a column vector of shape (n_train, 1)
train_labels = torch.tensor(train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32)
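Define the loss function. MSELoss is the mean squared error, i.e.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
The clamp call bounds the predictions below by 1, so the logarithm is never taken of zero or a negative number
loss = nn.MSELoss()

def log_rmse(net, features, labels):
    # Clamp predictions to [1, inf) so that the log is well defined
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()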
Define the network architecture
in_features = train_features.shape[1]
temp_features = 100  # number of neurons in the hidden layer

def get_net():
    net = nn.Sequential(
        nn.Linear(in_features, temp_features),
        nn.ReLU(),
        nn.Linear(temp_features, 1)
    )
    return net
Build the data iterator, set up the Adam optimizer, and log the training curves to TensorBoard
writer = SummaryWriter("./log/houseprices")

def train(net, i, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # weight_decay applies L2 regularization to the parameters
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate,
                                 weight_decay=weight_decay)
    for epoch in range(num_epochs):
        for x, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(x), y)
            l.backward()
            optimizer.step()
        # Record the log RMSE on the full training set once per epoch
        train_ls.append(log_rmse(net, train_features, train_labels))
        writer.add_scalar("{}_fold/train".format(i + 1), train_ls[epoch], epoch)
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
            writer.add_scalar("{}_fold/test".format(i + 1), test_ls[epoch], epoch)
    return train_ls, test_ls
Split the training set into k folds; return the i-th fold as the validation set and the rest as the training set
def get_k_fold_data(k, i, X, y):
    assert k > 1
    # Integer division: any leftover samples at the end are not used
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid
K-fold cross-validation
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()  # train a fresh network for every fold
        train_ls, valid_ls = train(net, i, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        # Use the log RMSE after the last epoch as the score of this fold
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k
Train on the full training set, run the test set through the trained network to obtain predictions, and write them to a submission file
def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    # i=5 only selects the TensorBoard tag ("6_fold/train") for this final run
    train_ls, _ = train(net, 5, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    print(f'train log rmse {float(train_ls[-1]):f}')
    # Apply the trained network to the test set
    preds = net(test_features).detach().numpy()
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)
Full code (the hyperparameters below are arbitrary and need to be tuned):
import os

import pandas as pd
import torch
from d2l import torch as d2l
from torch import nn
from torch.utils.tensorboard import SummaryWriter

# Load the data
train_data = pd.read_csv('data/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('data/house-prices-advanced-regression-techniques/test.csv')

# Inspect the data
# print(train_data.shape)
# print(test_data.shape)
# print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])
# print(test_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])

# Drop the Id column (and the label column of the training set)
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

# Data preprocessing
# Select the numeric features
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
# Standardize
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# Fill missing values with 0, the mean after standardization
all_features[numeric_features] = all_features[numeric_features].fillna(0)
# One-hot encode; dtype=float avoids bool columns in recent pandas versions
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)

# Build the training and test tensors
n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32)

# Loss function
loss = nn.MSELoss()

def log_rmse(net, features, labels):
    # Clamp predictions to [1, inf) so that the log is well defined
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()

# Network architecture
in_features = train_features.shape[1]
temp_features = 100  # number of neurons in the hidden layer

def get_net():
    net = nn.Sequential(
        nn.Linear(in_features, temp_features),
        nn.ReLU(),
        nn.Linear(temp_features, 1)
    )
    return net

writer = SummaryWriter("./log/houseprices")

def train(net, i, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate,
                                 weight_decay=weight_decay)
    for epoch in range(num_epochs):
        for x, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(x), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        writer.add_scalar("{}_fold/train".format(i + 1), train_ls[epoch], epoch)
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
            writer.add_scalar("{}_fold/test".format(i + 1), test_ls[epoch], epoch)
    return train_ls, test_ls

def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, i, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, 5, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    print(f'train log rmse {float(train_ls[-1]):f}')
    preds = net(test_features).detach().numpy()
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)

# Append the hyperparameters and the cross-validation result to a log file
os.makedirs('./weight', exist_ok=True)  # make sure the output directory exists
with open('./weight/houseprice.txt', 'a', encoding='utf-8') as file:
    file.write('-' * 50)
    file.write('\n')
    line = 'k:{},num_epochs:{},lr:{},weight_decay:{},batch_size:{},temp_features:{}\n'.format(
        k, num_epochs, lr, weight_decay, batch_size, temp_features)
    file.write(line)
    line = (f'{k}-fold validation: avg train log rmse {float(train_l):f}, '
            f'avg valid log rmse {float(valid_l):f}\n')
    file.write(line)

print(f'{k}-fold validation: avg train log rmse {float(train_l):f}, '
      f'avg valid log rmse {float(valid_l):f}')
train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)