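LightFM Recommender System Framework: Study Notes
The first thing to note about LightFM is that it is not simply an implementation of the FM (factorization machine) algorithm. It is a recommender framework that can use implicit feedback together with user and item metadata. If user/item side information is left out, it reduces to collaborative filtering (CF), built on the matrix factorization (MF) idea of decomposing users and items into latent vectors.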
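To make the reduction to matrix factorization concrete, here is a minimal NumPy sketch. It is not LightFM's actual code. It assumes the formulation from the LightFM paper, in which a user's (or item's) latent vector is the sum of the embeddings of its features; with one indicator feature per user and per item, every user and item simply owns one embedding row, which is plain MF. All sizes and values are made up, and the bias terms are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 3, 2

# One indicator feature per user / per item (identity matrices).
user_features = np.eye(n_users)
item_features = np.eye(n_items)

# One embedding row per feature (hypothetical random values).
user_feature_emb = rng.normal(size=(n_users, dim))
item_feature_emb = rng.normal(size=(n_items, dim))

# Representation = sum of the embeddings of the active features.
user_repr = user_features @ user_feature_emb
item_repr = item_features @ item_feature_emb

# Score matrix = dot products (biases omitted for brevity).
scores = user_repr @ item_repr.T

# With identity features this is exactly plain matrix factorization.
assert np.allclose(scores, user_feature_emb @ item_feature_emb.T)
print(scores.shape)
```

Replacing the identity matrices with real metadata indicators is what lets users or items without interactions still receive a latent vector.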
Features
- Integrates the BPR and WARP ranking losses
- Multithreaded
- Incorporates both item and user metadata, which makes it possible to handle user/item cold start
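For intuition about the ranking losses in the list above, the BPR objective for a single (positive, negative) item pair can be sketched in a few lines. This is the textbook formula, -log(sigmoid(s_pos - s_neg)), written in plain Python. It is not LightFM's Cython implementation, and `bpr_pair_loss` is a hypothetical helper name. WARP, which keeps sampling negatives until it finds a rank violation, is not shown.

```python
import math

def bpr_pair_loss(pos_score: float, neg_score: float) -> float:
    """BPR loss for one pair: -log(sigmoid(pos_score - neg_score))."""
    diff = pos_score - neg_score
    # log1p(exp(-diff)) equals -log(sigmoid(diff)); branch for numerical stability.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss shrinks as the positive item is scored further above the negative.
well_ranked = bpr_pair_loss(2.0, -1.0)
mis_ranked = bpr_pair_loss(-1.0, 2.0)
assert well_ranked < mis_ranked
print(well_ranked, mis_ranked)
```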
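Notes on problems encountered
1. Input data
Judging from the LightFM API, there are three main kinds of input: interactions, user_features, and item_features, which are also the usual inputs of any recommender system. Note that the data must be stored as a csr_matrix or coo_matrix, so it has to be converted first. The reason is that one-hot encoded categorical variables are huge and sparse, and a sparse format saves memory.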
- csr_matrix(compressed sparse row matrix)
- coo_matrix(sparse matrix in coordinate format)
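A small sketch of the two formats, using made-up (user, item, value) triples: COO is convenient for building a matrix from such triples, and CSR is what you would typically convert to afterwards for fast row access.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interaction triples (hypothetical data): user index, item index, value.
rows = np.array([0, 0, 2, 3])
cols = np.array([1, 2, 0, 2])
vals = np.array([1.0, 1.0, 1.0, 1.0])

# COO stores the triples as given.
coo = coo_matrix((vals, (rows, cols)), shape=(4, 3))

# CSR stores column indices and values row by row, plus row pointers.
csr = coo.tocsr()
print(csr.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
print(csr.indices)  # column index of each stored value
print(csr.nnz, "stored values out of", csr.shape[0] * csr.shape[1])
```

Only the non-zero entries are stored, which is where the memory saving on a sparse interaction matrix comes from.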
```python
import pandas as pd
from scipy.sparse import csr_matrix, coo_matrix

# Pivot the long-format log into a user x item table; missing pairs become 0.
fm_data = pd.pivot_table(train_data, index="user_id", columns="yewu_id",
                         values="label", aggfunc="sum").fillna(0)
interactions = csr_matrix(fm_data.values)
# interactions
# <609885x28 sparse matrix of type ''
#  with 673172 stored elements in Compressed Sparse Row format>
```
Besides the scipy interface, LightFM also ships its own helper for this, lightfm.data.Dataset.
```python
from lightfm.data import Dataset

dataset = Dataset()
# Register every user id and item id so that each gets an internal index.
dataset.fit((x[1]['user_id'] for x in train.iterrows()),
            (x[1]['yewu_id'] for x in train.iterrows()))
num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))
# Num users: 609885, num_items 28.

(interactions, weights) = dataset.build_interactions(
    ((x[1]['user_id'], x[1]['yewu_id']) for x in train.iterrows()))
```
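Conceptually, the Dataset helper assigns consecutive internal indices to the raw user and item ids and then expresses the interactions in terms of those indices. The pure-Python sketch below imitates that bookkeeping on made-up ids. It is only an illustration of the idea, not LightFM's implementation, and `build_index` is a hypothetical helper.

```python
def build_index(ids):
    """Map each distinct id to a consecutive integer, in first-seen order."""
    index = {}
    for raw_id in ids:
        index.setdefault(raw_id, len(index))
    return index

# Hypothetical raw (user_id, item_id) pairs.
pairs = [("u42", "plan_a"), ("u7", "plan_b"), ("u42", "plan_b")]

user_index = build_index(u for u, _ in pairs)
item_index = build_index(i for _, i in pairs)

# Interactions expressed as (row, col) positions of a sparse matrix.
coords = [(user_index[u], item_index[i]) for u, i in pairs]
shape = (len(user_index), len(item_index))
print(shape, coords)
```

In LightFM itself, the actual mappings built by a fitted Dataset can be inspected with `dataset.mapping()`.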
2. fit_partial()
First, the description from the source code:
Unlike fit, repeated call to this method will cause training to resume from the current model state.
Here "resume" means continuing from where training last stopped, i.e. incremental training. Reading the source makes this clear as well.
```python
# In the source, fit simply delegates to fit_partial
def fit(
    self,
    interactions,
    user_features=None,
    item_features=None,
    sample_weight=None,
    epochs=1,
    num_threads=1,
    verbose=False,
):
    self._reset_state()  # Discard old results, if any

    return self.fit_partial(
        interactions,
        user_features=user_features,
        item_features=item_features,
        sample_weight=sample_weight,
        epochs=epochs,
        num_threads=num_threads,
        verbose=verbose,
    )
```
As shown, fit calls fit_partial, but it runs _reset_state first, which wipes the previously learned parameters.
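The practical difference can be seen with a toy stand-in that only counts epochs. `ToyModel` is a hypothetical class written for this note. It mimics the reset-then-delegate structure of the source shown above but does no real training.

```python
class ToyModel:
    """Hypothetical stand-in that mimics the fit / fit_partial relationship."""

    def __init__(self):
        self._reset_state()

    def _reset_state(self):
        # Forget everything learned so far.
        self.epochs_trained = 0

    def fit_partial(self, epochs=1):
        # Continue from the current state.
        self.epochs_trained += epochs
        return self

    def fit(self, epochs=1):
        # Start over, then delegate to fit_partial.
        self._reset_state()
        return self.fit_partial(epochs=epochs)


model = ToyModel()
model.fit_partial(epochs=5).fit_partial(epochs=5)
resumed = model.epochs_trained      # training accumulated across calls

model.fit(epochs=5)
restarted = model.epochs_trained    # earlier state was discarded
print(resumed, restarted)
```

So fit_partial is the one to use for incremental training, for example when evaluating after every few epochs, while fit always starts from scratch.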
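3. Item cold start
The official documentation already provides an item cold-start example [4]:
In that example the test set holds 10% of the interactions. Some of them involve items that also appear in the training set, and the rest involve items that have no interactions at all in the training set.
```python
import numpy as np
from lightfm.datasets import fetch_stackexchange

data = fetch_stackexchange('crossvalidated',
                           test_set_fraction=0.1,
                           indicator_features=False,
                           tag_features=True)
train = data['train']
test = data['test']

train.shape
test.shape
# (3213, 72360)
```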
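Inspection shows that the test set and the training set have the same shape, which means that items without any interactions are also part of the training matrix. This is the interesting point: even if an item has no interactions at all, it still needs its own column in the interaction matrix at training time.
So what does item cold start actually require? From [5] we learn that an ID alone is certainly not enough; what is needed are item-side features.
```python
item_features = data['item_features']
tag_labels = data['item_feature_labels']

print('There are %s distinct tags, with values like %s.'
      % (item_features.shape[1], tag_labels[:3].tolist()))
# There are 1246 distinct tags, with values like [u'bayesian', u'prior', u'elicitation'].

item_features.shape
# (72360, 1246)
```
As shown, item_features has one row per item, so its first dimension matches the item dimension of the interaction matrix used in training.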
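The following sketch illustrates why that alignment matters for cold start. It assumes, as in the LightFM paper, that an item's vector is the sum of its feature (tag) embeddings. The data, the sizes, and the "learned" embeddings are all made up, and this is not LightFM's internal code.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_users, n_items, n_tags, dim = 5, 3, 4, 2

# Row i lists the tags of item i (hypothetical data). Item 2 is "cold":
# it may have no interactions, but it still has a row of tags.
item_features = csr_matrix(np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]))

# The interaction matrix has one column per item, so the feature matrix
# must have one row per item.
interactions_shape = (n_users, n_items)
assert item_features.shape[0] == interactions_shape[1]

# Pretend these tag embeddings were learned from the warm items.
tag_embeddings = np.arange(n_tags * dim, dtype=float).reshape(n_tags, dim)

# Every item, including the cold one, gets a vector from its tags.
item_repr = item_features @ tag_embeddings
print(item_repr[2])
```

The cold item shares tags with warm items, so the embeddings learned from their interactions carry over to it.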
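What follows is the normal training procedure.
```python
from lightfm import LightFM
from lightfm.evaluation import auc_score

# The hyperparameters are not listed in these notes; the values below are
# placeholders, adjust them to your data.
NUM_THREADS = 2
NUM_COMPONENTS = 30
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                  item_features=item_features,
                  epochs=NUM_EPOCHS,
                  num_threads=NUM_THREADS)

test_auc = auc_score(model,
                     test,
                     train_interactions=train,
                     item_features=item_features,
                     num_threads=NUM_THREADS,
                     check_intersections=False).mean()
print('Hybrid test set AUC: %s' % test_auc)
# Hybrid test set AUC: 0.703039
```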
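auc_score returns one AUC value per user, which is why the code above takes `.mean()`. For a single user, AUC is the probability that a randomly chosen positive item is scored above a randomly chosen negative one. A plain-Python sketch of that definition follows. It is not LightFM's implementation; `user_auc` is a hypothetical helper, and counting ties as half a win is a convention chosen here.

```python
def user_auc(scores, positives):
    """AUC for one user.

    scores: list of predicted scores, one per item.
    positives: set of item indices the user interacted with.
    """
    pos = [s for i, s in enumerate(scores) if i in positives]
    neg = [s for i, s in enumerate(scores) if i not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four items; items 0 and 2 are the positives.
auc = user_auc([0.9, 0.8, 0.3, 0.1], positives={0, 2})
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.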
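4. User cold start and user_features
In my project the user-side information is fairly rich, but the interactions are extremely sparse: some users in the training set have no interactions at all, so this can be treated as a user cold-start problem. Following the item cold-start example above, two points need to be settled:
1. Users to be scored at inference time must be represented during training, even if they have no interactions.
2. The official documentation [2] gives an example of assembling the data under "Building datasets", but its item_features has only a single feature, whereas my user_features has several. How to combine several features into user_features turned out to be a big pitfall; I later found the answer in [6].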
```python
import pandas as pd
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import auc_score

feature_columns = ['user_id', 'yewu_id', 'term_brand', 'price_range',
                   'arpu_chrg_last3_avg', 'total_flow']
# Work on copies so the column assignments below do not touch the source frames.
train = train_data[feature_columns].copy()
test = test_data[feature_columns].copy()

# Bin the continuous columns total_flow and arpu_chrg_last3_avg.
total_flow_b = [-1000, 1024, 3072, 5120, 10240, 20480, 30720, 50000, 100000, 30000000]
total_flow_l = [x for x in range(len(total_flow_b) - 1)]
train['total_flow'] = pd.cut(train['total_flow'], bins=total_flow_b)
test['total_flow'] = pd.cut(test['total_flow'], bins=total_flow_b)

arpu_b = [-10000, 30, 60, 80, 100, 120, 150, 200, 400, 100000]
arpu_l = [x for x in range(len(arpu_b) - 1)]
train['arpu_chrg_last3_avg'] = pd.cut(train['arpu_chrg_last3_avg'], bins=arpu_b)
test['arpu_chrg_last3_avg'] = pd.cut(test['arpu_chrg_last3_avg'], bins=arpu_b)

train['price_range'] = train['price_range'].astype(int)
test['price_range'] = test['price_range'].astype(int)

train_test = pd.concat([train, test], axis=0)
pairs = train_test[['user_id', 'yewu_id']].drop_duplicates()
user_features = train_test[['user_id', 'term_brand', 'price_range',
                            'arpu_chrg_last3_avg', 'total_flow']].drop_duplicates()

# Register the users and items of both train and test, so that users
# without training interactions still get an index.
dataset = Dataset()
dataset.fit(users=(x[1]['user_id'] for x in pairs.iterrows()),
            items=(x[1]['yewu_id'] for x in pairs.iterrows()))
(interactions, weights) = dataset.build_interactions(
    ((x[1]['user_id'], x[1]['yewu_id']) for x in train.iterrows()))

# user_features_matrix generation: collect every distinct feature value,
# register the values as feature names, then build the matrix.
user_features_list = list()
for tag in ['term_brand', 'price_range', 'total_flow', 'arpu_chrg_last3_avg']:
    user_features_list += list(user_features[tag].unique())

dataset.fit_partial(users=(x[1]['user_id'] for x in user_features.iterrows()),
                    user_features=user_features_list)
user_features_matrix = dataset.build_user_features(
    [(x[0], list(x[1:]))
     for x in user_features[['user_id', 'term_brand', 'total_flow',
                             'arpu_chrg_last3_avg']].values])

# Model fit and validation
model = LightFM(loss='bpr')
model.fit(interactions, user_features=user_features_matrix)

train_auc = auc_score(model, interactions,
                      user_features=user_features_matrix).mean()
print(train_auc)
# 0.775

(test_interactions, test_weights) = dataset.build_interactions(
    ((x[1]['user_id'], x[1]['yewu_id']) for x in test.iterrows()))
test_auc = auc_score(model, test_interactions,
                     train_interactions=interactions,
                     user_features=user_features_matrix).mean()
print(test_auc)
# 0.774
```
Earlier, building user_features_matrix kept raising an error saying that fit had to be called first. It turned out that the fit call itself was written incorrectly; once that was fixed, everything after it ran fine. An AUC of 0.77 from four features is quite good.
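The part that tends to cause trouble is the shape of the data: build_user_features expects an iterable of (user id, [feature names]) pairs, and every feature name must already have been registered through fit or fit_partial. The sketch below builds both pieces from made-up records in plain Python. Prefixing each value with its column name is an optional convention, not something the code above does; it keeps equal values from different columns from colliding. `to_feature_names` is a hypothetical helper.

```python
FEATURE_COLUMNS = ["term_brand", "price_range", "total_flow"]

def to_feature_names(record, columns=FEATURE_COLUMNS):
    """Turn one user record into a list of 'column:value' feature names."""
    return ["{}:{}".format(col, record[col]) for col in columns]

# Hypothetical user records (one per user).
users = [
    {"user_id": "u1", "term_brand": "A", "price_range": 2, "total_flow": "(1024, 3072]"},
    {"user_id": "u2", "term_brand": "B", "price_range": 2, "total_flow": "(3072, 5120]"},
]

# 1) The vocabulary to register, e.g. via dataset.fit_partial(user_features=...).
all_feature_names = sorted({name for u in users for name in to_feature_names(u)})

# 2) The per-user tuples to pass to dataset.build_user_features(...).
user_feature_tuples = [(u["user_id"], to_feature_names(u)) for u in users]

print(all_feature_names)
print(user_feature_tuples[0])
```

If the vocabulary in step 1 is missing a name that appears in step 2, build_user_features will complain about an unknown feature, which is one way to end up with the "fit first" style of error.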
References:
- [1] LightFM on GitHub
- [2] LightFM documentation
- [3] Recommendation System in Python: LightFM
- [4] Item cold-start
- [5] Handling user/item cold-start
- [6] Error when building user features