
LightFM Recommender Framework: Study Notes

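The first thing to note about LightFM is that it is not simply an implementation of the FM (factorization machine) algorithm. It is a recommender framework that can make use of implicit feedback together with user and item metadata. If user/item side information is left out, it is essentially a collaborative filtering (CF) algorithm, built on the matrix factorization (MF) idea of decomposing the interaction matrix into user and item latent factors.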
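The link to MF can be written out. The block below is a sketch of the scoring function as I understand it from the LightFM paper; the symbols (f_u and f_i for the feature sets of user u and item i, e and b for feature embeddings and biases) are notation chosen for these notes.

```latex
q_u = \sum_{j \in f_u} e^{U}_j, \qquad p_i = \sum_{j \in f_i} e^{I}_j

b_u = \sum_{j \in f_u} b^{U}_j, \qquad b_i = \sum_{j \in f_i} b^{I}_j

\hat{r}_{ui} = f\left( q_u \cdot p_i + b_u + b_i \right)
```

Here f is a link function (a sigmoid in the paper). When every user and item carries only its own indicator feature, f_u = {u} and f_i = {i}, so q_u and p_i are plain per-user and per-item latent vectors and the model reduces to ordinary MF.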

Features

  • Integrates the BPR and WARP ranking losses
  • Multithreaded training
  • Incorporates both item and user metadata, which helps with user/item cold start

Notes on issues encountered

1. Input data

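Judging from the LightFM API, there are three main kinds of input: interactions, user_features, and item_features. These are also the standard inputs of most recommender systems.
Note that the data must be supplied as a csr_matrix or coo_matrix, so some conversion is needed. The reason is that categorical variables become huge and sparse after one-hot encoding, and a sparse format stores only the non-zero entries, which saves memory.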

  • csr_matrix(compressed sparse row matrix)
  • coo_matrix(sparse matrix in coordinate format)
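To make the two formats concrete, here is a minimal pure-Python sketch of what each one stores for a toy matrix. It does not use scipy; the names data, indices, and indptr are chosen to mirror the attributes of scipy's csr_matrix, and the matrix values are made up.

```python
# Toy 3x4 interaction matrix; zeros are the entries a sparse format skips.
dense = [
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]

# COO: one (row, col, value) triple per non-zero entry.
coo = [(r, c, v)
       for r, row in enumerate(dense)
       for c, v in enumerate(row)
       if v != 0]

# CSR: values and column indices stored row by row, plus a row-pointer
# list; row r occupies the slice indptr[r]:indptr[r + 1].
data, indices, indptr = [], [], [0]
for row in dense:
    for c, v in enumerate(row):
        if v != 0:
            data.append(v)
            indices.append(c)
    indptr.append(len(data))

print(coo)
print(data, indices, indptr)
```

COO is convenient for building a matrix from a list of events, while CSR makes it cheap to read one user's row, which is why interaction data is often assembled as COO and then converted.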
    import pandas as pd
    from scipy.sparse import csr_matrix, coo_matrix

    # Pivot the training data into a user x item table, then store it sparsely.
    fm_data = pd.pivot_table(train_data,
                             index="user_id",
                             columns="yewu_id",
                             values='label',
                             aggfunc="sum").fillna(0)
    interactions = csr_matrix(fm_data)
    # interactions
    # <609885x28 sparse matrix of type ''
    # with 673172 stored elements in Compressed Sparse Row format>
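A rough size estimate for the matrix above shows what the sparse format buys. This is back-of-the-envelope arithmetic only, and it assumes float64 values and int32 index arrays; the actual dtypes are not visible in the output, so treat both as assumptions.

```python
# Shape and number of stored elements as reported in the output above.
n_rows, n_cols, nnz = 609885, 28, 673172

VALUE_BYTES = 8  # assumption: float64 values
INDEX_BYTES = 4  # assumption: int32 index arrays

# Dense storage keeps every cell, zero or not.
dense_bytes = n_rows * n_cols * VALUE_BYTES

# CSR keeps one value and one column index per stored element,
# plus one row pointer per row (and one extra at the end).
csr_bytes = (nnz * VALUE_BYTES
             + nnz * INDEX_BYTES
             + (n_rows + 1) * INDEX_BYTES)

print(dense_bytes, csr_bytes, dense_bytes / csr_bytes)
```

With only 28 item columns the saving is modest; it grows quickly once the item or feature dimension reaches the thousands.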

Besides the scipy interface, LightFM ships its own helper for this: lightfm.data.Dataset.

    from lightfm.data import Dataset

    dataset = Dataset()
    dataset.fit((x[1]['user_id'] for x in train.iterrows()),
                (x[1]['yewu_id'] for x in train.iterrows()))

    num_users, num_items = dataset.interactions_shape()
    print('Num users: {}, num_items {}.'.format(num_users, num_items))
    # Num users: 609885, num_items 28.

    (interactions, weights) = dataset.build_interactions(
        ((x[1]['user_id'], x[1]['yewu_id'])
         for x in train.iterrows()))
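Conceptually, what Dataset does here is map raw ids to consecutive integer indices and then emit sparse-matrix entries. The sketch below is a pure-Python illustration of that idea, not Dataset's actual code; build_mapping and the sample ids are hypothetical.

```python
# Hypothetical raw (user_id, item_id) pairs; the ids are made up.
pairs = [("u42", "sms"), ("u7", "data_pack"), ("u42", "data_pack")]


def build_mapping(ids):
    """Assign consecutive integer indices in order of first appearance."""
    mapping = {}
    for raw_id in ids:
        mapping.setdefault(raw_id, len(mapping))
    return mapping


user_index = build_mapping(u for u, _ in pairs)
item_index = build_mapping(i for _, i in pairs)

# (row, col, value) triples, i.e. the contents of a COO interaction matrix.
triples = [(user_index[u], item_index[i], 1.0) for u, i in pairs]

print(user_index, item_index, triples)
```

This also explains why ids must be registered with fit (or fit_partial) before any matrix is built: an id without an index has no row or column to land in.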

2. fit_partial()

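First, the description from the source code:
Unlike fit, repeated call to this method will cause training to resume from the current model state.
"Resume" here means continuing from where training last stopped, i.e. incremental training. Reading the source directly makes this clear as well.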

    # In the source, fit simply delegates to fit_partial
    def fit(
        self,
        interactions,
        user_features=None,
        item_features=None,
        sample_weight=None,
        epochs=1,
        num_threads=1,
        verbose=False,
    ):
        self._reset_state()  # Discard old results, if any
        return self.fit_partial(
            interactions,
            user_features=user_features,
            item_features=item_features,
            sample_weight=sample_weight,
            epochs=epochs,
            num_threads=num_threads,
            verbose=verbose,
        )

As shown, fit just calls fit_partial, but it first calls _reset_state, which discards the previously learned parameters.
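The difference in behavior can be illustrated with a toy class that follows the same reset-then-delegate pattern. ToyModel is a hypothetical stand-in that only counts epochs; it is not LightFM code.

```python
class ToyModel:
    """Hypothetical stand-in for an estimator; it only counts epochs."""

    def __init__(self):
        self._reset_state()

    def _reset_state(self):
        # Forget everything learned so far.
        self.epochs_trained = 0

    def fit_partial(self, epochs=1):
        # Continue from the current state.
        self.epochs_trained += epochs
        return self

    def fit(self, epochs=1):
        # Start over, then train.
        self._reset_state()
        return self.fit_partial(epochs=epochs)


model = ToyModel()
model.fit_partial(epochs=2).fit_partial(epochs=3)
resumed = model.epochs_trained      # state accumulated across both calls

model.fit(epochs=2)
restarted = model.epochs_trained    # earlier progress was discarded

print(resumed, restarted)
```

In practice this is what makes fit_partial useful for evaluating after every epoch or for continuing training when new interactions arrive.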

3. Item cold start

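The official documentation already provides an item cold-start example [4].
There, the test set consists of 10% of the interactions for items that also appear in training, plus items that have no interactions in the training set at all.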

    import numpy as np
    from lightfm.datasets import fetch_stackexchange

    data = fetch_stackexchange('crossvalidated',
                               test_set_fraction=0.1,
                               indicator_features=False,
                               tag_features=True)
    train = data['train']
    test = data['test']

    # .shape reads the dimensions without converting to a dense array.
    train.shape
    test.shape
    # (3213, 72360)

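The training and test sets have the same shape, which means that items without any interaction data are still included in the training matrix. This is the interesting part: even if an item has no interactions at all, it still needs a column in the interaction matrix at training time.
So what does item cold start mean here? According to [5], an ID alone is not enough; what is needed is item-side features.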

    item_features = data['item_features']
    tag_labels = data['item_feature_labels']

    print('There are %s distinct tags, with values like %s.'
          % (item_features.shape[1], tag_labels[:3].tolist()))
    # There are 1246 distinct tags, with values like [u'bayesian', u'prior', u'elicitation'].

    item_features.toarray().shape
    # (72360, 1246)

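As shown, item_features has one row per item, so its first dimension matches the item dimension of the interaction matrix.
What follows is the usual training flow.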
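Why features help a cold item can be sketched in a few lines: the item's representation is assembled from the embeddings of its features, so an item with tags but no interactions still gets a usable vector. The sketch below is pure Python with made-up 2-d embeddings and unit feature weights; tag_embeddings, item_embedding, and score are hypothetical names, and biases are left out.

```python
# Hypothetical learned embeddings for three tags (values are made up).
tag_embeddings = {
    "bayesian": [0.5, 0.1],
    "prior": [0.2, 0.4],
    "regression": [-0.3, 0.6],
}


def item_embedding(tags):
    """Sum the embeddings of an item's tags (every feature weighted 1)."""
    dims = len(next(iter(tag_embeddings.values())))
    total = [0.0] * dims
    for tag in tags:
        for d, value in enumerate(tag_embeddings[tag]):
            total[d] += value
    return total


def score(user_vec, item_vec):
    """Dot product between a user vector and an item vector."""
    return sum(u * i for u, i in zip(user_vec, item_vec))


user_vec = [1.0, 0.0]  # hypothetical user representation

# An item never seen in training, described only by its tags.
cold_item = item_embedding(["bayesian", "prior"])
cold_score = score(user_vec, cold_item)

print(cold_item, cold_score)
```

Because indicator_features=False in the example above, the items there are represented by their tags alone, which is exactly the situation sketched here.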

    from lightfm import LightFM
    from lightfm.evaluation import auc_score

    # NUM_THREADS, NUM_COMPONENTS, NUM_EPOCHS and ITEM_ALPHA are
    # hyperparameters from the example [4]; define them before running this.

    # Define a new model instance
    model = LightFM(loss='warp',
                    item_alpha=ITEM_ALPHA,
                    no_components=NUM_COMPONENTS)

    # Fit the hybrid model. Note that this time, we pass
    # in the item features matrix.
    model = model.fit(train,
                      item_features=item_features,
                      epochs=NUM_EPOCHS,
                      num_threads=NUM_THREADS)

    test_auc = auc_score(model,
                         test,
                         train_interactions=train,
                         item_features=item_features,
                         num_threads=NUM_THREADS,
                         check_intersections=False).mean()
    print('Hybrid test set AUC: %s' % test_auc)
    # Hybrid test set AUC: 0.703039
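For reading the AUC number: per user, it is the probability that a randomly chosen positive item is scored above a randomly chosen negative one, and the reported value is the mean over users. The sketch below computes that for one user in plain Python. user_auc is a hypothetical helper; counting ties as half is my assumption, and unlike auc_score it does not exclude known training interactions.

```python
def user_auc(scores, positives):
    """Fraction of (positive, negative) item pairs ranked correctly.

    scores: predicted score per item index.
    positives: set of item indices the user interacted with.
    Ties count as half (an assumption of this sketch).
    """
    pos = [s for i, s in enumerate(scores) if i in positives]
    neg = [s for i, s in enumerate(scores) if i not in positives]
    if not pos or not neg:
        return None  # undefined without both positives and negatives
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical scores for four items; items 0 and 3 are the positives.
scores = [0.9, 0.2, 0.6, 0.4]
auc = user_auc(scores, positives={0, 3})
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, so 0.70 on cold-start items means the tags carry real signal.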

4. User cold start and user_features

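In my project, user-side information is rich but interactions are extremely sparse. Some users in the training set have no interactions at all, so this can be treated as a user cold-start problem. Following the item cold-start example above, two points need to be settled:
1. Users to be scored at inference time must be represented during training, even if they have no interactions.
2. The official documentation [2] has a data-assembly example under "Building datasets", but it uses only a single item feature, whereas my user_features has several. How to combine multiple features into user_features turned out to be a big pitfall; I eventually found the answer in [6].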

    import pandas as pd
    from lightfm import LightFM
    from lightfm.data import Dataset
    from lightfm.evaluation import auc_score

    feature_columns = ['user_id', 'yewu_id', 'term_brand', 'price_range',
                       'arpu_chrg_last3_avg', 'total_flow']
    # Copy the slices so the column assignments below do not touch the originals.
    train = train_data[feature_columns].copy()
    test = test_data[feature_columns].copy()

    # Bin total_flow and arpu_chrg_last3_avg into intervals.
    total_flow_b = [-1000, 1024, 3072, 5120, 10240, 20480, 30720, 50000, 100000, 30000000]
    total_flow_l = [x for x in range(len(total_flow_b) - 1)]
    train['total_flow'] = pd.cut(train['total_flow'], bins=total_flow_b)
    test['total_flow'] = pd.cut(test['total_flow'], bins=total_flow_b)
    arpu_b = [-10000, 30, 60, 80, 100, 120, 150, 200, 400, 100000]
    arpu_l = [x for x in range(len(arpu_b) - 1)]
    train['arpu_chrg_last3_avg'] = pd.cut(train['arpu_chrg_last3_avg'], bins=arpu_b)
    test['arpu_chrg_last3_avg'] = pd.cut(test['arpu_chrg_last3_avg'], bins=arpu_b)
    train['price_range'] = train['price_range'].astype(int)
    test['price_range'] = test['price_range'].astype(int)

    train_test = pd.concat([train, test], axis=0)
    pairs = train_test[['user_id', 'yewu_id']].drop_duplicates()
    user_features = train_test[['user_id', 'term_brand', 'price_range',
                                'arpu_chrg_last3_avg', 'total_flow']].drop_duplicates()

    # Register users and items from both train and test, so that users
    # without training interactions still get a row.
    dataset = Dataset()
    dataset.fit(users=(x[1]['user_id'] for x in pairs.iterrows()),
                items=(x[1]['yewu_id'] for x in pairs.iterrows()))
    (interactions, weights) = dataset.build_interactions(
        (x[1]['user_id'], x[1]['yewu_id']) for x in train.iterrows())

    # Build the user feature matrix. The feature values must be registered
    # with fit_partial before build_user_features is called.
    user_features_list = list()
    for tag in ['term_brand', 'price_range', 'total_flow', 'arpu_chrg_last3_avg']:
        user_features_list += list(user_features[tag].unique())
    dataset.fit_partial(users=(x[1]['user_id'] for x in user_features.iterrows()),
                        user_features=user_features_list)
    # price_range is registered above but is not among the columns used here.
    user_features_matrix = dataset.build_user_features(
        [(x[0], list(x[1:]))
         for x in user_features[['user_id', 'term_brand', 'total_flow',
                                 'arpu_chrg_last3_avg']].values])

    # Fit and validate the model.
    model = LightFM(loss='bpr')
    model.fit(interactions, user_features=user_features_matrix)

    train_auc = auc_score(model, interactions,
                          user_features=user_features_matrix).mean()
    print(train_auc)
    # 0.775

    (test_interactions, test_weights) = dataset.build_interactions(
        (x[1]['user_id'], x[1]['yewu_id']) for x in test.iterrows())
    test_auc = auc_score(model,
                         test_interactions,
                         train_interactions=interactions,
                         user_features=user_features_matrix).mean()
    print(test_auc)
    # 0.774
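One thing worth watching when several columns are merged into a single feature list is that equal values from different columns collapse into one feature. A possible way around it, sketched below in plain Python, is to prefix every value with its column name. This is my own suggestion rather than something taken from [6]; the sample rows and the to_feature_names helper are hypothetical.

```python
# Hypothetical user rows; column names follow the notes, values are made up.
rows = [
    {"user_id": "u1", "term_brand": "A", "price_range": 3},
    {"user_id": "u2", "term_brand": "B", "price_range": 3},
]
feature_columns = ["term_brand", "price_range"]


def to_feature_names(row):
    """Turn one row into 'column:value' feature names."""
    return ["{}:{}".format(col, row[col]) for col in feature_columns]


# Every distinct feature name; this is what would be registered through
# dataset.fit_partial(user_features=all_features).
all_features = sorted({name for row in rows for name in to_feature_names(row)})

# (user_id, [feature names]) tuples, the shape of input that
# dataset.build_user_features(...) iterates over.
user_feature_tuples = [(row["user_id"], to_feature_names(row)) for row in rows]

print(all_features)
print(user_feature_tuples)
```

With prefixed names, a price_range of 3 can never be confused with a 3 coming from some other column.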

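Earlier, building user_features_matrix kept raising an error saying that fit had to be called first. It turned out that my fit call was written incorrectly; once that was fixed, everything ran fine. An AUC of 0.77 from four features is quite decent.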

References:

  1. GitHub
  2. Documentation
  3. Recommendation System in Python: LightFM
  4. Item cold-start
  5. Handling user item cold-start
  6. Error build user features