Using the woebin function in the Python scorecardpy library


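scorecardpy is a Python library built specifically for scorecard model development. It was written by Dr. Shichen Xie (谢士晨) and is the Python version of the R package scorecard. It is lightweight, has few dependencies, and aims to simplify the development of traditional credit risk scorecard models so that they can be built more efficiently and with less effort.

This article explains how to use woebin, the variable binning function in scorecardpy, and walks through each of its parameters so that you can carry out scorecard modeling quickly. The principles behind binning were covered in the earlier article "sklearn - logistic regression - building a scorecard"; see that article for background.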


Installing scorecardpy

Install it with pip, the Python package manager:

pip install scorecardpy

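Features provided by scorecardpy

To make the scorecard modeling workflow more convenient, scorecardpy wraps the key modeling steps in functions, and each stage calls its own function:

Dataset splitting: split_df splits the dataset into a training set and a test set
Variable filtering: var_filter filters variables by missing rate, IV, identical value rate, and similar criteria
Variable binning: woebin bins the variables, and the binning results can be visualized as charts
Score conversion: scorecard converts the model into scores
Performance evaluation: perf_eva evaluates model performance and perf_psi computes the PSI (Population Stability Index)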

woebin function definition

def woebin(dt, y, x=None,
           var_skip=None, breaks_list=None, special_values=None,
           stop_limit=0.1, count_distr_limit=0.05, bin_num_limit=8,
           # min_perc_fine_bin=0.02, min_perc_coarse_bin=0.05, max_num_bin=8,
           positive="bad|1", no_cores=None, print_step=0, method="tree",
           ignore_const_cols=True, ignore_datetime_cols=True,
           check_cate_num=True, replace_blank=True,
           save_breaks_list=None, **kwargs):
    pass

woebin bins features and prepares them for the weight of evidence (WOE) transformation. It searches for the best binning of each variable using either tree-like binning or chi-square (ChiMerge) binning.

By default WOE is computed as ln(Distr_Bad_i/Distr_Good_i). To get ln(Distr_Good_i/Distr_Bad_i) instead, set the positive argument to the opposite value, for example 0 or 'good'.
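The two WOE directions can be illustrated with a small sketch. It is not scorecardpy code: the function name woe_iv and the per-bin counts are hypothetical, and it assumes every bin contains at least one good and one bad row so that the logarithm is defined.

```python
import math

def woe_iv(bins, positive_is_bad=True):
    """Compute WOE per bin and the total IV from (good, bad) counts per bin."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        distr_good = good / total_good
        distr_bad = bad / total_bad
        if positive_is_bad:
            # default direction: ln(Distr_Bad_i / Distr_Good_i)
            woe = math.log(distr_bad / distr_good)
            iv += (distr_bad - distr_good) * woe
        else:
            # opposite direction: ln(Distr_Good_i / Distr_Bad_i)
            woe = math.log(distr_good / distr_bad)
            iv += (distr_good - distr_bad) * woe
        woes.append(woe)
    return woes, iv

# hypothetical (good, bad) counts for three bins
bins = [(400, 20), (300, 30), (100, 50)]
woe_bad, iv_bad = woe_iv(bins, positive_is_bad=True)
woe_good, iv_good = woe_iv(bins, positive_is_bad=False)
```

Switching the direction flips the sign of every WOE value, while the IV stays the same because both factors of each term change sign.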

woebin function parameters

Parameter 1: dt

A pandas.DataFrame that contains both the features to be binned and the label variable y.

Parameter 2: y

The name of the label column.

The y column of dt must not contain missing values. If any are found, the corresponding rows are dropped and a warning beginning with "There are NaNs in '{}' column" is issued:

# remove na in y
if dat[y].isnull().any():
    warnings.warn("There are NaNs in '{}' column. The rows with NaN in '{}' were removed from dat.".format(y, y))
    dat = dat.dropna(subset=[y])
    # dat = dat[pd.notna(dat[y])]

check_y: validating the label values

check_y() checks the y values of the input data to make sure they are valid. Its arguments are dat, y, and positive.

Code walkthrough

# imports used by this excerpt; str_to_list is a scorecardpy helper
import re
import warnings
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_y(dat, y, positive):
    """
    :param dat: dataset, a pd.DataFrame
    :param y: name of the label column, a string or a one-element list
    :param positive: the positive class, a string
    :return: the dataset dat
    """
    positive = str(positive)
    # dat must be a pd.DataFrame with at least two columns (one feature, one label)
    if not isinstance(dat, pd.DataFrame):
        raise Exception("Incorrect inputs; dat should be a DataFrame.")
    elif dat.shape[1] <= 1:
        raise Exception("Incorrect inputs; dat should be a DataFrame with at least two columns.")

    # a string y is converted to a one-element list here
    y = str_to_list(y)
    # y must contain exactly one element: a dataset has a single label column
    if len(y) != 1:
        raise Exception("Incorrect inputs; the length of y should be one")

    y = y[0]
    # the label column must exist in the dataset
    if y not in dat.columns:
        raise Exception("Incorrect inputs; there is no '{}' column in dat.".format(y))

    # rows whose label is missing are dropped and take no part in binning
    if dat[y].isnull().any():
        warnings.warn("There are NaNs in '{}' column. The rows with NaN in '{}' were removed from dat.".format(y, y))
        dat = dat.dropna(subset=[y])

    # convert a numeric y column to int
    if is_numeric_dtype(dat[y]):
        dat.loc[:, y] = dat[y].apply(lambda x: x if pd.isnull(x) else int(x))  # dat[y].astype(int)

    # y must take exactly two distinct values, otherwise an exception is raised
    unique_y = np.unique(dat[y].values)
    if len(unique_y) == 2:
        # unique_y holds the distinct values of y; positive must match at least one of them
        # re.search() takes the regular expression first and the string to search second
        # the default positive is 'bad|1', which searches the string for "bad" or "1"
        # e.g. with unique_y = [0, 1] and the default positive, this checks whether True is in [False, True]
        if True in [bool(re.search(positive, str(v))) for v in unique_y]:
            y1 = dat[y]
            # re.split() takes the separator pattern first and the string second, and returns a list
            # '|' is a special character meaning "or", so it is escaped with '\'
            # the lambda takes one value of dat[y] and returns 1 if it appears in the split positive list, else 0
            y2 = dat[y].apply(lambda x: 1 if str(x) in re.split(r'\|', positive) else 0)
            # element-wise check of whether y1 and y2 differ
            # .any() returns True if at least one element differs, False if all are equal
            if (y1 != y2).any():
                # if any differ, assign y2 to the dat[y] column
                # .loc takes a row selector and a column selector: ':' selects all rows, y selects the column named y
                dat.loc[:, y] = y2  # dat[y] = y2
                warnings.warn("The positive value in \"{}\" was replaced by 1 and negative value by 0.".format(y))
        else:
            raise Exception("Incorrect inputs; the positive value in \"{}\" is not specified".format(y))
    else:
        raise Exception("Incorrect inputs; the length of unique values in y column '{}' != 2.".format(y))

    return dat
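The label mapping that check_y performs can be tried without pandas. The sketch below mirrors the two regular expression steps described in the code comments (re.search to confirm that positive matches one of the two values, re.split to build the list of positive spellings). The function name map_labels and the sample labels are hypothetical.

```python
import re

def map_labels(values, positive="bad|1"):
    """Map a two-valued label list to 1 (positive) and 0 (negative)."""
    unique = sorted(set(str(v) for v in values))
    if len(unique) != 2:
        raise ValueError("y must have exactly two distinct values")
    # positive is searched as a regular expression in each distinct value
    if not any(re.search(positive, u) for u in unique):
        raise ValueError("the positive value is not present in y")
    # the alternatives separated by '|' are then compared by exact match
    positives = re.split(r"\|", positive)
    return [1 if str(v) in positives else 0 for v in values]

labels = map_labels(["good", "bad", "good", "bad"])
flipped = map_labels(["good", "bad", "good", "bad"], positive="good|0")
numeric = map_labels([0, 1, 1])
```

With the default, "bad" becomes 1; passing positive="good|0" encodes "good" as 1 instead, which is what reverses the WOE direction.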

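Parameter 3: x

The list of features to bin. The default is None, in which case every column of dt except y is binned.

Parameter 4: var_skip

The list of features to exclude from binning. A string is converted automatically to a one-element list before the named features are excluded.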

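Parameter 5: breaks_list

User-defined bin boundaries, None by default. When it is supplied, binning follows the given boundaries instead of searching for them. The string that bins_to_breaks writes out (see save_breaks_list later in this article) has the form breaks_list={'variable': [...]}, that is, a dict keyed by variable name with one list of break points per variable. For numeric variables the bins are left-closed and right-open (the ChiMerge code later in this article parses bin labels that start with "["), and -inf and inf serve as the outer edges, so the breaks [0,10,20,30] give up to five bins: [-inf,0), [0,10), [10,20), [20,30), [30,inf).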
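A minimal sketch of assigning values to intervals from a list of break points. It assumes the left-closed, right-open convention with -inf and inf as outer edges, which is what the "["-prefixed bin labels parsed by the ChiMerge code later in this article suggest. The function name assign_bins and the sample values are hypothetical, and this is not how scorecardpy itself is implemented.

```python
from bisect import bisect_right

def assign_bins(values, breaks):
    """Label each value with the left-closed, right-open interval it falls into."""
    edges = [float("-inf")] + sorted(breaks) + [float("inf")]
    labels = []
    for v in values:
        # index of the last edge that is <= v, capped so that +inf stays in the last bin
        i = min(bisect_right(edges, v) - 1, len(edges) - 2)
        labels.append("[{},{})".format(edges[i], edges[i + 1]))
    return labels

binned = assign_bins([-5, 0, 9.9, 10, 25, 30, 120], [0, 10, 20, 30])
```

A value equal to a break point, such as 10, lands in the bin that starts at that break point.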

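Parameter 6: special_values

The list of special values, None by default. When it is supplied, rows whose value appears in the list are binned separately from the regular bins.

For example, with [-90,-9999], rows whose value is -90 or -9999 go into special bins rather than into the regular numeric intervals.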
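The separation step can be sketched as follows. The function name split_special and the sample values are hypothetical. The sketch keeps a count per special value and makes no assumption about whether the library finally reports one bin per value or a shared one.

```python
def split_special(values, special_values):
    """Separate rows holding special values from rows to be binned normally."""
    special_counts = {}
    regular = []
    for v in values:
        if v in special_values:
            special_counts[v] = special_counts.get(v, 0) + 1
        else:
            regular.append(v)
    return special_counts, regular

special_counts, regular = split_special([35, -90, 48, -9999, -90, 61], [-90, -9999])
```

Only the values left in regular would then take part in the search for break points.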

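Parameter 7: count_distr_limit

The minimum share of the total sample that each bin must hold. The default is 0.05, that is, at least 5% of the rows per bin.

This keeps the binning result reasonable and usable, especially on imbalanced datasets or when complexity has to be tightly controlled.

Requiring a minimum number of rows per bin lowers the risk of overfitting and improves how well the model generalizes to new data.

Note that the effect of this parameter also depends on other parameters, such as the binning method and the bin count limit, so it should be set together with them.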
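The check behind count_distr_limit amounts to comparing each bin's share of the rows with the limit. A minimal sketch, with a hypothetical function name and hypothetical bin counts:

```python
def small_bins(bin_counts, count_distr_limit=0.05):
    """Return the labels of bins whose share of all rows is below the limit."""
    total = sum(bin_counts.values())
    return [label for label, n in bin_counts.items() if n / total < count_distr_limit]

# hypothetical row counts per bin (1000 rows in total)
counts = {"[-inf,20)": 30, "[20,40)": 520, "[40,60)": 410, "[60,inf)": 40}
too_small = small_bins(counts)
```

Bins flagged this way are candidates for merging with a neighbour.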

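Parameter 8: stop_limit

Controls when binning stops. Binning stops once the relevant statistic (the IV growth rate or the chi-square value) falls below the threshold derived from stop_limit. The valid range is 0 to 0.5 and the default is 0.1.

Its main purpose is to decide when to stop splitting further, which avoids overfitting and unnecessary bins.

With the default of 0.1, binning continues only while the statistic still improves noticeably. Adjusting stop_limit lets you trade the number of bins against model complexity.

Note that stop_limit is only one of the stopping conditions. It works together with count_distr_limit and bin_num_limit, so set them jointly to get binning that is both effective and efficient.

IV growth rate

With method="tree", stop_limit is a threshold on the IV gain ratio: binning stops when the growth rate of the total IV between two consecutive splits falls below it. In the woebin2_tree code quoted later in this article this ratio is IVchg = IVt2 / IVt1 - 1.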
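The stopping rule can be traced on a list of total IV values. The sketch below follows the loop of woebin2_tree quoted later in this article (gain ratio IVt2 / IVt1 - 1, starting from 1e-10, with the gain checked before the next split is attempted). The function name and the IV values are hypothetical.

```python
def splits_to_keep(total_iv_by_step, stop_limit=0.1, bin_num_limit=8):
    """Count how many breakpoints are added before the stopping rule fires.

    total_iv_by_step[k] is the total IV after adding the (k+1)-th breakpoint.
    """
    iv_prev = 1e-10          # tiny starting value, as in the quoted loop
    accepted = 0
    for iv_now in total_iv_by_step:
        if accepted + 2 > bin_num_limit:   # k breakpoints produce k + 1 bins
            break
        accepted += 1
        gain = iv_now / iv_prev - 1        # IV gain ratio
        iv_prev = iv_now
        if gain < stop_limit:              # checked only after the split was added
            break
    return accepted

kept = splits_to_keep([0.20, 0.30, 0.36, 0.37, 0.375])
capped = splits_to_keep([0.1 * 1.5 ** k for k in range(20)], bin_num_limit=8)
```

In the first call the fourth breakpoint raises the IV by less than 3%, so the search ends there. Because the gain is tested at the top of the next iteration, that fourth breakpoint is still kept. In the second call the gain never drops, and bin_num_limit=8 caps the result at seven breakpoints.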

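Chi-square value

With method="chimerge", stop_limit is used as a significance level rather than compared with the statistic directly: the woebin2_chimerge code quoted later in this article computes the critical value chdtri(1, stop_limit) (chi-square distribution with 1 degree of freedom) and keeps merging adjacent bins while the smallest chi-square value is below that critical value.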
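To make the chi-square criterion concrete, the sketch below computes the critical value for 1 degree of freedom with the standard library only, and the Pearson chi-square statistic for two adjacent bins. For 1 degree of freedom the survival function equals erfc(sqrt(x/2)), and the critical value is found by bisection; this corresponds to what scipy.special.chdtri(1, p) returns. All function names and counts are hypothetical.

```python
import math

def chi2_sf_df1(x):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_limit_df1(p, lo=0.0, hi=100.0):
    """Find x such that chi2_sf_df1(x) == p, by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_df1(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def chi2_adjacent(good_a, bad_a, good_b, bad_b):
    """Pearson chi-square statistic of the 2x2 table formed by two adjacent bins."""
    rows = [(good_a, bad_a), (good_b, bad_b)]
    total = good_a + bad_a + good_b + bad_b
    col_sums = (good_a + good_b, bad_a + bad_b)
    stat = 0.0
    for row in rows:
        row_sum = sum(row)
        for observed, col_sum in zip(row, col_sums):
            expected = row_sum * col_sum / total
            stat += (observed - expected) ** 2 / expected
    return stat

limit = chi2_limit_df1(0.1)                 # critical value for stop_limit = 0.1
similar = chi2_adjacent(90, 10, 88, 12)     # nearly identical bad rates
different = chi2_adjacent(90, 10, 60, 40)   # clearly different bad rates
```

Two bins with nearly identical bad rates give a statistic below the critical value and would be merged; two bins with clearly different bad rates stay separate.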

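Parameter 9: bin_num_limit

An integer: the maximum number of bins. The default is 8.

It caps the number of bins the binning algorithm may produce, which prevents over-binning and the model complexity or overfitting that comes with it.

When woebin bins a variable, it respects this limit and looks for the best binning that does not exceed bin_num_limit bins.

A small bin_num_limit pushes the algorithm toward fewer bins that each hold more rows, which can simplify the model and lower the risk of overfitting.
A large bin_num_limit gives the algorithm more freedom to create more bins that each hold fewer rows, which can make the model finer grained but also more complex and more prone to overfitting.

Choose the value according to the dataset and the modeling needs. A large dataset with complex variable distributions may need more bins to capture the detail, while a small dataset or simple distributions may need only a few.

bin_num_limit works together with stop_limit and count_distr_limit to control the binning process and its result.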

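Parameter 10: positive

Specifies which label value is the positive class. check_y uses it to validate the y column of dt: if neither of the two label values matches it, an exception is raised and modeling stops.

The default is "bad|1".

In credit scorecard modeling the positive class is the event the model is meant to detect and predict. With the default it is the "bad" customer, for example one who defaults on a loan. check_y encodes it as 1 and the other value as 0.

Setting the positive parameter

positive is a regular expression in which alternative spellings are separated by "|". It matters because it decides which class counts as bad and which as good, and therefore the direction of WOE and the good/bad figures behind indicators such as the information value (IV).

If the target is binary with text labels (for example "good" and "bad"), positive should match the label of the class to be encoded as 1, which is "bad" by default.
The target must take exactly two distinct values. check_y raises an exception otherwise, so a target with more classes has to be reduced to two first.
If the target uses a different encoding (for example 1 for "good" and 0 for "bad"), set positive to the value that marks the class you want treated as positive, here "0".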

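Parameter 11: no_cores

The number of CPU cores used in parallel. The default is None, in which case the number depends on how many x features there are: fewer than 10 features use 1 core, and 10 or more use all CPU cores.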
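The default rule described above fits in a few lines. This is a sketch of the rule as stated in this article, not the library's code, and the function name default_cores is hypothetical.

```python
import os

def default_cores(n_features, no_cores=None):
    """Pick the number of worker processes following the rule described above."""
    if no_cores is not None:
        return no_cores              # an explicit value always wins
    if n_features < 10:
        return 1                     # few features: parallel overhead is not worth it
    return os.cpu_count() or 1       # os.cpu_count() can return None

few = default_cores(3)
explicit = default_cores(25, no_cores=2)
many = default_cores(25)
```

For a handful of features the cost of starting worker processes usually outweighs the gain, which is a plausible reason for the single-core default.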

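Parameter 12: print_step

Controls how much progress information is printed while binning runs.

In the signature shown above the default is 0 (False behaves the same way).

Values in detail

When print_step = 0 or False:
- No progress messages are printed.
- This suits users who only care about the final result.
When print_step > 0 or True:
- According to the function's docstring, the name of the variable being processed is printed at every print_step-th iteration, so print_step=1 reports every variable and larger values report less often.
- Also according to the docstring, nothing is printed when more than one core is used (no_cores > 1).
- This is useful for following progress and for spotting the variable on which binning is slow or fails.

Recommendations

When using woebin for the first time or binning new data, set print_step to a value greater than 0 so that you can follow which variable is being processed.
Once you are familiar with the process and only care about the final result, set print_step to 0 or False to cut unnecessary output.
Otherwise, choose whatever suits your situation.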

Parameter 13: method

The binning method. The default is tree.

Supported binning methods

1. tree (tree-like binning):

Tree-like binning is based on the decision tree idea. It splits the data recursively to produce the best binning.
Its advantage is that it handles both continuous and categorical variables and usually yields fairly balanced bins.
Its drawback is a higher computational cost, which can mean long run times, especially on large datasets.

def woebin2_tree(dtm, init_count_distr=0.02, count_distr_limit=0.05,
                 stop_limit=0.1, bin_num_limit=8, breaks=None, spl_val=None):
    # initial binning
    bin_list = woebin2_init_bin(dtm, init_count_distr=init_count_distr, breaks=breaks, spl_val=spl_val)
    initial_binning = bin_list['initial_binning']
    binning_sv = bin_list['binning_sv']
    if len(initial_binning.index) == 1:
        return {'binning_sv': binning_sv, 'binning': initial_binning}
    # initialize parameters
    len_brks = len(initial_binning.index)
    bestbreaks = None
    IVt1 = IVt2 = 1e-10
    IVchg = 1  ## IV gain ratio
    step_num = 1
    # best breaks from three to n+1 bins
    binning_tree = None
    while (IVchg >= stop_limit) and (step_num + 1 <= min([bin_num_limit, len_brks])):
        binning_tree = woebin2_tree_add_1brkp(dtm, initial_binning, count_distr_limit, bestbreaks)
        # best breaks
        bestbreaks = binning_tree.loc[lambda x: x['bstbrkp'] != float('-inf'), 'bstbrkp'].tolist()
        # information value
        IVt2 = binning_tree['total_iv'].tolist()[0]
        IVchg = IVt2 / IVt1 - 1  ## ratio gain
        IVt1 = IVt2
        # step_num
        step_num = step_num + 1
    if binning_tree is None:
        binning_tree = initial_binning
    # return
    return {'binning_sv': binning_sv, 'binning': binning_tree}
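The loop above adds one breakpoint per iteration. A self-contained toy version of that greedy search is sketched below. It is a simplification under stated assumptions: candidate cut points are given up front, each step picks the candidate that maximizes the total IV, empty cells are replaced by 0.5 to keep the logarithm finite, and count_distr_limit is ignored. The function names and the data are hypothetical.

```python
import math
from bisect import bisect_right

def total_iv(rows, breaks):
    """Total IV of (value, label) rows binned by breaks; label 1 = bad, 0 = good."""
    edges = sorted(breaks)
    good = [0] * (len(edges) + 1)
    bad = [0] * (len(edges) + 1)
    for value, label in rows:
        i = bisect_right(edges, value)      # left-closed, right-open bins
        if label == 1:
            bad[i] += 1
        else:
            good[i] += 1
    n_good, n_bad = sum(good), sum(bad)
    iv = 0.0
    for g, b in zip(good, bad):
        dg = max(g, 0.5) / n_good           # 0.5 stands in for empty cells (assumption)
        db = max(b, 0.5) / n_bad
        iv += (db - dg) * math.log(db / dg)
    return iv

def greedy_breaks(rows, candidates, stop_limit=0.1, bin_num_limit=8):
    """Add one breakpoint at a time while the IV gain ratio stays above stop_limit."""
    chosen, iv_prev, gain = [], 1e-10, 1.0
    while gain >= stop_limit and len(chosen) + 2 <= bin_num_limit:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda c: total_iv(rows, chosen + [c]))
        chosen.append(best)
        iv_now = total_iv(rows, chosen)
        gain = iv_now / iv_prev - 1
        iv_prev = iv_now
    return chosen

# hypothetical data: 50% bad rate below age 30, 10% bad rate from 30 upward
rows = [(age, 1 if (age < 30 and k < 5) or (age >= 30 and k < 1) else 0)
        for age in range(20, 60) for k in range(10)]
found = greedy_breaks(rows, candidates=[25, 30, 35, 40, 50])
```

The first breakpoint chosen is 30, where the bad rate changes. A second breakpoint is still added before the loop notices that the IV no longer grows, which matches the order of operations in the quoted loop.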

2. chimerge (chi-square binning):

Chi-square binning is based on the chi-square statistic. It merges adjacent intervals to reduce their number until a stopping condition is met.
Its advantage is that it handles continuous variables and its bins usually show good monotonicity.
Its drawback is that it may not handle categorical variables, and a bin count or stopping condition has to be specified.

References for the chi-square algorithm:

ChiMerge算法详解：数据离散化与应用 (CSDN blog; "ChiMerge explained: data discretization and applications")

ChiMerge: Discretization of Numeric Attributes

def woebin2_chimerge(dtm, init_count_distr=0.02, count_distr_limit=0.05,
                     stop_limit=0.1, bin_num_limit=8, breaks=None, spl_val=None):
    # chisq = function(a11, a12, a21, a22) {
    #   A = list(a1 = c(a11, a12), a2 = c(a21, a22))
    #   Adf = do.call(rbind, A)
    #
    #   Edf =
    #     matrix(rowSums(Adf), ncol = 1) %*%
    #     matrix(colSums(Adf), nrow = 1) /
    #     sum(Adf)
    #
    #   sum((Adf-Edf)^2/Edf)
    # }
    # function to create a chisq column in initial_binning
    def add_chisq(initial_binning):
        chisq_df = pd.melt(initial_binning,
          id_vars=["brkp", "variable", "bin"], value_vars=["good", "bad"],
          var_name='goodbad', value_name='a')\
        .sort_values(by=['goodbad', 'brkp']).reset_index(drop=True)
        ###
        chisq_df['a_lag'] = chisq_df.groupby('goodbad')['a'].apply(lambda x: x.shift(1))#.reset_index(drop=True)
        chisq_df['a_rowsum'] = chisq_df.groupby('brkp')['a'].transform(lambda x: sum(x))#.reset_index(drop=True)
        chisq_df['a_lag_rowsum'] = chisq_df.groupby('brkp')['a_lag'].transform(lambda x: sum(x))#.reset_index(drop=True)
        ###
        chisq_df = pd.merge(
          chisq_df.assign(a_colsum = lambda df: df.a+df.a_lag),
          chisq_df.groupby('brkp').apply(lambda df: sum(df.a+df.a_lag)).reset_index(name='a_sum'))\
        .assign(
          e = lambda df: df.a_rowsum*df.a_colsum/df.a_sum,
          e_lag = lambda df: df.a_lag_rowsum*df.a_colsum/df.a_sum
        ).assign(
          ae = lambda df: (df.a-df.e)**2/df.e + (df.a_lag-df.e_lag)**2/df.e_lag
        ).groupby('brkp').apply(lambda x: sum(x.ae)).reset_index(name='chisq')
        # return
        return pd.merge(initial_binning.assign(count = lambda x: x['good']+x['bad']), chisq_df, how='left')

    # initial binning
    bin_list = woebin2_init_bin(dtm, init_count_distr=init_count_distr, breaks=breaks, spl_val=spl_val)
    initial_binning = bin_list['initial_binning']
    binning_sv = bin_list['binning_sv']
    # return initial binning if its row number equals 1
    if len(initial_binning.index)==1:
        return {'binning_sv':binning_sv, 'binning':initial_binning}

    # dtm_rows
    dtm_rows = len(dtm.index)
    # chisq limit
    from scipy.special import chdtri
    chisq_limit = chdtri(1, stop_limit)
    # binning with chisq column
    binning_chisq = add_chisq(initial_binning)

    # param
    bin_chisq_min = binning_chisq.chisq.min()
    bin_count_distr_min = min(binning_chisq['count']/dtm_rows)
    bin_nrow = len(binning_chisq.index)
    # remove brkp if chisq < chisq_limit
    while bin_chisq_min < chisq_limit or bin_count_distr_min < count_distr_limit or bin_nrow > bin_num_limit:
        # brkp needs to be removed
        if bin_chisq_min < chisq_limit:
            rm_brkp = binning_chisq.assign(merge_tolead = False).sort_values(by=['chisq', 'count']).iloc[0,]
        elif bin_count_distr_min < count_distr_limit:
            # the start of this statement is truncated in the source text; count_distr and
            # chisq_lead are reconstructed from how the following lines use them
            rm_brkp = binning_chisq.assign(
              count_distr = lambda x: x['count']/sum(x['count']),
              chisq_lead = lambda x: x['chisq'].shift(-1).fillna(float('inf'))
            ).assign(merge_tolead = lambda x: x['chisq'] > x['chisq_lead'])
            # replace merge_tolead as True
            rm_brkp.loc[np.isnan(rm_brkp['chisq']), 'merge_tolead']=True
            # order select 1st
            rm_brkp = rm_brkp.sort_values(by=['count_distr']).iloc[0,]
        elif bin_nrow > bin_num_limit:
            rm_brkp = binning_chisq.assign(merge_tolead = False).sort_values(by=['chisq', 'count']).iloc[0,]
        else:
            break
        # set brkp to lead's or lag's
        shift_period = -1 if rm_brkp['merge_tolead'] else 1
        binning_chisq = binning_chisq.assign(brkp2 = lambda x: x['brkp'].shift(shift_period))\
        .assign(brkp = lambda x:np.where(x['brkp'] == rm_brkp['brkp'], x['brkp2'], x['brkp']))
        # groupby brkp
        binning_chisq = binning_chisq.groupby('brkp').agg({
          'variable':lambda x:np.unique(x),
          'bin': lambda x: '%,%'.join(x),
          'good': sum,
          'bad': sum
        }).assign(badprob = lambda x: x['bad']/(x['good']+x['bad']))\
        .reset_index()
        # update
        ## add chisq to new binning dataframe
        binning_chisq = add_chisq(binning_chisq)
        ## param
        bin_nrow = len(binning_chisq.index)
        if bin_nrow == 1:
            break
        bin_chisq_min = binning_chisq.chisq.min()
        bin_count_distr_min = min(binning_chisq['count']/dtm_rows)

    # format init_bin # remove (.+\\)%,%\\[.+,)
    if is_numeric_dtype(dtm['value']):
        binning_chisq = binning_chisq\
        .assign(bin = lambda x: [re.sub(r'(?<=,).+%,%.+,', '', i) if ('%,%' in i) else i for i in x['bin']])\
        .assign(brkp = lambda x: [float(re.match(r'^\[(.*),.+', i).group(1)) for i in x['bin']])
    # return
    return {'binning_sv':binning_sv, 'binning':binning_chisq}
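The core of the merging loop can be shown on plain (good, bad) counts. The sketch below is a simplified ChiMerge, not the library's implementation: it merges the adjacent pair with the smallest chi-square value until every pair reaches the critical value and the bin count fits, and it leaves out the count_distr_limit condition. The default threshold 2.706 is the approximate chi-square critical value for 1 degree of freedom at stop_limit = 0.1. Function names and counts are hypothetical.

```python
def chi2_pair(a, b):
    """Pearson chi-square for two adjacent bins given as (good, bad) tuples."""
    total = sum(a) + sum(b)
    col_sums = (a[0] + b[0], a[1] + b[1])
    stat = 0.0
    for row in (a, b):
        for observed, col_sum in zip(row, col_sums):
            expected = sum(row) * col_sum / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def chimerge(bins, chisq_limit=2.706, bin_num_limit=8):
    """Merge adjacent bins until all neighbours differ enough and the count fits."""
    bins = list(bins)
    while len(bins) > 1:
        stats = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)   # most similar pair
        if stats[i] >= chisq_limit and len(bins) <= bin_num_limit:
            break
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins

# hypothetical (good, bad) counts of five fine bins, ordered by the variable's value
fine_bins = [(90, 10), (88, 12), (60, 40), (62, 38), (30, 70)]
merged_bins = chimerge(fine_bins)
```

The two pairs with nearly equal bad rates are merged, leaving three bins whose bad rates rise from about 11% to 39% to 70%.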

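Recommendations

Choose the binning method according to the data type (continuous or categorical), the characteristics of the data (distribution, missing values, and so on), and the business requirements.

If the data contains many continuous variables and you want bins with good monotonicity, consider chi-square binning (chimerge).
If the data contains many categorical variables, or you want the binning process to handle different data types automatically, consider tree-like binning (tree).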

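Parameter 14: ignore_const_cols

Controls whether constant columns are left out of binning. Type: bool.

Default: True, that is, constant columns are skipped.

A constant column is one in which every row of the dataset holds the same value. Such columns usually carry no useful information for modeling, so skipping them avoids unnecessary computation and speeds up binning.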
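Detecting constant columns only requires counting distinct values. A minimal sketch on a plain dict of lists, with a hypothetical function name and table:

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) <= 1]

# hypothetical dataset: column name -> list of values
table = {"age": [23, 41, 35], "country": ["CN", "CN", "CN"], "y": [0, 1, 0]}
const_cols = constant_columns(table)
```

A column with a single distinct value cannot separate good from bad rows, so every possible binning of it has an IV of zero.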

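Parameter 15: ignore_datetime_cols

Controls whether datetime columns are left out of binning. Type: bool.

Default: True, that is, datetime columns are skipped.

Datetime columns hold timestamps, dates, or times. This information can be useful for some analyses, but it may need special handling before binning.

If datetime columns are to be binned (ignore_datetime_cols = False), they may need extra preprocessing first, such as converting them to numeric features or extracting specific components (year, month, day, and so on).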
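The preprocessing mentioned above can look like the following sketch, which turns one timestamp into numeric features. The function name, the chosen components, and the reference date are hypothetical choices for illustration.

```python
from datetime import datetime

def datetime_features(stamp, reference):
    """Turn one timestamp into numeric features that can be binned."""
    return {
        "year": stamp.year,
        "month": stamp.month,
        "weekday": stamp.weekday(),                  # Monday is 0
        "days_since_ref": (stamp - reference).days,  # elapsed days since a fixed date
    }

features = datetime_features(datetime(2024, 3, 15), datetime(2024, 1, 1))
```

Each resulting number can be binned like any other numeric variable, whereas the raw timestamp would be skipped by default.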

Parameter 16: check_cate_num

Controls whether categorical (non-numeric) variables are checked for too many distinct values before binning. Type: bool.

Default: True

If a categorical variable has 50 or more distinct values, a prompt is shown and the user decides whether to continue binning. When the sample contains many categorical variables that each have a large number of distinct values, modeling becomes considerably slower.

def check_cateCols_uniqueValues(dat, var_skip = None):
    # character columns with too many unique values
    char_cols = [i for i in list(dat) if not is_numeric_dtype(dat[i])]
    if var_skip is not None:
        char_cols = list(set(char_cols) - set(str_to_list(var_skip)))
    char_cols_too_many_unique = [i for i in char_cols if len(dat[i].unique()) >= 50]

    if len(char_cols_too_many_unique) > 0:
        print('>>> There are {} variables have too many unique non-numberic values, which might cause the binning process slow. Please double check the following variables: \n{}'.format(len(char_cols_too_many_unique), ', '.join(char_cols_too_many_unique)))
        print('>>> Continue the binning process?')
        print('1: yes \n2: no')
        # keep asking until the user enters 1 or 2
        cont = int(input("Selection: "))
        while cont not in [1, 2]:
            cont = int(input("Selection: "))
        # choosing "no" stops the program
        if cont == 2:
            raise SystemExit(0)

    return None

Parameter 17: replace_blank

Controls whether blank strings (empty or whitespace-only) in the data are replaced with NaN. Type: bool.

Default: True

def rep_blank_na(dat): # cant replace blank string in categorical value with nan
    # reset the index if it contains duplicates
    if dat.index.duplicated().any():
        dat = dat.reset_index(drop = True)
        warnings.warn('There are duplicated index in dataset. The index has been reseted.')

    # columns that contain at least one empty or whitespace-only string
    blank_cols = [i for i in list(dat) if dat[i].astype(str).str.findall(r'^\s*$').apply(lambda x:0 if len(x)==0 else 1).sum()>0]
    if len(blank_cols) > 0:
        warnings.warn('There are blank strings in {} columns, which are replaced with NaN. \n (ColumnNames: {})'.format(len(blank_cols), ', '.join(blank_cols)))
        # dat[dat == [' ','']] = np.nan
        # dat2 = dat.apply(lambda x: x.str.strip()).replace(r'^\s*$', np.nan, regex=True)
        # assign the result back: DataFrame.replace returns a new object
        # (the line as quoted in the source text calls dat.replace(...) without assigning it)
        dat = dat.replace(r'^\s*$', np.nan, regex=True)

    return dat
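The regular expression ^\s*$ used above matches strings that are empty or consist only of whitespace. The sketch below applies the same pattern to a plain dict of lists, with None standing in for NaN. The function name and the table are hypothetical.

```python
import re

BLANK = re.compile(r"^\s*$")

def blank_to_none(table):
    """Replace empty or whitespace-only strings with None; report affected columns."""
    cleaned, blank_cols = {}, []
    for name, values in table.items():
        new_values = [None if isinstance(v, str) and BLANK.match(v) else v
                      for v in values]
        if new_values != values:
            blank_cols.append(name)
        cleaned[name] = new_values
    return cleaned, blank_cols

cleaned, blank_cols = blank_to_none({"job": ["clerk", "  ", ""], "age": [30, 41, 28]})
```

Treating blanks as missing keeps them from forming a spurious category of their own during binning.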

Parameter 18: save_breaks_list

Controls whether the break points found by binning (the bin boundaries) are saved as a list. This can matter for later binning runs, model training, or result analysis, because the break points determine exactly how each variable is binned.

Default: None

If the argument is not None and a string is passed, the string is used as the file name prefix and the break points are written to a file named {save_string}_{timestamp}.py. This is implemented by the bins_to_breaks function:

def bins_to_breaks(bins, dt, to_string=False, save_string=None):
    if isinstance(bins, dict):
        bins = pd.concat(bins, ignore_index=True)

    # x variables
    xs_all = bins['variable'].unique()
    # dtypes of variables
    vars_class = pd.DataFrame({
      'variable': xs_all,
      'not_numeric': [not is_numeric_dtype(dt[i]) for i in xs_all]
    })

    # breakslist of bins
    bins_breakslist = bins[~bins['breaks'].isin(["-inf","inf","missing"]) & ~bins['is_special_values']]
    bins_breakslist = pd.merge(bins_breakslist[['variable', 'breaks']], vars_class, how='left', on='variable')
    # quote the breaks of non-numeric variables
    bins_breakslist.loc[bins_breakslist['not_numeric'], 'breaks'] = '\''+bins_breakslist.loc[bins_breakslist['not_numeric'], 'breaks']+'\''
    bins_breakslist = bins_breakslist.groupby('variable')['breaks'].agg(lambda x: ','.join(x))

    if to_string:
        bins_breakslist = "breaks_list={\n"+', \n'.join('\''+bins_breakslist.index[i]+'\': ['+bins_breakslist[i]+']' for i in np.arange(len(bins_breakslist)))+"}"
        if save_string is not None:
            # file name: <save_string>_<YYYYmmdd_HHMMSS>.py
            brk_lst_name = '{}_{}.py'.format(save_string, time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time())))
            with open(brk_lst_name, 'w') as f:
                f.write(bins_breakslist)
            print('[INFO] The breaks_list is saved as {}'.format(brk_lst_name))
            return
    return bins_breakslist
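The text written to the file is itself a Python assignment. The sketch below reproduces that string format from a plain dict, following the "breaks_list={...}" pattern built in bins_to_breaks, and then executes it to show that it is valid Python. The function name format_breaks_list and the break values are hypothetical.

```python
def format_breaks_list(breaks):
    """Render {variable: [breaks]} as an assignable Python snippet."""
    lines = []
    for name, values in breaks.items():
        # string breaks are quoted, numeric breaks are written as they are
        rendered = ",".join(repr(v) if isinstance(v, str) else str(v) for v in values)
        lines.append("'{}': [{}]".format(name, rendered))
    return "breaks_list={\n" + ", \n".join(lines) + "}"

text = format_breaks_list({"age": [26.0, 35.0], "housing": ["rent", "own"]})

namespace = {}
exec(text, namespace)   # running the snippet defines breaks_list
```

Because the saved file is plain Python, its breaks_list dict can be loaded in a later session and passed back through the breaks_list parameter to reproduce the same bins.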
