
Welcome to hccEncoding’s documentation!

Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.

This Python package implements preprocessing strategies for high-cardinality categorical data that allow this class of attributes to be used in predictive models. It currently provides two main methods, based on Daniele Micci-Barreca’s empirical Bayes method [ref1] and Owen Zhang’s leave-one-out encoding [ref2].

Functions

  • BayesEncoding(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False)
  • BayesEncodingKfold(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False,fold=5)
  • LOOEncoding(train,test,target,feature,noise=0.01,drop_origin_feature=False)
  • LOOEncodingKfold(train,test,target,feature,noise=0.01,drop_origin_feature=False,fold=5)

Please see the Example for a detailed explanation.
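To make the leave-one-out idea concrete, here is a minimal sketch of how such an encoding can be computed with pandas. It is not the package's LOOEncoding: the helper name loo_encode_sketch, its return value, and the fallback to the global mean for singleton and unseen categories are all assumptions made for illustration, and the noise step is left out.

```python
import pandas as pd


def loo_encode_sketch(train, test, target, feature):
    """Hypothetical helper illustrating leave-one-out target encoding."""
    y = train[target]
    grouped = y.groupby(train[feature])
    sums = grouped.transform("sum")      # per-category sum of the target
    counts = grouped.transform("count")  # per-category row count
    global_mean = y.mean()

    # Each train row gets the mean target of the OTHER rows in its category.
    # Categories with a single row have no other rows, so (as an assumption)
    # they fall back to the global mean.
    loo = ((sums - y) / (counts - 1)).where(counts > 1, global_mean)

    # Test rows have no target to leave out: use the full per-category mean
    # from train, and the global mean for categories never seen in train.
    category_means = grouped.mean()
    test_enc = test[feature].map(category_means).fillna(global_mean)
    return loo, test_enc


train = pd.DataFrame({"city": list("aaabbc"), "y": [1, 0, 1, 1, 0, 1]})
test = pd.DataFrame({"city": ["a", "b", "d"]})
train_enc, test_enc = loo_encode_sketch(train, test, "y", "city")
print(train_enc.tolist())
print(test_enc.tolist())
```

Excluding a row's own target keeps the encoded value from leaking that row's label into its feature, which is the motivation for the leave-one-out variant.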

General Parameters

  • train - train dataset, datatype: pandas dataframe
  • test - test dataset, datatype: pandas dataframe
  • target - name of target for prediction, datatype: string
  • feature - name of the feature to be encoded, datatype: string
  • k [default=5] - parameter for BayesEncoding and BayesEncodingKfold; determines half of the minimal sample size for which we completely ‘trust’ the estimate from the cell’s own data (the posterior probability) over the prior probability, datatype: int
  • f [default=1] - parameter for BayesEncoding and BayesEncodingKfold; controls how quickly the weight shifts from the prior to the posterior as the size of the group increases ([ref1] discusses the meaning of k and f further), datatype: int
  • noise [default=0.01] - noise added manually after encoding. For classification problems, uniformly distributed random noise in the range [-noise,noise]*data is added. For regression problems, normally distributed random noise drawn from norm(0,noise) is added, datatype: double
  • drop_origin_feature [default=False] - whether to drop the original feature, datatype: boolean
  • fold [default=5] - parameter for LOOEncodingKfold and BayesEncodingKfold; the number of folds into which the train dataset is split, datatype: int
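The roles of k and f are easier to see in code. The sketch below blends each category's mean target with the overall mean using a sigmoid weight, lambda(n) = 1 / (1 + exp(-(n - k) / f)), which is the weighting form commonly associated with the empirical Bayes approach. Whether BayesEncoding uses exactly this formula is an assumption; the helper name bayes_encode_sketch and its return value are hypothetical, and noise and K-fold handling are omitted.

```python
import numpy as np
import pandas as pd


def bayes_encode_sketch(train, test, target, feature, k=5, f=1):
    """Hypothetical helper illustrating prior/posterior blending with k and f."""
    prior = train[target].mean()  # overall mean of the target
    stats = train.groupby(feature)[target].agg(["mean", "count"])

    # Weight on the category's own mean: 0.5 when count == k, approaching 1
    # for large groups; a smaller f makes the transition sharper.
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))
    blended = lam * stats["mean"] + (1.0 - lam) * prior

    train_enc = train[feature].map(blended)
    # Categories unseen in train fall back to the prior (an assumption).
    test_enc = test[feature].map(blended).fillna(prior)
    return train_enc, test_enc


train = pd.DataFrame({"shop": list("aaaaab"), "y": [1, 1, 1, 1, 0, 0]})
test = pd.DataFrame({"shop": ["a", "b", "z"]})
train_enc, test_enc = bayes_encode_sketch(train, test, "y", "shop", k=5, f=1)
print(test_enc.tolist())
```

In this toy data, category "a" has exactly k rows, so its encoding sits halfway between its own mean and the prior, while the single-row category "b" is pulled almost entirely toward the prior.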

References
