We are living in an era where data analytics has been upgraded from a mere mainstream small-talk topic to a key business driver. Almost everybody is trying to get better at advanced analytics, AI, or other data-based activities. Some do so out of sheer conviction of its importance; some are forced by groupthink or competitors' moves. Let's admit it: for most organizations and teams this is quite a leap. So they, …
Leaning on methods
… lean on whatever literature or manuals are around. I feel a strange mixture of sympathy and disdain for them, because almost all self-learning courses and books on analytics focus primarily on the methods of data science. You can read about which algorithms are suited for which tasks, how to program them, or how to use open-source packages to calculate them (as black boxes) for you. If you have dwelt in analytics a bit longer, you are probably smiling already and know where I am heading with this point. A large part of the novices (and late- or returning-comers) fail to understand that the HOW (which AI method) is becoming more and more of a commodity. Thus, the more you try to specialize in those categories, the less promising your future in analytics might be. Don't get me wrong, I am far from suggesting that a new data scientist should not (or need not) understand the methods behind the models; quite the contrary. My point is that trying to get better at writing logistic regression or training a decision tree is like trying to get better at digging potatoes in the era of tractors. Machines have become so good at applying the analytical constructs that it is Don Quixote-ish to fight them. Algorithms are becoming commodities, often free ones maintained by the community. So where should you channel your skill improvement in analytics?
Where to go?
Of all the areas needed for successful analytics (see here), feature engineering seems to be the most promising bet if you want to make a significant impact. Since it is the first step of the analytical funnel, whatever defects (or data loss) you introduce there are carried through the whole analytical process. Kaggle competitions have repeatedly shown that an informational disadvantage is almost never compensated by a "smarter" algorithm further down the road of model training. What surprises me, though, is that in the ever-mushrooming flood of books on analytics, you find very little on feature engineering. Ironically, it is difficult to stumble even upon a definition of what feature engineering should and should not do, not to mention best practices in this area.
That is exactly why I decided to sum up briefly in this blog what my experience in machine learning suggests good feature engineering should include:
1] Extending the set of features. Whenever you start building an advanced analytical model, you begin with just a group of raw parameters (= inputs). As features are to a model what food is to humans, your life is much livelier if you have both enough of them and enough variety. To achieve that, feature engineering needs to make sure you "chew" the raw inputs and derive aggregated features as well. For example, you might have a history of customer purchases, but it is also important to calculate the lowest, usual, and highest amount ever purchased. You would also find it useful to know whether the customer buys more in a certain part of the year, or how long it has been since his/her last purchase. These are all aggregates. As you can probably smell from the examples, these are often standardized (like MIN, MAX, AVG, …) and foreseeable steps, so good feature engineering should include automated generation of such aggregates. However, besides the obvious aggregates, one also needs to create new pieces of information that are not directly represented in the raw inputs. When you are trying to predict a particular person's interest in buying ice cream, it is certainly interesting to know how long this person has already been buying ice cream products and what his/her total consumption is. But if you need to predict consumption in a certain short time frame, the ratio of ice creams per week would be more telling than the total consumption metric. That is where transformations, cross-pollination of features, and completely new dimensions come in. Therefore, good feature engineering should extend the feature set with BOTH aggregated features and newly derived features.
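To make the idea concrete, here is a minimal sketch of automated aggregate and ratio generation for the purchase-history example above. The data and feature names (`purchase_aggregates`, `purchases_per_week`, etc.) are hypothetical illustrations, not a reference implementation:

```python
from datetime import date
from statistics import mean

# Hypothetical raw input: (date, amount) purchase records for one customer.
purchases = [
    (date(2023, 1, 5), 12.0),
    (date(2023, 3, 17), 40.0),
    (date(2023, 6, 2), 25.0),
    (date(2023, 6, 30), 18.0),
]

def purchase_aggregates(purchases, as_of):
    """Derive standardized aggregates AND new ratio features from raw records."""
    amounts = [amt for _, amt in purchases]
    first = min(d for d, _ in purchases)
    last = max(d for d, _ in purchases)
    weeks_active = max((last - first).days / 7, 1)  # avoid division by zero
    return {
        "min_amount": min(amounts),    # lowest amount ever purchased
        "avg_amount": mean(amounts),   # the "usual" amount
        "max_amount": max(amounts),    # highest amount ever purchased
        "days_since_last": (as_of - last).days,
        # a derived ratio, not present in the raw inputs:
        "purchases_per_week": len(amounts) / weeks_active,
    }

features = purchase_aggregates(purchases, as_of=date(2023, 7, 15))
```

The MIN/AVG/MAX part is mechanical and can be generated for every numeric input automatically; the per-week ratio is the kind of derived feature that still needs a human idea behind it.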
2] Sorting out the hopeless cases. Plenty is certainly a good start, because from quantity you can breed quality. But being open in step one also means you will end up with a lot of hopeless predictors that obviously (by their very design) can have little impact on the prediction. There are many reasons why some features should be weeded out, but let me elaborate on one very common pitfall. Imagine you have 5 different sofas, from a spartan one to really fancy ones. Your model is to decide which of the sofa versions will be most appealing to a customer. If you have a parameter that takes only two values (think male and female) and these are evenly distributed in the sample, it is difficult for this parameter to classify customers into 5 groups with just 2 values (yes, I said difficult, not impossible). The other way around: if you have the colour scheme of the sofas coded into 10,000 colour hues and only about 5,000 customers in the sample, some colour options will not have even a single interested customer, so their predictive relevance becomes very questionable as well. Feature engineering should give you a plethora of inputs but should also spare you the pointless cases.
3] Prune for more efficient model training and operation. Some methods downright struggle under the burden of too many input parameters (think logistic regression). Others can swallow the load but take ages to recalculate or retrain the model. So it is strongly encouraged to allow your models more breathing space. To do so, one has to care for two dimensions. First, you should not allow clutter from too many distinct values in one variable. For instance, modelling the health of a person based on age is probably a sensible assumption, but nothing serious will come out of it if you count the age meticulously in the number of days the person had lived when the accident happened, rather than in years (or even decades). So good feature engineering should do (automatic) binning of the values. Secondly, the model should not struggle to maintain parameters that bring little additional value to the precision of the prediction; it is the role of feature engineering to prune the input set. A good model is not the most precise one, but the one with the fewest parameters needed to achieve acceptable precision. Only then do you achieve the quality-through-quantity maxim mentioned before.
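The age example translates into a one-line binning rule. This is a minimal sketch (assuming a flat 365-day year and decade-wide bins purely for illustration); real pipelines would typically use a library utility such as pandas' `cut`:

```python
def bin_age_days(age_in_days, bin_years=10):
    """Collapse an over-precise age (in days) into coarse year bins,
    e.g. decades, so the model sees a handful of levels instead of
    tens of thousands of distinct day counts."""
    years = age_in_days // 365          # crude day-to-year conversion
    low = (years // bin_years) * bin_years
    return f"{low}-{low + bin_years - 1}"

label = bin_age_days(13_870)   # 38 years old -> the "30-39" bin
```

The same principle drives the pruning side: after binning, features whose removal does not degrade precision beyond an acceptable tolerance should be dropped, leaving the smallest input set that still hits the target.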
4] Signal if I missed anything relevant. Last but not least, a good feature engineering approach should be your buddy, your guide through the process. It should be able to point you to areas of the information space that are under-served or completely missing from your input set. If you are modelling the probability of buying a certain product, you should have features on how pricey the product is. If you don't, the feature engineering system should remind you of that. You might scratch your head over how the system would know. Well, there are 2 common approaches to achieving this. You can either define a permanent dictionary of all possible features and create some additional layers/tags (like pricing issues) on top of the raw list; upon reading the set you are about to consider, the system would detect that there is no representative of the pricing category. Or, if you do not have the resources to maintain such a dictionary, you can use historical knowledge as the challenger: your system reviews similar models done by your organization and collects the areas of features used in those models. Then it can verify whether you covered all historically covered areas in your next model as well. Though this might sound like a wacky idea, in larger teams or in organizations with "predictive model factories", having such a tool is close to a must.
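The dictionary variant of this "buddy" check fits in a few lines. The dictionary contents and tag names below are hypothetical, invented for illustration:

```python
# Hypothetical feature dictionary: each known feature is tagged with the
# information-space area it covers.
FEATURE_TAGS = {
    "unit_price": "pricing",
    "discount_rate": "pricing",
    "weeks_since_last_purchase": "recency",
    "purchases_per_week": "frequency",
    "avg_basket_value": "monetary",
}

def missing_areas(chosen_features, feature_tags=FEATURE_TAGS):
    """Report which tagged areas have no representative in the chosen set."""
    all_areas = set(feature_tags.values())
    covered = {feature_tags[f] for f in chosen_features if f in feature_tags}
    return sorted(all_areas - covered)

gaps = missing_areas(["weeks_since_last_purchase", "purchases_per_week"])
# the system would warn that no pricing or monetary feature was included
```

The challenger variant works the same way, except `FEATURE_TAGS` is harvested from the feature lists of previous models rather than maintained by hand.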
We have gone together through the requirements that good feature engineering should fulfil. Now that you know them, you can return to the drawing board of your analytical project and think about how to achieve them. If you happen to work in Python, you can get a bit more inspiration on this topic from my recent presentation at the AI in Action Meetup in Berlin.