What does (or should) good Feature engineering look like?

We are living in an era in which data analytics has been upgraded from a mere mainstream small-talk topic to a key business driver. Almost everybody tries to get better at advanced analytics, AI or other data-based activities. Some do so from sheer conviction of its importance, some are pushed by groupthink or by competitors’ moves. Let’s admit it, for most organizations and teams this is quite a leap. So they …

Leaning on methods

… lean on whatever literature or manuals are around. I feel a strange mixture of sympathy and disdain for them, because almost all self-learning courses and books on analytics focus primarily on the methods of Data Science. You can read which algorithms are suited for which tasks, how to program them, or how to use open-source packages that calculate them (as black boxes) for you. If you have dwelt in analytics a bit longer, you are probably smiling already and know where I am heading with this point. A large part of the novices (and late- or returning-comers) fail to understand that the HOW (which AI method) is becoming more and more of a commodity. Thus, the more you try to specialize in those categories, the less promising your future in analytics might be. Don’t get me wrong, I am far from suggesting that new Data Scientists should not (or need not) understand the methods behind the models, quite the contrary. My point is that trying to get better at writing a logistic regression or training a decision tree is like trying to get better at digging potatoes in the era of tractors. Machines have become so good at applying the analytical constructs that fighting them is quixotic. Algorithms are becoming commodities, even commodities that are freely available in the community. So where should you channel your skill improvement in analytics?

Where to go?

Of all the areas needed for successful analytics (see here), Feature engineering seems to be the most promising bet if you want to make a significant impact. Because it is the first step of the analytical funnel, whatever defects (or data loss) you introduce there are carried through the whole analytical process. Kaggle competitions suggest again and again that an informational disadvantage is almost never compensated by a “smarter” algorithm further down the road of model training. What surprises me, though, is that in the ever-mushrooming inundation of books on analytics you find very little on Feature engineering. Ironically, it is difficult to stumble even upon a definition of what feature engineering should and should not do, not to mention best practices in this area.

That is exactly why I decided to sum up briefly in this blog what my experience in Machine Learning suggests good Feature engineering should include:

1] Extending the set of features. Whenever you start to build an advanced analytical model, you start with only a group of raw parameters (= inputs). Features are to a model what food is to humans: your life is much livelier if you have both enough food and a variety of it. To achieve that in analytical efforts, feature engineering needs to make sure that you “chew” the raw inputs and get some aggregated features as well. For example, you might have the history of customer purchases, but it is also important to calculate the lowest, the usual and the highest amount ever purchased. You would also find it useful to know whether the customer buys more in a certain part of the year, or how much time has passed since his/her last purchase. These are all aggregates. As you can probably smell from the examples, these are often standardized (like MIN, MAX, AVG, …) and foreseeable steps, so good Feature engineering should include automated generation of the aggregates. However, besides the obvious aggregates, one also needs to create new pieces of information that are not directly represented in the raw inputs. When you are trying to predict a particular person’s interest in buying ice cream, it is certainly interesting to know for how long this person has been buying ice cream products and what his/her total consumption is. But if you need to predict consumption in a certain short time frame, the ratio of ice creams per week is more telling than the total consumption metric. That is where transformations, cross-pollination of features and completely new dimensions come in. Therefore, good feature engineering should extend the set of features with BOTH aggregated features and newly derived features.
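To make point 1] a bit more tangible, here is a minimal Python sketch of generating both kinds of features from one customer’s purchase history. The data, the function name build_features and the feature names are all made up for illustration; take it as one possible shape of such a step, not as a reference implementation.

```python
from datetime import date
from statistics import median

# Hypothetical raw input: one customer's purchases as (date, amount) pairs.
purchases = [
    (date(2019, 1, 5), 12.0),
    (date(2019, 1, 19), 30.0),
    (date(2019, 2, 2), 18.0),
]

def build_features(purchases, today):
    """Turn raw purchases into aggregated and newly derived features."""
    amounts = [amount for _, amount in purchases]
    dates = [day for day, _ in purchases]
    first, last = min(dates), max(dates)
    # Assumption: count at least one week of activity, so that a customer
    # with a single purchase does not cause a division by zero.
    weeks_active = max((last - first).days / 7, 1.0)
    return {
        # Standardized aggregates (MIN, usual value, MAX).
        "min_amount": min(amounts),
        "median_amount": median(amounts),
        "max_amount": max(amounts),
        # Derived features that are not present in the raw inputs.
        "days_since_last_purchase": (today - last).days,
        "purchases_per_week": len(purchases) / weeks_active,
    }

features = build_features(purchases, today=date(2019, 2, 20))
```

The median stands in here for the “usual” amount; an average would serve the same purpose if outliers are not a concern.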

2] Sorting out the hopeless cases. Plenty is certainly a good start, because from quantity you can breed quality. But being open in step one also means you will have a lot of hopeless predictors that obviously (by their design) can have little impact on the prediction. There are many reasons why some features should be weeded out, but let me elaborate on one very common pitfall. Imagine you have 5 different sofas, from a spartan one to really fancy ones. Your model is to decide which of the sofa versions will be most appealing to the customer. If you have a parameter that has only two values (think male and female) and these are evenly distributed in the sample, it is difficult for this parameter to classify customers into 5 groups with just 2 values (yes, I said difficult, not impossible). The other way around, if you have the colour scheme of the sofas coded into 10,000 colour hues and you have about 5,000 customers in the sample, at least half of the colour options will not have even a single interested customer, so their predictive relevance gets very questionable as well. Feature engineering should give you a plethora of inputs but should also spare you the pointless cases.
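The sofa pitfall from point 2] can be turned into a simple automated screen. The sketch below only looks at the number of distinct values of a feature; the function name screen_feature, the threshold min_rows_per_value and the sample data are my illustrative assumptions, and real weeding-out would certainly consider more than cardinality.

```python
def screen_feature(values, n_classes, min_rows_per_value=5):
    """Return a verdict on whether a feature's cardinality looks hopeless.

    The thresholds are illustrative assumptions, not established rules.
    """
    n_rows = len(values)
    n_distinct = len(set(values))
    if n_distinct < 2:
        return "drop: constant"
    if n_distinct < n_classes:
        # Difficult (not impossible) to separate 5 classes with 2 values.
        return "warn: fewer values than target classes"
    if n_rows / n_distinct < min_rows_per_value:
        # Each value is backed by too few customers to be reliable.
        return "warn: too many values for the sample size"
    return "keep"

n_customers = 5000
# Synthetic columns mimicking the sofa example (plus one healthy feature).
gender = [i % 2 for i in range(n_customers)]                # 2 even values
colour_hue = [(2 * i) % 10000 for i in range(n_customers)]  # 5,000 hues used
region = [i % 8 for i in range(n_customers)]                # 8 values

verdicts = {
    "gender": screen_feature(gender, n_classes=5),
    "colour_hue": screen_feature(colour_hue, n_classes=5),
    "region": screen_feature(region, n_classes=5),
}
```

Note that the two-valued feature is only flagged, not dropped, in line with the “difficult, not impossible” remark.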

3] Prune to have more efficient model training and operation. Some methods downright struggle under the burden of too many input parameters (think Logistic regression). Others can swallow the load but take ages to recalculate or retrain the model. So it is strongly encouraged to allow models more breathing space. In order to do so, one has to care for two dimensions. Firstly, you should not allow clutter from too many distinct values in one variable. For instance, modelling the health of a person based on age is probably a sensible assumption, but nothing serious will come out of it if you count the age meticulously as the number of days the person had lived when the accident happened, rather than in years (or even decades). So good Feature engineering should do (automatic) binning of the values. Secondly, the model should not struggle with maintaining parameters that bring little additional value to the precision of the prediction, and it is the role of Feature engineering to prune the input set. A good model is not the most precise one, but the one with the fewest parameters needed to achieve acceptable precision. Only then do you achieve the quality-through-quantity maxim mentioned before.
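For the binning part of point 3], a tiny sketch of turning age counted in days into decade bins could look like this. The helper age_bin, the bin width of 10 years and the sample ages are assumptions for illustration only; a real pipeline would pick bin edges to fit the data.

```python
DAYS_PER_YEAR = 365.25  # average year length, leap years included

def age_bin(age_in_days, bin_years=10):
    """Map an age in days to a coarse label such as '30-39'."""
    years = age_in_days / DAYS_PER_YEAR
    lower = int(years // bin_years) * bin_years
    return f"{lower}-{lower + bin_years - 1}"

# Hypothetical ages in days: four distinct raw values.
ages_in_days = [9000, 12900, 13000, 26000]
bins = [age_bin(days) for days in ages_in_days]
```

Four distinct raw values collapse into three bins here; on a real sample the reduction from thousands of day counts to a handful of decades is far more dramatic.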

4] Signal if I missed anything relevant. Last but not least, a good feature engineering approach should be your buddy, your guide through the process. It should be able to point you to areas of the information space that are under-served or completely missing from your input set. If you are modelling the probability of buying a certain product, you should have included features on how pricey the product is. If you haven’t, the feature engineering system should remind you of that. You might scratch your head over how the system would know. Well, there are 2 common approaches to achieve this. You can define a permanent dictionary of all possible features and create some additional layers/tags (like pricing issues) on top of the raw list. Upon reading the set you are about to consider, the system would detect that there is no representative from the pricing category. If you do not have the resources to maintain such a dictionary, you can use historical knowledge as the challenger. Your system reviews similar models built by your organization and collects the areas of the features used in those models. Then it can verify whether you have covered all the historically covered areas in your next model as well. Though this might sound like a wacky idea, in larger teams or in organizations with “predictive model factories” having such a tool is close to a must.
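Both approaches from point 4] boil down to comparing the areas your candidate features cover with a list of areas that should be covered. A minimal sketch follows; the dictionary FEATURE_TAGS, the feature names, the tags and the past models are all hypothetical.

```python
# Hypothetical feature dictionary: feature name -> information area (tag).
FEATURE_TAGS = {
    "min_amount": "purchase history",
    "purchases_per_week": "purchase history",
    "days_since_last_purchase": "recency",
    "list_price": "pricing",
    "discount_share": "pricing",
    "age_bin": "demographics",
}

def missing_areas(candidate_features, required_areas, tags=FEATURE_TAGS):
    """List the required areas that no candidate feature represents."""
    covered = {tags[name] for name in candidate_features if name in tags}
    return sorted(set(required_areas) - covered)

# The feature set you are just about to consider for the new model.
candidate = ["min_amount", "days_since_last_purchase", "age_bin"]

# Approach 1: challenge the set with every area known to the dictionary.
all_areas = set(FEATURE_TAGS.values())
gaps_vs_dictionary = missing_areas(candidate, all_areas)

# Approach 2: challenge the set with areas used in past (similar) models.
past_models = [["min_amount", "list_price"], ["purchases_per_week", "age_bin"]]
historical_areas = {FEATURE_TAGS[name] for model in past_models for name in model}
gaps_vs_history = missing_areas(candidate, historical_areas)
```

In both cases the check reports that nothing from the pricing area made it into the candidate set, which is exactly the reminder described above.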

We went together through the requirements that good Feature engineering should meet. Now that you know them, you can return to the drawing board of your analytical project and think about how to achieve them. If you happen to work in Python, you can get a bit more inspiration on this topic from my recent presentation at the AI in Action Meetup in Berlin.

 

Berlin Meetup: Cool Feature engineering [my slides]

Dear fellows, 

on Wednesday, 20th Feb 2019, I was invited to speak at the AI IN ACTION Meetup organized by ALLDUS in Berlin. The topic was one of my favorite issues, namely Feature engineering. This time we looked at the issue from the angle of How To Do Cool Feature Engineering In Python. If you were in the Meetup crowd and failed to note down some figure, or if you are interested in the ideas discussed even though you were not there, you can find the presentation slides from that Meetup attached here.

slides >>> FILIP_VITEK_TeamViewer_Feature_Selection_IN_PYTHON

If you have any questions or a different opinion on some of the debated issues, do not hesitate to drop me a few lines at info@mocnedata.sk.

Social Media have too much of Diesel

This is not the ecology section of the TheMightyData.com portal. Neither is this blog about whether Facebook or LinkedIn drive electric cars or classical combustion engines. The issue here is more serious: our privacy and its handling are at stake. Well, just make your own call on this:

Indisputably the biggest scandal of the automotive industry was Diesel Gate. In this protracted case, car producers modified the software in the car so that the engine ran cleaner (read: with lower engine output), but only when the car computer recognized that it was docked to an emission measurement device. As a result, the exhaust emitted by the car was much lower during the test than in real driving out there on the roads. Since most countries carry out emission tests only at technical inspection sites (equipped with those docks), diesel cars appeared to be more ecological than they truly were.

An important twist of this scandal was the fact that the automotive industry (by manipulating car software) created an environment where it was its own only guard. Being your own judge does not necessarily mean you will cheat. But if you add a strong competitive fight and cheaper market entrants, the odds of misbehaving rise. Therefore, the only factor standing between fair customer treatment and fraud is often management integrity. In retrospect we know that automotive managers failed to hold on to the railing of principles.

If the manipulated cars were passing one emission test after another, you might wonder: how on Earth could they have been caught? The punchline of the story is actually very interesting and carries the lessons for social media mentioned in the blog headline. The misbehaviour of the car producers was detected by an NGO that tests car economy in real-life usage. Its emission readings for Diesel cars were an order of magnitude off the laboratory (docked) measurements, while the results of regular gasoline engines were quite close to the official numbers. That raised the suspicion of experts and triggered the avalanche of revelations about these fraudulent practices. But how does this relate to social media?

Today’s social media businesses stand just before a period similar to the one car manufacturers went through. When it comes to fake news control or hoax eradication, they are both the messenger of the news and the watchdog of its quality (and user impact). How well they are going to police the standards depends only on their inner moral integrity. What is more, they are in a similar position when it comes to our private information and its (commercial) usage. Even though their business model is built on monetizing the information of their own users, there is next to no regulation setting reasonable limits or punishing the greed of those Goliaths. The Cambridge Analytica issue was the poster-child example of what we are talking about. If you think that GDPR has brought some justice to the topic, look at how pathetic the improvements of some companies are. When it comes to XAI (Explainable AI, depicted here), not even a general guideline has been proposed.

Therefore, the answer to this treacherous mode might be similar to the one in Diesel Gate. After the full outbreak of the emission scandal, honest automotive companies desperately called for independent agencies to test publicly the gases actually emitted in real day-to-day driving. They realized that their business (and brands) depend on user trust. A car, after all, is THE tool to which we all entrust our lives and the lives of our families. Thus, a line of thought such as “If they cheated about such a banal thing as emissions, what else could they have lied about among the safety features?” is a dangerous rope to balance on. Unfair treatment of ecology (which, let’s face it, most of us are ignorant about) might easily turn into customers punishing Volkswagen (or another brand). So the prevention of a repeated fraud has been entrusted to independent third parties.

The problem of social media is that they are still in the Diesel phase. They have not understood/admitted the value of third-party auditing. Mark Zuckerberg (and other social media executives) keep talking us down with statements that the best prevention is to hand over the fake news fight to internally developed detection routines. They do so in the midst of several blunt failures around conspiracy theories and scandals over the mistreatment of their own users’ private data. Social media companies simply don’t realize that they risk the scenario that hit the automotive industry. They also neglect the fact that if users of Facebook (or another social network) feel they are being lied to or bullied about their own private data, they will trigger a massive exodus from that platform. If you count on human laziness prevailing and people swallowing their concerns, you had better know that 1] for the forthcoming generation, Facebook is already not the first social media choice; 2] a similar peace of mind about high (mental) switching costs was once held by mobile operators, banks and utilities, and they now have to struggle to keep their user bases.

Social media still have too much of the Diesel. If you want to benefit from this, don’t go selling electric cars to Zuckerberg, but rather design algorithms for them to evaluate how they work with user data. Or launch a new social media network that has a transparent and accessible audit of user data handling built directly into its core functions from the very beginning. Because social media will also go through their Diesel Gate, and probably pretty soon …

Presentation from Data Natives Berlin 2018

Dear all,

as you have probably noticed on my LinkedIn, I had the pleasure of talking at the DATA NATIVES conference in Berlin on Feature Engineering Is Your Ticket To Survival In Analytics. As some of you probably could not make it, I decided to publish the slides here on the blog for you to benefit from:

 

FILIP_VITEK_TeamViewer_Feature_Selection_EN

Looking forward to any feedback of yours!

When using information from this presentation, please honor the author’s right to be properly cited as the source of your ideas.

FREE TEST to detect your VIBA profile

You might have heard about VIBA, the attempt to describe different Data Scientist profiles. Maybe you have even heard what the V-I-B-A categories stand for and are wondering which profile is actually yours. Then this short test is aimed exactly at you. By answering the questions listed below, you can get an indication of what your VIBA “letter” most likely is.

Instructions to FREE VIBA test

Please answer all 8 questions below. For each question pick one answer only. If you feel that more answers might apply to you in a given question, try to pick the one that is closest to your situation at the moment. For each answer you will be assigned the number of points stated in brackets. After answering all the questions, please sum the points earned for your answers to these 8 questions. The total score will guide you to what the test believes is your current analytical letter.

 

1. What are the prevailing data types that serve as inputs for your Data Science work?

a] unstructured data OR sound/video data [12 points]

b] stream or batches of transactional data [1 point]

c] structured data of longer time periods with many aggregated Features or Proxies [4 points]

d] sensory data, readings from IoT devices or physical measurements [8 points]

2. Which part of the company most often requests (or uses) the data science outputs that you create?

a] Online commerce, Social media or PR [1 point]

b] Innovation, Research & Development teams [12 points]

c] Traditional (human operated) Sales, Strategy or Business development [4 points]

d] Operations, Finance or IT dept. [8 points]

3. How does the training of the new data science models that you (and your team) generate happen most of the time?

a] through annotated examples, most likely generated by humans [12 points]

b] through data from experiments, observations or simulations of reality [8 points]

c] using short time window samples of a long-running process [1 point]

d] based on Features that are human-selected aggregates or proxies of raw data [4 points]

4. What Data Science methods are dominantly used in the Data Science tasks your team is working on? If several of those listed below are used, which would you keep if you were allowed to have only one category available?

a] Advanced Machine Learning, Random Forests, Regressions or simple FF neural networks [4 points]

b] Deep learning (often CNN, DCN, KN, LSTM…) [12 points]

c] Time series, simple ML classifiers or Graph analytical methods [1 point]

d] Rule engine generators, Genetic algorithms, or more advanced topologies of Neural Networks [8 points]

5. What tools/platforms do you CERTAINLY have to have AVAILABLE for your work?

a] Keras, TensorFlow or similar, NLP or other text mining tools [10 points]

b] Google Analytics or APIs to Social media [1 point]

c] SQL and analytical packages or open-source ML platforms [4 points]

6. What does a typical task that you are asked to deliver in your team look like?

a] Describe and analyze user flow, conversion rates or user usage specifics [1 point]

b] Determine the probability of doing something OR describe segments of the users [4 points]

c] Teach systems to decide or to replace the human role in processes [8 points]

d] Detect similarities or patterns in objects or texts [12 points]

7. When doing Data Science in your team, what is the most used domain of data?

a] users’/clients’ preferences or online data [1 point]

b] Physical (2D or 3D) objects, art or results of some creative work [12 points]

c] Purchase data, products or off-line customer data [4 points]

d] Processes and their stages, Motion or Logistics of things [8 points]

8. How long is the necessary/typical time window of input data that you need to train your models or prepare your Data Science deliverables?

a] Time does not play a role. Many repetitions/variations of the same object(s). [12 points]

b] Short time windows, usually below 3 months [1 point]

c] (Near) real-time based, often of many different types or sources from the same time window [8 points]

d] Longer time windows, usually 6M+ of the analyzed matter/event [4 points]

So what VIBA letter are you?

If you scored 0 – 23 points, your most likely letter is I, the Internet and Social media related tribe of Data Science. The closer to 8 points you are, the more evident this is. The closer you came to 23 points, the more inclination towards or overlap with other letters there might be.

If you scored 24 – 49 points, your most likely letter is B, the Behavioral Analytics group of Data Science. Staying near the lower bound of the interval indicates that you are probably also asked to analyze the online behavior of users. The closer you came to the upper limit of 49, the more your work might also be used to improve decision making in processes or to automate things.

If you scored 50 – 72 points, your domain in Analytics is most likely A, the Automating & Autonomous space of Data Science. If you ended up just a few points above the lower limit of 50, we would guess that your automation is still in an area with a strong human aspect. Staying closer to the upper limit of 72 points means that autonomous aspects are paramount and your models probably also rely on measurements of reality or sensory inputs.

If you scored 73 – 94 points, you are living your analytical life as V, in the Visual & Voice analytics & Words analytics arena of Data Science. Scores in the 70s would indicate that your work is somehow useful or needed for decision machines or for automating things. Scores at the higher end of the interval signal pure sensory orientation, most likely living off Deep learning algorithms.
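If you prefer to let a few lines of Python do the summing, the evaluation described above can be sketched as follows. The intervals are copied from the four paragraphs above; the function name viba_letter and the example answers are just illustrative assumptions.

```python
# Score intervals and letters as stated in the test evaluation.
SCORE_RANGES = [
    (0, 23, "I"),
    (24, 49, "B"),
    (50, 72, "A"),
    (73, 94, "V"),
]

def viba_letter(points_per_question):
    """Sum the points of the 8 answers and look up the VIBA letter."""
    if len(points_per_question) != 8:
        raise ValueError("the test has exactly 8 questions")
    total = sum(points_per_question)
    for low, high, letter in SCORE_RANGES:
        if low <= total <= high:
            return total, letter
    raise ValueError(f"total score {total} is outside the 0-94 range")

# Hypothetical respondent who picked the 4-point answer in every question.
example_answers = [4, 4, 4, 4, 4, 4, 4, 4]
total, letter = viba_letter(example_answers)
```

For this made-up respondent the total is 32 points, which falls into the B interval.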

 

Happy about your VIBA letter? Surprised? Maybe you want to read more (or read again) about your profile, now that you know which one you are?

A follow-up blog (coming soon) will discuss what a good analyst of your type should do and know, and how to make the transition to another VIBA letter.

Are you “V” or “A”? What type of Data Scientist are you?

Are you more of an ESTJ or rather an INFP? If you have ever gone through the MBTI personality test, you probably know what I am referring to. If you have not come across the Myers-Briggs test, I strongly suggest you take the free test here; you will learn a lot about yourself (and also better understand the main topic of this blog).

It has been a long time since the first people found jobs as data analysts (in the 1980s). Structured Query Language (or SQL, as we know it now) gained heavily in popularity with the advent of the ANSI-standardized rules. For most of the 1990 – 2010 period, you did not have to think about what sort of analyst you were. The only data around (anyway) was structured data in databases or formatted data files. Processing data for an e-shop was just about the same task as analyzing utility data on water or gas consumption. That also led to the golden era of the data analyst: if you mastered SQL (and some statistics on top of that), you could easily swim between jobs or even industries.

Admittedly, I joined the labour force in this period as well, so when choosing the career path of a data analyst, I did not rack my brain too hard over what kind of analyst I would like to be. After all, being an analyst felt so generalist-ish. Almost like living in Ford’s times and being perplexed by the question of whether you can drive a car, a truck or a bus. Hey, man, what’s the difference?!

But after 2010 things took a somewhat different twist. As I regularly go to several expert conferences on analytics, I started to notice a strange feeling building somewhere in the back of my mind. I saw fellow speakers present their analytics use cases, and they felt somewhat different from what my team was mainly working on. I still understood the principles and could replicate the projects if asked to, but I realized that we deemed those marginal and rarely pursued them in our company.

A few years down the road, I was getting certain: Analytics is not a single continent where you can move freely from one area to another. It is rather an archipelago of torn-apart islands, and tectonic forces push its pieces further and further apart. You can still see there is a party on the other island, but you would need to get really wet to get there and take part in it. I started to investigate how many pieces the analytics continent has disintegrated into. Then, on one of my business flights, an article by Kai-Fu Lee in the Nov 2018 Europe edition of Fortune nudged me. Wow, these are exactly the clusters I have been thinking of! Slightly different naming and some thoughts in a different order, but it strongly converged with what I believed.

Thus, I decided to wrap my older thoughts around the ideas seeded by Kai-Fu Lee and came up with VIBA, (most likely) the first attempt at a persona model for data scientists, helping you to orient yourself on the map of analytics. It is dwarfed by the usefulness of MBTI (or similar) profiling, but it gives you a basic frame of where you are … and, most importantly, where you should go. It consists of 4 different tribes of analytics, each living on its own island, moving slightly further apart from the others with every quarter of the year. V-I-B-A stands for:

Visual & Voice analytics & Words analytics. Primarily attempting to detect patterns in images, video, voice and text. Building their models on unstructured data that needs to be annotated to help the engine learn the patterns. Trying to make sense of perception-based data and to compete with (or even replace) the human senses. Mostly working on different topologies of neural networks, tapping into tools like Keras, TensorFlow and NLP. If you work in this area, you mostly encounter objects, sound samples or text excerpts. Projects usually have a longer lead-in phase for exploring data and extensive training. Requirements for analytical progress are framed by Innovation teams or R&D departments.

Internet and Social media related. Striving to describe user flow or conversion in online apps, e-commerce or through interactions on social media. Analytics based mainly on streams of user-generated data from clicks & purchases, often monitoring a rather short time window of data. Relying on Time series analysis, some forms of Machine learning or Graph database analytics. Supporting Online and Marketing departments of larger companies or start-ups in these fields. To survive on this island, you need to be familiar with Google Analytics and/or APIs to major social media networks. Crunching and storing of data often happens in an external cloud, and analytical results are needed on a (near) real-time basis. Projects have a short time span; their results often morph into permanent monitoring of the discovered patterns.

Behavioral Analytics. Serving requests coming from Traditional sales channels, CRM depts or Product management teams. Delivers predictive models about crucial business behaviour of the clients/users. Alternatively, clusters the portfolio of users or estimates their life-time value. Inputs come in the form of numerical or categorical features about clients, often still calculated from underlying relational databases. Features are rather aggregated, human-suggested information about behaviour. Analytical effort on this island requires at least 6 months of history to arrive at stable results. To operate in this area, you still need to master SQL and some (maybe proprietary) analytical packages for regressions, decision forests or classification algorithms. Analysis rarely happens in the cloud; most of the data is stored on own premises. Projects span days, weeks at most, leading to regular, ongoing scoring of the user base via the developed models.

Automating & Autonomous. This type of analytics aims to generate rule engines and sophisticated models to drive decisions in a certain process, at the request of Operations or R&D teams. Takes (streamed) low-level sensory data coming from experiments or real-life readings. Tries to make sense of them through multiple layers of own features. Using Deep Learning methods, teaches the machine to decide within a stated error rate allowance. Data are either real-time or historical logs of some process or sequence of motions. Uses open-source Neural network-based packages like TensorFlow, Keras or Caffe. Working in this area, you most likely come across human-driven processes being handed over to machines. Analytics happens on or close to the mechanical hardware, or centrally in the cloud for a large span of parallel process iterations.

Do you already know what YOUR PRIMARY letter in analytics is? To support you in this effort, I created a free VIBA test. I suggest you take it right now to reassure (or surprise) yourself about it.

As mentioned before, the V-I-B-A profiles are distinct, though you might be on the edge of more than one “letter”. What is important to understand is that each VIBA letter has its own reason to exist. Similarly to MBTI, there is no good or bad profile, just one more suited for some jobs and another for other jobs. You might be pleased with where VIBA places you, or less so. As with a personality test, you might even choose to move from one VIBA island to another. Or stick harder to the tribe you belong to now. In both cases it is essential for you to understand what it takes to be a good “V”, “I”, “B” or “A”. For that matter I am preparing (soon to come) a separate blog explaining what each letter should do to improve their mastery and what skills one needs to acquire if (s)he wants to hop to another VIBA island.

I am more than happy to receive any feedback or learnings that you might arrive at while working with VIBA at info@mocnedata.sk. If you want to share VIBA with your friends to compare whether you are part of the same tribe, feel free to pass on this blog post. If you intend to use the methodology in your professional (or academic) work, please pay tribute by properly citing the authorship of this article.