Presentation from Data Natives Berlin 2018

Dear all,

as you have probably noticed on my LinkedIn, I had the pleasure of speaking at the DATA NATIVES conference in Berlin on "Feature Engineering Is Your Ticket To Survival In Analytics". As some of you probably could not make it, I decided to publish the slides here on the blog for you to benefit from:

 

FILIP_VITEK_TeamViewer_Feature_Selection_EN

Looking forward to any feedback from you!

When using information from this presentation, please respect the author’s right to be properly cited as the source of your ideas.

FREE TEST to detect your VIBA profile

You might have heard about VIBA, the attempt to describe different Data Scientist profiles. Maybe you have even heard what the V-I-B-A categories stand for and are wondering which profile is actually yours? Then this short test is aimed exactly at you. By answering the questions listed below, you can get an indication of what is most likely your VIBA “letter”.


Instructions to FREE VIBA test

Please answer all 8 questions below. For each question pick one answer only. If you feel that several answers might apply to you in a given question, try to pick the one that is closest to your situation at the moment. Each answer earns you the number of points stated in brackets. After answering all the questions, please sum the points earned across these 8 questions. The total score will guide you to what the test believes is your current Analytical letter.

 

1. What are the prevailing data types that serve as inputs for your Data Science work?

a] unstructured data OR sound/video data [12 points]

b] streams or batches of transactional data [1 point]

c] structured data covering longer time periods, with many aggregated Features or Proxies [4 points]

d] sensory data, readings from IoT devices or physical measurements [8 points]

2. Which part of the company most often requests (or uses) the data science outputs that you create?

a] Online commerce, Social media or PR [1 point]

b] Innovation, Research & Development teams [12 points]

c] Traditional (human operated) Sales, Strategy or Business development [4 points]

d] Operations, Finance or IT dept. [8 points]

3. How are the new data science models that you (and your team) build trained most of the time?

a] through annotated examples, most likely generated by humans [12 points]

b] through data from experiments, observations or simulations of reality [8 points]

c] using short time-window samples of a long-running process [1 point]

d] based on Features that are human-selected aggregates or proxies of raw data [4 points]

4. Which Data Science methods are predominantly used in the tasks your team works on? If several of the categories listed below are used, which one would you keep if you were allowed only one?

a] Advanced Machine Learning, Random Forests, Regressions or simple FF neural networks [4 points]

b] Deep learning (often CNN, DCN, KN, LSTM…) [12 points]

c] Time series, simple ML classifiers or Graph analytical methods [1 point]

d] Rule engine generators, Genetic algorithms, or more advanced topologies of Neural Networks [8 points]

5. Which tools/platforms do you CERTAINLY need to have AVAILABLE for your work?

a] Keras, TensorFlow or similar, NLP or other text mining tools [10 points]

b] Google Analytics or APIs to Social media [1 point]

c] SQL and analytical packages or open-source ML platforms [4 points]

6. What does a typical task that your team is asked to deliver look like?

a] Describe and analyze user flow, conversion rates or usage specifics [1 point]

b] Determine the probability of doing something OR describe segments of users [4 points]

c] Teach systems to decide or to replace the human role in processes [8 points]

d] Detect similarities or patterns in objects or texts [12 points]

7. When doing Data Science in your team, what is the most used data domain?

a] user/client preferences or online data [1 point]

b] Physical (2D or 3D) objects, art or the result of some creative work [12 points]

c] Purchase data, products or off-line customer data [4 points]

d] Processes and their stages, Motion or Logistics of things [8 points]

8. How long is the necessary/typical time window of input data that you need to train your models or prepare your Data Science deliverable?

a] Time does not play a role. Many repetitions/variations of the same object(s). [12 points]

b] Short time windows, usually below 3 months [1 point]

c] (Near) real-time data, often of many different types or sources from the same time window [8 points]

d] Longer time windows, usually 6M+ of the analyzed matter/event [4 points]

So what VIBA letter are you?

If you scored 0 – 23 points, your most likely letter is I, the Internet and Social media related tribe of Data Science. The closer you are to 8 points, the more evident this is. The closer you came to 23 points, the more inclination towards, or overlap with, other letters there might be.

If you scored 24 – 49 points, your most likely letter is B, the Behavioral Analytics group of Data Science. Staying near the lower bound of the interval indicates that you are probably also asked to analyze the online behavior of users. The closer you came to the upper limit of 49, the more your work might also be used to improve the decision making of processes or to automate things.

If you scored 50 – 72 points, your domain in Analytics is most likely A, the Automating & Autonomous space of Data Science. If you ended up just a few points above the lower limit of 50, we would guess that your automation is still in an area with a strong Human aspect. Staying closer to the upper limit of 72 points means that autonomous aspects are paramount and your models probably also rely on measurements of reality or sensory inputs.

If you scored 73 – 94 points, you are living your analytical life as V, in the Visual & Voice analytics & Words analytics arena of Data Science. Scores in the 70s range would indicate that your work is somehow useful or needed for decision machines or automating things. Scores at the higher end of the interval signal a pure sensory orientation, most likely living off Deep learning algorithms.
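If you would rather let a script do the adding up, the scoring above can be sketched in a few lines of Python. The point values and score bands are transcribed from this test; the names `POINTS`, `BANDS`, `total_score` and `viba_letter` are illustrative assumptions, not part of any official VIBA tooling.

```python
# Points per answer, transcribed from the 8 questions of the test.
# Question 5 only offers answers a, b and c.
POINTS = {
    1: {"a": 12, "b": 1, "c": 4, "d": 8},
    2: {"a": 1, "b": 12, "c": 4, "d": 8},
    3: {"a": 12, "b": 8, "c": 1, "d": 4},
    4: {"a": 4, "b": 12, "c": 1, "d": 8},
    5: {"a": 10, "b": 1, "c": 4},
    6: {"a": 1, "b": 4, "c": 8, "d": 12},
    7: {"a": 1, "b": 12, "c": 4, "d": 8},
    8: {"a": 12, "b": 1, "c": 8, "d": 4},
}

# Upper bound of each score band and the letter it maps to.
BANDS = [(23, "I"), (49, "B"), (72, "A"), (94, "V")]


def total_score(answers):
    """Sum the points for a dict mapping question number -> chosen answer."""
    if set(answers) != set(POINTS):
        raise ValueError("Please answer all 8 questions, one answer each.")
    return sum(POINTS[question][choice] for question, choice in answers.items())


def viba_letter(score):
    """Map a total score to the most likely VIBA letter."""
    if score < 0:
        raise ValueError("Score cannot be negative.")
    for upper_bound, letter in BANDS:
        if score <= upper_bound:
            return letter
    raise ValueError("Score is above the maximum of 94 points.")


# Example: someone who picks the 4-point answer in every question.
answers = {1: "c", 2: "c", 3: "d", 4: "a", 5: "c", 6: "b", 7: "c", 8: "d"}
score = total_score(answers)
print(score, viba_letter(score))  # prints: 32 B
```

Keeping the bands in a small ordered list makes it easy to see how close a score sits to the neighbouring letter, which is exactly the overlap the descriptions above hint at.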

 

Happy about your VIBA letter? Surprised? Maybe you want to read more (or again) about your profile, now that you know which one you are?

Or wait for the next blog post (coming soon), which will discuss what a good analyst of your type should do and know, and how to make the transition to another VIBA letter.

Are you “V” or “A”? What type of Data Scientist are you?

Are you more of an ESTJ or rather an INFP? If you have ever taken the MBTI personality test, you probably know what I am referring to. If you have not come across the Myers-Briggs test, I strongly suggest you take the free test here; you will learn a lot about yourself (and also better understand the main topic of this blog).

It has been a long time since the first people found jobs as data analysts (in the 1980s). Structured Query Language (or SQL, as we know it now) gained heavily in popularity with the advent of ANSI-standardized rules. For most of the 1990 – 2010 period, you did not have to think about what sort of analyst you were. The only data around was (anyway) the structured data in databases or formatted data files. Processing data for an e-shop was just about the same task as analyzing utility data on water or gas consumption. That also led to the golden era of the data analyst: if you mastered SQL (and some statistics on top of that), you could easily swim between jobs or even industries.

Admittedly, I joined the labour force in this period as well, so when choosing the career path of a data analyst, I did not rack my brain too hard over what kind of analyst I would like to be. After all, being an analyst felt so generalist-ish. Almost like living in Ford's times and being perplexed by the question of whether you can drive a car, a truck or a bus. Hey, man, what’s the difference?!

But after 2010 things took a somewhat different twist. As I regularly attend several expert conferences on analytics, I started to notice a strange feeling building somewhere in the background. I saw fellow speakers present their analytics use cases, and they felt somewhat different from what my team was mainly working on. I still understood the principles and could replicate the projects if asked to, but I realized that we deemed those marginal and rarely pursued them in our company.

A few years down the road, I was getting certain: Analytics is not a single continent where you can move freely from one area to another. It is rather an archipelago of torn-apart islands, and tectonic forces push its pieces further and further apart. You can still see there is a party on the other island, but you would need to get really wet to get there and take part in it. I started to investigate how many pieces the analytics continent had disintegrated into. Then, on one of my business flights, an article by Kai-Fu Lee in the Nov 2018 Europe edition of Fortune nudged me. Wow, these are exactly the clusters I have been thinking of! Slightly different naming and some thoughts in another order, but it strongly converged with what I believed.

Thus, I decided to wrap my older thoughts around the ideas seeded by Kai-Fu Lee and came up with VIBA, (most likely) the first attempt at a persona model for data scientists, helping you to find where you stand on the map of analytics. Its usefulness is dwarfed by that of MBTI (or similar) profiling, but it gives you a basic frame of where you are … and, most importantly, where you should go. It comprises 4 different tribes of analytics, each living on its own island, moving slightly apart from each other with every quarter of the year. V-I-B-A stands for:

Visual & Voice analytics & Words analytics. Primarily attempting to detect patterns in images, video, voice and text. Building their models on unstructured data that needs to be annotated to help the engine learn the patterns. Trying to make sense of perception-based data and to compete with (or even replace) the human senses. Mostly working on different topologies of neural networks, tapping into tools like Keras, TensorFlow and NLP libraries. If you work in this area, you mostly encounter objects, sound samples or text excerpts. Projects usually have a longer lead-in phase for exploring data and extensive training. Requirements for analytical progress are framed by Innovation teams or R&D departments.

Internet and Social media related. Striving to describe user flow or conversion in online apps, e-commerce or through interactions on social media. Analytics based mainly on streams of user-generated data from clicks & purchases, often monitoring a rather short time window of data. Relying on Time series analysis, some forms of Machine learning or Graph database analytics. Supporting Online and Marketing departments of larger companies or start-ups in these fields. To survive on this island, you need to be familiar with Google Analytics and/or APIs to major social media networks. Crunching and storing of data often happens in an external cloud, and analytical results are needed on a (near) real-time basis. Projects have a short time span; their results often morph into permanent monitoring of the discovered patterns.

Behavioral Analytics. Serving requests coming from Traditional sales channels, CRM departments or Product management teams. Delivers predictive models about the crucial business behaviour of clients/users. Alternatively, clusters the portfolio of users or estimates their lifetime value. Inputs come in the form of numerical or categorical features about clients, often still calculated from underlying relational databases. Features are rather aggregated, human-suggested information about behaviour. Analytical effort on this island requires at least 6 months of history to arrive at stable results. To operate in this area, you still need to master SQL and some (maybe proprietary) analytical packages for regressions, decision forests or classification algorithms. Analysis rarely happens in the cloud; most of the data is stored on own premises. Projects span days, or weeks at most, leading to regular, ongoing scoring of the user base via the developed models.

Automating & Autonomous. This type of analytics aims to generate rule engines and sophisticated models that drive decisions in a certain process, at the request of Operations or R&D teams. Takes (streamed) low-level sensory data coming from experiments or real-life readings. Tries to make sense of them through multiple layers of its own features. Using Deep Learning methods, it teaches the machine to decide within a stated error rate allowance. Data are either real-time or historical logs of some process or sequence of motions. Uses open-source neural-network-based packages like TensorFlow, Keras or Caffe. Working in this area, you most likely come across human-driven processes being handed over to machines. Analytics happens on or close to the mechanical hardware, or centrally in the cloud for a large number of parallel process iterations.

Do you already know what YOUR PRIMARY letter in analytics is? To support you in this effort, I created a free VIBA test. I suggest you take it right now, to assure (or surprise) yourself about it.

As mentioned before, the V-I-B-A profiles are distinct, though you might be on the edge of more than one “letter”. What is important to understand is that each VIBA letter has its own reason to exist. As with MBTI, there is no good or bad profile, just one more suited to some jobs and another to other jobs. You might be pleased with where VIBA places you, or less so. As with a personality test, you might even choose to move from one VIBA island to another. Or stick harder to the tribe you belong to now. In both cases it is essential for you to understand what it takes to be a good “V”, “I”, “B” or “A”. For that matter, I am preparing (soon to come) a separate blog post explaining what each letter should do to improve their mastery and what skills one needs to acquire if (s)he wants to hop to another VIBA island.

I am more than happy to receive at info@mocnedata.sk any feedback or learnings that you might arrive at while working with VIBA. If you want to share VIBA with your friends to compare whether you are part of the same tribe, feel free to pass on this blog post. If you intend to use the methodology in your professional (or academic) work, please pay tribute by properly citing the authorship of this article.

5+1 interesting AI videos

After my reviews of some AI-related books caught quite some interest from the TheMightyData.com community, I decided to elaborate on these points a bit and point you to yet more insights. To prevent turning you into bookworms, I decided to pick a slightly more engaging medium this time, and I would like to inspire you to watch some good AI-related talks on video. Enjoy!

How AI can save our humanity | Kai-Fu Lee

If you work in the AI community, the name Kai-Fu Lee will probably not be unknown to you, as he stood behind the boost of AI at Apple (and a few other platforms). What is more, his view of AI is rather encouraging. In his life (and in the attached video) he strives to point out areas where people can succeed even after AI fully kicks in. Is it a hope worth following? Well, I leave that conclusion up to you after seeing this great video.

A brain in a supercomputer | Henry Markram

There are at least 2 fascinating things about this video. The first is that it was filmed 8 years (!) ago, which shows how far ahead Markram's team already was in this topic back then. Even now, years down the road, some research teams are nowhere near the same understanding of the matter. The second fascinating aspect of the video is that in less than 16 minutes, it walks the viewer from the basics of Neuroscience up to expert insight into Neural Simulations. And that is certainly worth your 16 minutes.

The Truth Behind Artificial Intelligence | Andrew Zeitler

Andrew was only (hard to believe) 17 years old when he delivered this speech. And if you pardon his youthful enthusiasm (here and there bordering on affectation), he will introduce you to a load of interesting thoughts on what the next milestones for General AI are, as well as how General AI will most likely behave when it comes. For those educated in the matter, some parts of the speech might be, I admit, a bit sluggish. However, did you ever bring a hall of that size to applause when you were 17?

 

 

Where AI is today and where it’s going. | Richard Socher

Richard Socher is a professor at the Stanford Computer Science Department, mainly focusing on Deep Learning. And he is awesome at clearly and humorously explaining what progress has been achieved in neural networks lately, as well as stating where AI still drags its feet (as you will see, sometimes literally). For those facing a master's or PhD thesis, this might be a good short list of highly desired topics. If you are out of school and already happen to work in the AI area, it may be an inspiration to learn more about some areas of AI you have not come across yet. But even if you are only lightly interested in AI topics, Richard is a great entertainer, so it's worth seeing the video just for fun's sake.

 

Music and Art Generation using Machine Learning | Curtis Hawthorne

Creativity is often cited as the “last fortress of human superiority”. Machines cannot be creative, so at least here we are safe to assume human dominance for a longer time, right? Well, to assess to what extent that is really true, I suggest you see this video by Curtis Hawthorne, who reports on how far machines (theirs at Google) have got so far. Next time you hear this soothing self-defense of humans, you will already have educated arguments for the discussion.

The final bonus +1 track is none other than Nick Bostrom, the author of the book Superintelligence. In his video he describes the basic principles of his book. So if you have not read my review of this fascinating piece of reading, give the author himself a try to motivate you to read it.

Unconventional methods of Feature engineering

My dear fellows,

after settling into the Berlin data science community and visiting a few interesting local meet-ups, I agreed to contribute to the community's knowledge sharing as well. Thus, if you have a while, you can join Sebastian Meier [Technologiestiftung Berlin], Sandris Murins [Iconiq Lab] and me on June 14th 2018 at 18:00 at AI IN DATA SCIENCE – BERLIN at the Wayfair DE Office, Köpenicker Str. 180, 10997 Berlin, Germany.

To assure you that it really pays off to reserve some time for this meet-up, let me give you a sneak preview of what you might get from this event:

BERLIN AI TALK - Filip Vitek

In my presentation, titled “Feature engineering – your trump card in Machine Learning”, you will find out the 2 key reasons why feature engineering has a future. The first explains how variables can become the competitive advantage of your models. The second goes even beyond that and hopefully opens your eyes to why feature engineering might actually be essential for Data Scientists’ future job security.

After getting warmed up on the overall role of features, we shall jump into how variables are actually designed/generated in real life. What are the most used approaches to designing features? But more importantly, what are the common pitfalls that we all often fall into while designing our feature sets for Machine Learning predictions? Are there tricks to prevent us from failing at these?


However, just demoting the traditional ways and solely pointing out their notorious downsides without naming the alternative(s) would be a bit unfair. Therefore, in the second half of my talk I want to walk you through unconventional ways to design your features. What is more, I will draw on 4 specific case studies where unconventional features were needed. I hope you get inspired for your own work.

If you cannot make it to this talk (for whatever reason) and you are a member of the MightyData Community, you will find the presentation slides here after the event, in a members-only blog post. If you are not yet a member of the free MightyData Community, you can get your free membership in less than 2 minutes by registering here.

Looking forward to seeing you all at the AI Meet-up event!

 

 

 

Bread for 8 Facebook likes. You can already buy food for personal data.

Data have been labelled a commodity in corporate speak for quite some time already. Monetization (i.e. selling) of data is not just a trendy concept but has turned out to be the essence of business for a horde of small and medium companies (let me name Market locator as one of the shining examples). However, so far the data trade has been mainly a B2B business. An ordinary person could not pay for regular food supplies with his/her own data. That is over for good. A grocery store has opened in Germany where you pay for your purchase solely with personal data.


Milk for 10 photos, bread for 8 likes

No, this is not an April Fools' joke or a hidden prank. In the German city of Hamburg, Datenmarkt (= data market) has opened its first store, where you pay solely with personal data from Facebook. Toast bread will cost you 8 Facebook likes, or you can take a pack of filled dumplings for full disclosure of 5 different Facebook messages. As you would expect of a grocery store, Datenmarkt recently ran a special weekly promotion on most fruit, with the price cut to just 5 Facebook photos per kilo.


Payment follows a process similar to the one you know from credit card payment via a POS terminal. But instead of entering a PIN into the POS terminal to authorize the transaction, you have to log into your Facebook account and indicate which Likes given, Facebook photos posted or messages from FB Messenger you have decided to pay with. The cashier receipt also shows the data that you decided to sacrifice. The provided data are at the full disposal of the Datenmarkt investors, who may sell them (along with your metadata) to any interested third party. By paying for the goods with your social data, you give Datenmarkt full consent to this trading.


Insights we give away

What might look like an elegant and definitive solution to homeless people starving actually has much higher ambitions. The authors of this project also want to raise awareness that in our age any data point has its value, reminding us that the sheer volume of data we freely give away to titans like Google, Facebook or YouTube actually has real monetary value attached to it.

In more mature markets the echo of “what is the value of a person's digital track” resonates more intensively with every quarter of the year. To put it differently, if we hand over data to “attention brokers” (as Tim Wu, a Columbia University professor studying the matter, labelled them), what do we actually get in return for those data? We might come to the point where outstanding debt forces us not only to “smash our piggy bank” but to sacrifice our Facebook assets as well. Most governments have set a monetary equivalent of human life (often calculated as lifetime tax contribution). But the value of our digital life still remains to be added. Bizarre? Well, not any more in our times.

Food Yes, but not for the services

The opening of the Hamburg Datenmarkt, sadly, confirms the hypotheses that behavioural economists have been researching for more than 5 years. When monetizing their data, people simply tend to stick to the same principles as with sweets (or other tiny sins). Immediate gratification wins over the promise of some future value. Most citizens comprehend that their personal data might have some value. Yet if given the option of using personal data to get cheaper or better services (e.g. insurance or the next haircut for free), up to 80% of them turn the offer down. However, if the same people were given the choice of something tangible immediately, they were willing to trade their personal data for things as trivial as one pizza.


Now your thoughts go: “Wait, but the above-mentioned digital giants are not offering anything tangible; all of the Google, Facebook and YouTube business is in the form of services, isn’t it?” And you may be right, but the point here is that all of them offer immediate gratification (immediate likes on Facebook, hunger for a quick search on Google or immediate fun with YouTube videos). Quite a few people would see trading their digital data (deemed to be as material as the air we breathe in and out) for something consumable as a “great deal”. If this trend is confirmed, it is likely that services exchanging a yogurt or a concert ticket for our personal data will be mushrooming around. If you have a similar business idea in your head, you had better act on it fast. A commission for the themightydata.com portal for seeding this idea in your head remains voluntary 😊.

Hold your (data) hats …

Immediately after the roll-out of personal data payment for existing goods, someone will be willing to give you goods for a personal data loan (or data mortgage). In other words, you will hand over your author's rights in advance, even before the content is created. For those in financial difficulties, this might still seem a more reasonable thing to give away than the fridge or your shelter. But here we are already crossing the bridge to a kind of digital slavery. Keep in mind that the most vulnerable would be young people, who still have a long (and thus precious) digital life ahead but are short on funds (e.g. to buy a new motorbike) at this stage of life. The whole trend will probably be fuelled by some industries (like utilities) that long for data to be their saviour from the recent misery of tiny margins.

Soon we shall expect that, besides the Bitcoin, Ethereum or Blackcoin fever, a person will be able to mine (from their own Facebook account) a currency like FB-photoCoin or FB-LikeCoin. And our age will gain yet another bizarre level on top. Just imagine the headline “Celebrity donated 4 full albums of her Facebook photos to charity”. But nobody can eat from that? Or can they now?
