Social media have too much Diesel in them

No, this is not an ecology section of the TheMightyData.com portal. Nor is this blog about whether Facebook or LinkedIn drive electric cars or classic combustion engines. The issue here is more serious: our privacy and how it is handled are at stake. Well, just make your own call on this:

Indisputably the biggest scandal of the automotive industry was Dieselgate. In this long-drawn-out case, car producers modified the software in the car so that it ran more efficiently (read: with lower engine output), but only in situations when the car's computer realized it was docked to an emission-measurement device. As a result, the pollution emitted by the car was much lower during the test than the values from real driving out there on the roads. Since most countries run emission tests only at technical inspection sites (equipped with those docking devices), diesel cars appeared to be more ecological than they truly are.

An important twist of this scandal was the fact that the automotive industry (through the manipulation of car software) created an environment where carmakers were their own only guards. Being your own judge does not necessarily mean you will cheat. But add a strong competitive fight and cheaper market entrants, and the odds of misbehaving rise. Often the only factor separating fair customer treatment from fraud is management integrity. In retrospect we know that automotive managers failed to hold on to that railing of principles.

If the manipulated cars were passing one emission test after another, you might wonder: how on Earth could they have been caught? The punchline of the story is actually very interesting and carries the lessons for social media mentioned in the blog headline. The misbehaviour of the car producers was detected by an NGO that tests car economy in real-life usage. For diesel cars, their emission readings were an order of magnitude off the laboratory (docked) measurements, while the results of regular gasoline engines were quite close to the official numbers. That raised the suspicion of experts and tipped off the avalanche of revelations of these fraudulent practices. But how does this relate to social media?

Today's social media businesses are just entering a period similar to the one car manufacturers went through. When it comes to fake-news control or the eradication of hoaxes, they are both the messenger of the news and the watchdog of its quality (and user impact). It is only a matter of their inner moral integrity how well they will police the standards. What is more, they are in a similar position when it comes to our private information and its (commercial) usage. Even though their business model is built on monetizing the information of their own users, there is next to no regulation setting the limits of reason or punishing the greed of those Goliaths. The Cambridge Analytica affair was the poster-child example of what we are talking about. If you think that GDPR has brought some justice to the topic, look at how pathetic the improvements of some companies are. When it comes to XAI (Explainable AI, depicted here), there is not even a general guideline proposed.

Therefore, the answer to this treacherous mode might be similar to the one found in Dieselgate. After the full outbreak of the emissions scandal, honest automotive companies desperately called for independent agencies to measure the levels of gases actually emitted in real day-to-day driving and to run public tests of actual emissions. They realized that their business (and brands) depend on user trust. A car, after all, is the one tool to which we entrust our lives and the lives of our families. Thus, a line of thought such as "If they cheated about such a banal thing as emissions, what else, security features included, could they have lied about?" is a dangerous rope to balance on. Unfair treatment of ecology (which, let's face it, most of us are ignorant about) might easily turn into a customer-mistreating Volkswagen (or any other brand). So the prevention of a repeat of the fraud has been entrusted to independent third parties.

The problem of social media is that they are still in the Diesel phase. They have not yet understood (or admitted) the value of third-party auditing. Mark Zuckerberg (and other social media executives) talk us down with statements that the best prevention is to hand the fake-news fight over to internally developed detection routines. They do so even in the midst of several blunt failures around conspiracy theories and scandals of mistreating the privacy data of their own users. Social media companies simply don't realize that they risk the scenario that hit the automotive industry. They also neglect the fact that if Facebook (or any social network) users feel they are being lied to or bullied about their own privacy data, they will trigger a massive exodus from that platform. If you count on human laziness prevailing and people swallowing their concerns, you had better know that 1] for the coming generation, Facebook is already not the first social media choice; 2] a similar peace of mind about high (mental) switching costs was once held by mobile operators, banks and utilities, and they now have to struggle to keep their user bases.

Social media still have too much of the Diesel in them. If you want to benefit from this, don't go selling electric cars to Zuckerberg; rather, design for them the algorithms to audit their work with user data. Or launch a new social media network that has a transparent and accessible audit of user data handling built directly into its core functions from the very beginning. Because social media will also go through their Dieselgate. And probably pretty soon …

Presentation from Data Natives Berlin 2018

Dear all,

as you have probably noticed on my LinkedIn, I had the pleasure of speaking at the DATA NATIVES conference in Berlin on "Feature Engineering Is Your Ticket To Survival In Analytics". As some of you could not make it, I decided to publish the slides here on the blog for you to benefit from them:


FILIP_VITEK_TeamViewer_Feature_Selection_EN

Looking forward to your feedback!

When using info from this presentation, please honor the author's right to be properly cited as the source of your ideas.

FREE TEST to detect your VIBA profile

You might have heard about VIBA, the attempt to describe different Data Scientist profiles. Maybe you have even heard what the V-I-B-A categories stand for and are wondering which profile is actually yours. Then this short test is aimed exactly at you. By answering the questions listed below, you can get an indication of what your VIBA "letter" most likely is.


Instructions for the FREE VIBA test

Please answer all 8 questions below. For each question, pick one answer only. If you feel more answers might apply to you in a given question, try to pick the one closest to your situation at the moment. Each answer earns you a certain number of points, stated in brackets. After answering all the questions, please sum the points earned across these 8 questions. The total score achieved will point you to what the test believes is your existing Analytical letter.


1. What is the prevailing data type that serves as input for your Data Science work?

a] unstructured data OR sound/video data [12 p]

b] streams or batches of transactional data [1 p]

c] structured data over longer time periods with many aggregated Features or Proxies [4 p]

d] sensory data, readings from IoT devices or physical measurements [8 p]

2. Which part of the company most often requests (or uses) the data science outputs that you create?

a] Online commerce, Social media or PR [1 p]

b] Innovation, Research & Development teams [12 p]

c] Traditional (human-operated) Sales, Strategy or Business development [4 p]

d] Operations, Finance or IT dept. [8 p]

3. How does the training of the new data science models that you (and your team) create happen most of the time?

a] through annotated examples, most likely generated by humans [12 p]

b] through data from experiments, observations or simulations of reality [8 p]

c] using short-time-window samples of a long-running process [1 p]

d] based on Features that are human-selected aggregates or proxies of raw data [4 p]

4. Which Data Science methods are dominantly used in the tasks your team is working on? If more of those listed below are used, which would you keep if allowed to have only one category available?

a] Advanced Machine Learning, Random Forests, Regressions or simple feed-forward neural networks [4 p]

b] Deep learning (often CNN, DCN, KN, LSTM …) [12 p]

c] Time series, simple ML classifiers or Graph analytical methods [1 p]

d] Rule engine generators, Genetic algorithms, or more advanced topologies of Neural Networks [8 p]

5. Which tools/platforms do you CERTAINLY have to have AVAILABLE for your work?

a] Keras, TensorFlow or similar, NLP or other text-mining tools [10 p]

b] Google Analytics or APIs to Social media [1 p]

c] SQL and analytical packages or Opensource ML platforms [4 p]

6. What does a typical task that you are asked to deliver in your team look like?

a] Describe and analyze user flows, conversion rates or user usage specifics [1 p]

b] Determine the probability of doing something OR describe segments of users [4 p]

c] Teach systems to decide or to replace the human role in processes [8 p]

d] Detect similarities or patterns in objects or texts [12 p]

7. When doing Data Science in your team, which domain of data is used the most?

a] user/client preferences or online data [1 p]

b] Physical (2D or 3D) objects, art or the results of some creative work [12 p]

c] Purchase data, products or off-line customer data [4 p]

d] Processes and their stages, Motion or Logistics of things [8 p]

8. How long is the necessary/typical time window of input data that you need to train your models or prepare your Data Science deliverable?

a] Time does not play a role; many repetitions/variations of the same object(s) [12 p]

b] Short time windows, usually below 3 months [1 p]

c] (Near) real-time based, often of many different types or sources from the same time window [8 p]

d] Longer time windows, usually 6M+ of the analyzed matter/event [4 p]

So what VIBA letter are you?

If you scored 0 – 23 points, your most likely letter is I, the Internet and Social media related tribe of Data Science. The closer to 8 points you are, the more evident this is. The closer you came to 23 points, the more inclination towards or overlap with other letters there might be.

If you scored 24 – 49 points, your most likely letter is B, the Behavioral Analytics group of Data Science. Staying at the lower bound of the interval indicates that you are probably also asked to analyze the online behavior of users. The closer you came to the upper limit of 49, the more your work might also be used to improve the decision-making of processes or to automate things.

If you scored 50 – 72 points, your domain in Analytics is most likely A, the Automating & Autonomous space of Data Science. If you ended up just a few points above the lower limit of 50, we would guess that your automation still lives in an area with a strong human aspect. Staying closer to the upper limit of 72 points means that autonomous aspects are paramount and your models probably also rely on reality-measuring or sensory inputs.

If you scored 73 – 94 points, you are living your analytical life as V, in the Visual & Voice & Words analytics arena of Data Science. Scores in the 70s range would indicate that your work is somehow useful or needed for decision machines or automating things. Scores at the higher end of the interval signal pure sensory orientation, most likely living off Deep learning algorithms.
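If you prefer code to mental arithmetic, here is a minimal sketch of the same scoring logic in Python. The point values and the score-to-letter ranges simply mirror the questions and the four paragraphs above; the function and variable names are hypothetical, chosen just for this illustration.

```python
# Minimal sketch of the VIBA test scoring described above.
# Per-answer points mirror the brackets in the 8 questions;
# score-to-letter ranges mirror the four interpretation paragraphs.

POINTS = {
    1: {"a": 12, "b": 1, "c": 4, "d": 8},
    2: {"a": 1, "b": 12, "c": 4, "d": 8},
    3: {"a": 12, "b": 8, "c": 1, "d": 4},
    4: {"a": 4, "b": 12, "c": 1, "d": 8},
    5: {"a": 10, "b": 1, "c": 4},          # question 5 has only three options
    6: {"a": 1, "b": 4, "c": 8, "d": 12},
    7: {"a": 1, "b": 12, "c": 4, "d": 8},
    8: {"a": 12, "b": 1, "c": 8, "d": 4},
}

def viba_letter(answers):
    """Sum the points of the 8 picked answers and map the total to a letter."""
    total = sum(POINTS[q][a] for q, a in answers.items())
    if total <= 23:
        return "I"   # Internet & Social media
    if total <= 49:
        return "B"   # Behavioral Analytics
    if total <= 72:
        return "A"   # Automating & Autonomous
    return "V"       # Visual & Voice & Words analytics

# Example: someone picking mostly "behavioral" answers scores 32 points -> "B"
print(viba_letter({1: "c", 2: "c", 3: "d", 4: "a", 5: "c", 6: "b", 7: "c", 8: "d"}))
```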


Happy about your VIBA letter? Surprised? Maybe you want to read about your profile again, now that you know which one you are?

Or stay tuned: soon there will be a next blog discussing what a good analyst of your type should do and know, and how to make the transition to another VIBA letter.

Are you “V” or “A”? What type of Data Scientist are you?

Are you more of an ESTJ or rather an INFP? If you have ever taken the MBTI personality test, you probably know what I am referring to. If you have not come across the Myers-Briggs test, I strongly suggest you take the free test here; you will learn a lot about yourself (and better understand the main topic of this blog, too).

It has been a long time since the first people found jobs as data analysts (in the 1980s). Structured Query Language (or SQL as we know it now) gained popularity heavily with the advent of the ANSI-standardized rules. For most of the 1990 – 2010 period, you did not have to think about what sort of analyst you were. The only data around (anyway) was structured data in databases or formatted data files. Processing data for an e-shop was just about the same task as analyzing utility data on water or gas consumption. That also led to the golden era of the data analyst: if you mastered SQL (and some statistics on top of that), you could easily swim between jobs or even industries.

Admittedly, I joined the labour force in this period as well, so when choosing the career path of a data analyst, I did not crack my head too hard thinking about what kind of analyst I would like to be. After all, being an analyst felt so generalist-ish. Almost like living in Ford times and being perplexed by the question of whether you can drive a car, a truck or a bus. Hey, man, what's the difference?!

But after 2010 things took a different twist. As I regularly go to several expert conferences on analytics, I started to notice a strange feeling building somewhere in the back of my mind. I saw fellow speakers present their analytics use cases, and they felt somewhat different from what my team was mainly working on. I still understood the principles, I could replicate the projects if asked to, but I realized that we deemed those marginal and rarely pursued them in our company.

A few years down the road, I was getting certain: Analytics is not a single continent where you can move freely from one area to another. It is rather an archipelago of torn-apart islands, and tectonic forces push its pieces further and further apart. You can still see there is a party on the other island, but you would need to get really wet to get there and take part in it. I started to investigate how many pieces the analytics continent had disintegrated into. Then, on one of my business flights, an article by Kai-Fu Lee in the November 2018 European edition of Fortune nudged me. Wow, these are exactly the clusters I had been thinking of! A bit different naming and some thoughts in another order, but it strongly converged with what I believed in.

Thus, I decided to wrap my older thoughts around the ideas seeded by Kai-Fu Lee and came up with VIBA, (most likely) the first attempt at a data scientists' persona model, helping you to orient yourself on the map of analytics. It is dwarfed by the usefulness of MBTI (or similar) profiling, but it gives you a basic frame of where you are … and, most importantly, where you should go. It comprises 4 different tribes of analytics, each living on its own island, slowly drifting apart from the others with every quarter of the year. V-I-B-A namely stands for:

V – Visual & Voice & Words analytics. Primarily attempting to detect patterns in images, video, voice and text. Building models on unstructured data that needs to be annotated to help the engine learn the patterns. Trying to make sense of perception-based data and to compete with (or even replace) the human senses. Mostly working on different topologies of neural networks, tapping into tools like Keras, TensorFlow and NLP. If you work in this area, you mostly encounter objects, sound samples or text excerpts. Projects usually have a longer lead phase of data exploration and extensive training. Requirements for analytical progress are framed by Innovation teams or R&D departments.

I – Internet and Social media related. Striving to describe user flows or conversions in online apps, e-commerce or through interactions on social media. Analytics based mainly on streams of user-generated data from clicks & purchases, often monitoring a rather short time window of data. Relying on Time series analysis, some forms of Machine learning or Graph database analytics. Supporting the Online and Marketing departments of larger companies, or start-ups in these fields. To survive on this island, you need to be familiar with Google Analytics and/or APIs to the major social media networks. Crunching and storing of data often happens in an external cloud; analytical results are needed on a (near) real-time basis. Projects have a short time span; their results often morph into permanent monitoring of the discovered patterns.

B – Behavioral Analytics. Serving requests coming from Traditional sales channels, CRM departments or Product management teams. Delivers predictive models about the crucial business behaviour of clients/users. Alternatively, clusters the portfolio of users or stipulates their life-time value. Inputs come in the form of numerical or categorical features about clients, often still calculated from underlying relational databases. Features are rather aggregated, human-suggested information about behaviour. Analytical effort on this island requires at least (6+) months of history to arrive at stable results. To operate in this area, you still need to master SQL and some of the (maybe proprietary) analytical packages for regressions, decision forests or classification algorithms. Analysis barely happens in the cloud; most of the data is stored on own premises. Projects happen in a span of days, max weeks, leading to regular, ongoing scoring of the user base via the developed models.

A – Automating & Autonomous. This type of analytics aims to generate rule engines and sophisticated models to drive decisions in a certain process, at the request of Operations or R&D teams. Takes (streamed,) low-level sensory data coming from experiments or real-life readings. Tries to make sense of them through multiple layers of its own features. Using Deep Learning methods, it teaches the machine to decide within a stated error-rate allowance. Data are either real-time or historical logs of some process or motion sequence. Uses opensource Neural-network-based packages like TensorFlow, Keras or Caffe. Working in this area, you most likely come across human-driven processes being handed over to machines. Analytics happens on/close to the mechanical hardware, or centrally in the cloud for a large span of parallel process iterations.

Do you already know what YOUR PRIMARY letter in analytics is? To support you in this effort, I created a free VIBA test. I suggest you take it right now, to assure (or surprise) yourself about it.

As mentioned before, the V-I-B-A profiles are distinct, though you might be on the edge of more than one "letter". What is important to understand is that each VIBA letter has its own reason to exist. Similarly to MBTI, there is no good or bad profile, just one more suited for some jobs as another is for other jobs. You might be pleased with where VIBA profiles you, or less so. As with a personality test, you might even choose to move yourself from one VIBA island to another. Or stick harder to the tribe you belong to now. In both cases it is essential for you to understand what it takes to be a good "V", "I", "B" or "A". For that matter I am preparing a (soon to come) separate blog explaining what each letter should do to improve their mastery and what skills one needs to acquire if (s)he wants to hop to another VIBA island.

I am more than happy to receive at info@mocnedata.sk any feedback or learnings that you might arrive at while working with VIBA. If you want to share VIBA with your friends to compare whether you are part of the same tribe, feel free to pass on this blog post. If you intend to use the methodology in your professional (or academic) work, please pay tribute by properly citing the authorship of this article.

5+1 interesting AI videos

After reviews of some AI-related books caught quite an interest from the TheMightyData.com community, I decided to elaborate on these points a bit and point you to yet more insights. To prevent turning you into bookworms, I picked a bit more engaging medium this time and would like to inspire you to watch some good AI-related video talks. Enjoy!

How AI can save our humanity | Kai-Fu Lee

If you work in the AI community, the name Kai-Fu Lee might not be unknown to you, as he stood behind the boost of AI at Apple (and a few other platforms). What is more, his view on AI is rather encouraging. In his life (and in the attached video) he strives to point out areas where people can succeed even after AI fully kicks in. Is it a hope worth following? Well, that conclusion I leave up to you to make after seeing this great video.

A brain in a supercomputer | Henry Markram

There are at least 2 fascinating things about this video. The first is that it was filmed 8 years (!) ago. That shows how far ahead Markram's team already was in the respective topic back then. Even now, years down the road, some research teams are not on the verge of the same understanding of the matter. The second fascinating aspect of the video is that in less than 16 minutes it walks the spectator from the basics of Neuroscience up to expert insight on Neural Simulations. And that is certainly worth your 16 minutes.

The Truth Behind Artificial Intelligence | Andrew Zeitler

Andrew was only (hard to believe) 17 years old when he delivered this speech. And if you pardon his young enthusiasm (here and there bordering on affectation), he will introduce you to a load of interesting thoughts on what the next milestones for General AI are, as well as how General AI will most likely behave when it comes. For those educated in the matter, some parts of the speech might be, I admit, a bit sluggish. However, had YOU pumped a hall of that size into applause when you were 17?


Where AI is today and where it's going. | Richard Socher

Richard Socher is a professor at the Stanford Computer Science Department, mainly focusing on Deep Learning. And he is awesome at clearly and humorously explaining what progress has been achieved in neural networks lately, as well as showing where AI still drags its feet (as you will see, sometimes literally). For those facing a master's or PhD thesis, this might be a good shortlist of highly desired topics. If you are out of school and happen to already work in the AI area, it might be an inspiration to learn more about some areas of AI you have not crossed paths with yet. But even if you are only lightly interested in AI topics, Richard is a great entertainer, so it is worth seeing the video just for fun's sake.


Music and Art Generation using Machine Learning | Curtis Hawthorne

Creativity is often cited as the "last fortress of human superiority". Machines cannot be creative, so at least here we are safe to assume human dominance for a longer time, right? Well, to assess to what extent that is really true, I suggest you watch this video by Curtis Hawthorne, who reports on how far they (at Google) have gotten machines so far. Next time you hear this soothing self-defense of humans, you will already have educated arguments to discuss.

The final +1 bonus track is nobody else but Nick Bostrom, the author of the book Superintelligence. In his video he describes the basic principles of his book. So if you have not read my review of this fascinating piece of reading, give the author himself a try to motivate you to read it.

Unconventional methods of Feature engineering

My dear fellows,

after settling down in the Berlin Data Science community and visiting a few interesting local meet-ups, I agreed to contribute to the community knowledge sharing as well. Thus, if you have a while, you can join Sebastian Meier [Technologiestiftung Berlin], Sandris Murins [Iconiq Lab] and me on June 14th 2018, 18:00, at AI IN DATA SCIENCE – BERLIN at the Wayfair DE Office, Köpenicker Str. 180, 10997 Berlin, Germany.

To assure you that it really pays off to reserve some time for this meet-up, let me give you a sneak preview of what you might get from this event:

BERLIN AI TALK - Filip Vitek

In my presentation, titled "Feature engineering – your trump card in Machine Learning", you will find out the 2 key reasons why feature engineering has a future. The first explains how variables can become the competitive advantage of your models. The second one goes even beyond that and hopefully opens your eyes to why Feature engineering might actually be essential for Data Scientists' future job security.

After getting warmed up on the overall role of features, we shall jump into how variables are actually designed/generated in real life. What are the most used approaches to designing features? But more importantly, what are the common pitfalls that we all often fall into while designing our feature sets for Machine Learning predictions? Are there tricks to prevent us from failing on these? (For a first taste, see the small sketch below.)
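To make the notion of "designing variables" a bit more tangible before the talk, here is a tiny, illustrative sketch in Python/pandas of the conventional approach: human-designed aggregate features computed from a raw transaction log. The column names and the snapshot date are hypothetical, picked just for this example; the sketch is not taken from the presentation itself.

```python
import pandas as pd

# Hypothetical raw transaction log: one row per purchase event.
raw = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "amount":    [30.0, 70.0, 15.0, 5.0, 40.0],
    "timestamp": pd.to_datetime([
        "2018-01-03", "2018-03-15", "2018-02-01", "2018-02-20", "2018-05-30",
    ]),
})

# Classic human-designed aggregates: spend totals, averages,
# purchase counts and recency, summarized per client.
snapshot_date = pd.Timestamp("2018-06-01")
features = raw.groupby("client_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    n_purchases=("amount", "size"),
    last_purchase=("timestamp", "max"),
)
features["days_since_last_purchase"] = (
    snapshot_date - features["last_purchase"]
).dt.days
features = features.drop(columns="last_purchase")

print(features)  # one feature row per client, ready for a predictive model
```

Features like these are exactly the "human-suggested aggregates" mentioned in the VIBA "B" profile above; the talk then contrasts this conventional style with the unconventional alternatives.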


However, just demoting the traditional ways and solely pointing out their notorious downsides without naming the alternative(s) would be a bit unfair. Therefore, in the second half of my talk I want to walk you through unconventional ways of designing your features. What is more, I will take the help of 4 specific case studies where unconventional features were needed. I hope you might get inspired for your own work.

If you cannot make it (for whatever reason) to this talk and you are a member of the MightyData Community, you will find the presentation slides here after the event, in a member-only blog post. If you are not a member of the free MightyData Community yet, you can get your free membership in less than 2 minutes by registering here.

Looking forward to seeing you all at the AI Meet-up event!