Elections are like Hadoop

In my mother tongue, when someone does not know, or does not want to share, the reason for his actions, he says "Lebo medveď" (= "because of the bear"). That throws the actual bears off balance, because they have no clue either why you would do that. Having gone through several elections in the past few months, I have realized that elections are "Lebo Hadoop" = a bit like Hadoop. You don't see why? Well, then let both you and the yellow elephants understand:

Hadoop

Hadoop and other distributed computing and data analytics technologies are still less common in our companies than the state of progress would suggest. When explaining the concept to managers, I quite often have to argue that implementing Hadoop (which they have heard of at some recent business conference) is not like switching Fords for Volkswagens, or like migrating from MS SQL to Oracle. Truly embracing distributed technologies is a fairly massive change to a company's business processes. At this stage I usually run out of managerial understanding and start to hear statements like "After all, Hadoop is a piece of software, so it can just be installed in place of the current one, can't it?" You roll your eyes and start to wonder whether it is worth investing further effort in clarification when the chances of success are so small. And exactly for these situations, the following explanation comes in handy:

[Image: a local electoral committee at a polling station during elections]

In most European countries, elections are organized by means of voting districts. Every voter is registered to exactly one polling station, usually the one closest to his/her home address. That ensures the voter does not have to travel far to the polling station, which minimizes the time a single voter needs to cast his/her vote. Immediately after the election closes, the local polling station (through the hands of the local electoral committee) starts to count the votes cast for the different candidates. After they finish their work, they create an election protocol summarizing the results achieved in that polling station. This protocol then travels to the county electoral committee, which sums up all of its polling stations, and then, via the regional and national electoral committees, the entire election result is added up. We feel this process is somewhat natural, as we have got used to it over the decades. But let's imagine it were all done differently …

Imagine the whole country had to vote at the very same place. That would mean some people would travel a very long distance (which would certainly influence turnout). What is more, imagine how big the polling station would have to be to actually host those millions of voters. And how long the queue would be to actually cast a vote. A polling station that big would have to be a dedicated building built just for that purpose (of no other use outside the elections). Elections organized this way would only be possible with a flawless voter register; any small discrepancy in recognizing or registering a voter would clog the queue of people waiting to cast their vote. Such an election would be the ultimate hell for the local electoral committee. They would need, literally, weeks if not months before a single committee counted and verified several million ballots. And imagine that some of the electoral committee members got ill or went on strike. As there is only one electoral committee in the whole country, there is no replacement at hand. By this point you are probably getting irritated and thinking, "Who on Earth would go for such downright insanity?!"

[Image: a long queue of people]

Well, you might be surprised to find out that this is exactly the way our legacy relational SQL databases operate. They try to store all the data in the same table(s). As a result, the computer that can handle that much information has to possess immense capacity and is (like a single polling station for the whole country) both too robust and too expensive. What is more, these cannot be the regular PCs used by normal users; they have to be dedicated machines (usually not utilized outside of database tasks). All data points have to be written sequentially, which automatically creates long queues when the data load is massive. God save you if a single line of the intended data input is incorrect: the writing process is aborted while the rest of the data is still waiting to be inserted. If the main database gets corrupted or some calculation gets stuck, the whole business process freezes. Any slightly more complicated computation takes an enormous amount of time. And as more and more data flows in, it only gets worse over time.

After seeing all this, Hadoop said: Enough! Its operation and data storage are similar to the way we organize our elections. In Hadoop, the total data is split into a great number of smaller chunks (polling stations), each hosting a limited sub-group of (voter) records. Data is stored close to its origin (the same way a polling station is close to your home), so reading and writing the data is much faster. When the (election result) sum is requested, several (polling) places work in parallel to count the votes, instead of a single electoral committee calculating everything. After the calculation is completed in the individual places (Hadoop calls them "nodes"), the nodes pass their results to a higher level of aggregation, where the results from the nodes are summed into the total result. Since each node calculates rather small and simple amounts, common PCs can serve as hosts for the nodes; no single supercomputer (one mega polling station) is needed. These common PCs can have other uses outside the Hadoop calculation (the same way polling stations turn back into schools and community centers after the elections are over). The analogy runs deep, as Hadoop nodes have a similar autonomy to that of the local electoral committees in the voting process. The same resources available to a node both steer the read/write process and take part in the calculations (similarly to local electoral committee members both casting their own votes and counting all the votes cast). And finally, if one node (local polling station) fails, a neighbouring node can take over its task. This way, the system is much more resilient to failure or overload. (Imagine all voters having a portable voter certificate allowing them to vote at another polling station if their home one caught fire.)
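For the more hands-on readers, here is a minimal sketch in plain Python (a toy illustration of the map/reduce idea behind the analogy, not actual Hadoop code): each "polling station" counts its own ballots in parallel, and the partial tallies are then merged into the national result.

```python
# Toy illustration (not real Hadoop code) of the map/reduce idea behind
# the election analogy: each "polling station" (node) counts its own ballots,
# and the partial tallies are then merged into the national result.
from collections import Counter
from multiprocessing import Pool

def count_station(ballots):
    """Map step: one polling station tallies its own ballots locally."""
    return Counter(ballots)

def merge_tallies(tallies):
    """Reduce step: sum the partial protocols into the overall result."""
    total = Counter()
    for tally in tallies:
        total += tally
    return total

if __name__ == "__main__":
    # Hypothetical ballots, already split by polling station (data chunk).
    stations = [
        ["A", "B", "A", "C"],
        ["B", "B", "A"],
        ["C", "A", "A", "B", "A"],
    ]
    with Pool() as pool:                      # stations count in parallel
        partial = pool.map(count_station, stations)
    print(merge_tallies(partial))             # Counter({'A': 6, 'B': 4, 'C': 2})
```

The point to notice is that no single worker ever sees all the ballots; only the small per-station protocols travel upwards to be summed.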

Hadoop geeks may object that the Hadoop-elections analogy is not exact on two points. First, in order to prevent data loss, Hadoop intentionally stores copies of the same data chunk on more than one node. If this were replicated in the voting process, people would have to cast the same vote at several polling stations "just in case one of them catches fire", which is not really a desired phenomenon in proper elections (leaving aside the fact that there has been one legendary case of this actually happening in Slovakia). Secondly, there is no fixed, appointed national electoral committee overseeing the work of the local electoral committees; in Hadoop, any local committee can take on the role of the national one. However, leaving these two tiny details aside, the Hadoop-elections analogy fits remarkably well.

So the next time you need to explain to your boss (or anybody else) how Hadoop really works, remind them of the elections. Or simply say "Lebo medveď" …

The LONELINESS of the data analyst

A higher salary. New challenges. A misfit with the current boss. The desire to get one's hands on much larger data sets. Or pure world peace. None of the above is the main reason data analysts change jobs. But what is, then?

It is not just American beauty pageants; job interviews also produce quotes like "I wish for world peace". These are the ridiculous statements people give for why they decided to change jobs. I have interviewed more than 350 people in my career, so there is no shortage of even more bizarre statements, like "My father sends his greetings; he thinks Allianz is still a very good company and hopes you are going to get me the job." Since 2017 I have been doing interviews in a new country. I thought I would learn some new "world peace" equivalents of this market. However, my surprise has been much bigger.

Loneliness as the denominator

I felt sorry for the guy when it came up in the first interview. But when the second, third and then fourth candidate came up with the same reason for leaving their current job, I got on my toes. I wondered whether this market was some kind of fairy-tale kingdom cursed by a mighty wizard. I could not comprehend why so many people had loneliness as the common denominator of their lives. This was not the relationship or friendship kind of loneliness. It was all about work: analytical loneliness.

How the story goes …

You graduated from a technical university or an IT-savvy economics programme. You love puzzles and quizzes, you loved Math competitions and finding new, interesting links between things. So you look around the job market for a place where you could put your talent to use. You attend several interviews and find out that your potential employer has serious data conundrums ahead and that analysts will be measured against demanding criteria. You are thrilled. This is exactly what you are longing for: let some task beat the f**k out of you, so you can learn something new.

The first weeks on the job really are fun. You discover new data that nobody analyzed before you (and, as you will find out only later, nobody will after you either). The demanding analytical tasks turn out to be rather trivial data queries, often falling into the reporting-only category. But the work is fun, so never mind. Your boss does not really understand the nitty-gritty of your work; he would not be able to find a mistake in your queries. But he is really nice. Time flows, you pass your first quarter, year one, two … oh man, how time flies.

By your second Christmas in the company you beg Santa for a new boss, a new project, or at least a new version of the software you work with. Just to make some progress in your work life. You try to switch jobs: a start-up, a large corporation, a family-run business… Always a few moments of thrill and then, as with a diet, the yo-yo effect of frustration hits. In the turmoil of this internal fight, you sign up for an expert conference. After all, one can get inspired externally as well. There you meet a lot of fellow sufferers with the same fading spark in their eyes.


 Loneliness of data analysts

With a bit of alteration, this was the common plot of all four candidates' stories. None of them asked about salary or company benefits. None of them investigated the chances of future career growth. All they were interested in hearing was: WHAT will I work on, and UNDER WHOSE LEADERSHIP? The usual comment went like: "In my current job I am the only one doing data science. Nobody understands the matter, I have no one to consult about my approaches. I only know what I have googled on stackoverflow.com or similar portals. I feel lonely!"

It is interesting. When I gave it deeper thought, I realized I could find stories like that in my own circle of friends, too. The analyst's job is going through a strange phase right now. There is enormous demand for people who can analyze data. However, since Data Science is a very young branch of analytics, there are very few managers who have actually done Data Science themselves. Analysts thus often report to managers who are not experts in the area, and the young data scientists are doomed to the path of the bonsai: nobody expects you to grow big, they water you from time to time, but you are just a caricature of the real tree. The limits on the expert growth of analysts (a subject for another blog coming soon) are dire. These candidates' stories only reminded me of that once again.

The same chorus, three times over

Now it is clear to me that these were not isolated outliers. Sadly, the analysts' loneliness is indeed a bitter interplay of three factors. Firstly, there are still very few data scientists, so it is unlikely that several people with this job description bump into each other in the same company. Controlling, reporting, data engineering: those jobs repeat many times over within the same firm. Sophisticated analytics, however, is often a one-person post. As a result, deserted Data Scientists do not experience team spirit and have nobody with whom to consult their assumptions when they are uncertain.

The second factor fuelling the loneliness is the absence of managerial leadership. Sophisticated analysts often find themselves in teams that deal with data only marginally (e.g. sales or marketing), or in teams that do work with data, but only to report on structured databases. As a result, Data Scientists do not receive proper feedback on their work and are left to learn only from their own mistakes (which they have to detect themselves in the first place). Some torture themselves with self-study portals and courses, but few have the iron self-discipline to keep that up for several years in a row. Sooner or later, the majority just throw in the towel.

The two forces mentioned above are joined by a third factor: the lack of external know-how. No offence, but if you want to experience a conference where every speaker of the day contributes to your growth, you probably have to organize it yourself. No joke; trust me, I speak from my own experience. Expert conferences are rarely organized by people who can tell whether a speaker is beneficial or not. Most Data Science events are vendor-brainwashing or head-hunting traps. That is why Meetups (ever rising in popularity) are often the only remedy for the expert-conference hunger.


How to break the vicious circle?

Even though the discussion so far might not suggest it, there is a happy end to the four lonely candidates' stories. Although they represent a sad probe into the soul of the contemporary data analyst, they also show a way to break the vicious circle. Based on their accounts and my own experience, I would suggest the following three steps out of the loneliness:

Step 1: Self-diagnosis. Real change can only happen once the participants admit there is an issue to address; that makes the change necessary and unavoidable. So please ask yourself the following three questions: 1] Do I have a boss who understands my job well enough to temporarily step in during my absence? 2] Is there somebody else in our company whom I can ask whether I am doing my Data Science tasks correctly, or who can give me tips when I get stuck? 3] Over the last 6-9 months, did I get a chance to get my hands dirty with at least two new analytical approaches I had never tried before? If at least one "NO" emerged from these three questions, you should seriously think about your analytical future.

Step 2: A deep breath before the dive. If your self-diagnosis points to a move, do not run straight to the job portals to browse new job ads. Before you embrace the switch, get ready for the leap; if you start looking for a new job right away, you will most likely fail to land one. Summon a bit of self-discipline. Take some online courses, watch YouTube videos on your newly desired expertise. But above all, force yourself to actually try, with your own hands, the new skills you will need in the new job. Maybe start here.

Step 3: Look for a real Data Science leader in a real Data Science team. To escape analytical loneliness for good, one has to address all three underlying factors. Therefore, (a) you have to find a job where you will be (b) part of a larger team working on Data Science, and the team (c) will be led by a person with their own, real analytical experience. A team that has the ambition to work on plenty of new projects relying on analytics. Don't get fooled by sexy offers from companies where only one of the three aspects is hyped. Start-ups often look like a cool place to work, but the coolness is often accompanied by an inexperienced founder who "just" had a great idea, or by dire DIY conditions that leave you at the mercy of unreliable systems and unskilled neighbouring teams. I advise you to start the search by nailing down interesting projects and checking who is leading them. Alternatively, you can start your search from a respected analytics manager and check whether his or her team works on something that would make you dream big again. When it comes to high-quality managers, do not hesitate to revisit your former bosses' profiles either. As they say in airplane safety instructions: "Look around, as your nearest emergency exit may be located behind you."

On a final note, let me emphasize that analytical loneliness may well be a cyclical phenomenon. As the Data Science industry gains traction, teams will grow and eventually form a generation of mature Data Science leaders. In our region, however, it may take a good 5-7 years before that happens, which is probably too long a period to "shelter from the storm" in your current hideout. So if you identified with some aspects of analysts' loneliness, do not soothe yourself that it will get better on its own someday. Repeatedly doing the same stuff in a stable, well-paid job, where the boss does not know enough to meddle in your work or to fire you, can sound like a recipe for a nice life. So rather than galvanizing you into action, I ask you to give it a thought: How will the whole industry shift in the meantime? How will my chances of switching to more sophisticated analytics improve or worsen while I enjoy these "invincible" times? Whether you decide to stay or to get ready for the leap, let me wish you months (or years) without analytical loneliness.

CHILDREN had it REALLY tough on the Titanic

If you had been a child on the Titanic, you would have died with a probability of 46%. Damn sad. Especially considering that the overall survival rate on the Titanic was 39%, i.e. a death rate of 61%. Contrary to the (proclaimed) ladies-and-children-first evacuation rule, the death rate of children was thus only 15 percentage points lower than that of all ages combined. Did they disregard the evacuation mantra back then?

Maybe you think: hold on, it was a shipwreck on the open sea, so maybe the children were simply not able to swim. After all, we usually learn to swim a bit later in life. Well, the whole truth is revealed only if you drill down the children's survival rate by the class they had been traveling in on the Titanic:
[Graph: children's survival rate on the Titanic by travel class]

Children from the first and second class had an almost three times higher chance of survival than those from the third class. So swimming skills were not the decisive factor there, even if you had been Michael Phelps' son (or rather his grandfather, given the year of the catastrophe).

The following graph shows that the gentlemen from the upper class were not that much of gentlemen after all:

[Graph: survival rate by gender and travel class]

Men from the first class survived at a rate similar to women and children from the third class. By contrast, the most heroic stance was taken by the men from the second class, of whom only 8% survived the voyage of the legendary steamship. But you may wonder: how do we know all that?

Kaggle is a web portal that organizes data analysis competitions. What is more, on the portal one can find several training data sets intended for practising analytical skills. One of these sets is the Titanic survival dataset (you can download it HERE, but you have to be a registered Kaggle member). This attractive dataset compiles data on the passengers of the sadly famous Titanic voyage, with an added flag indicating whether they survived the incident or perished.
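To give a flavour of how the survival rates quoted above can be computed, here is a minimal pandas sketch. It assumes the standard columns of the Kaggle training file (train.csv) such as Survived, Pclass, Sex and Age; the age cut-off used for "child" is my own assumption, so your exact percentages may differ.

```python
import pandas as pd

# Assumes the Kaggle file train.csv with columns Survived, Pclass, Sex, Age.
df = pd.read_csv("train.csv")

# Overall survival rate (Survived is 0/1, so the mean is the rate).
print(df["Survived"].mean())

# Children's survival rate by travel class; "child" = under 16 is my own cut-off.
children = df[df["Age"] < 16]
print(children.groupby("Pclass")["Survived"].mean())

# Survival rate by class and gender (the "gentlemen" comparison).
print(df.groupby(["Pclass", "Sex"])["Survived"].mean())
```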

Using machine learning methods (which is what the dataset is intended for), you can discover how the location of your cabin within the ship, or how much you paid for your ticket, increased or decreased your chances of survival. One can also analyze whether having more relatives on the ship was a competitive advantage or a disadvantage in the fight for life. I will not rob you of the pleasure of finding all these "sunken treasures" yourself, but let me hint that some conclusions are pretty sinister (as the ones mentioned above are).
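As a gentle starting point for such predictions, a simple scikit-learn sketch might look like the one below. The chosen features and the decision tree are just one straightforward option of mine, not a recommended pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train.csv")  # Kaggle Titanic training file

# A few simple features: class, sex, age, fare and number of relatives aboard.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Relatives"] = df["SibSp"] + df["Parch"]
features = ["Pclass", "Sex", "Age", "Fare", "Relatives"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Survived"], test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```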

So do not hesitate and try some easy machine learning predictions with this super interesting dataset. Feel free to share additional findings with the THEmightyDATA community in the discussion below this blog. On a last note: "Near, far, wherever you are …", may your machine learning attempts be as convincing as Kate Winslet and Leonardo DiCaprio were on the railing 🙂

3 WAYS to TEACH ROBOTS human skills

Surprising as it may sound, the question of robots' moral standards has been with us for centuries, present well before the first functioning robot prototype was constructed. If you have seen the old Czech movie The Emperor's Baker and the Baker's Emperor (or its American cut, The Emperor and the Golem), or if you know the story of the Golem, you have actually encountered one of the first attempts to make robots morally competent. The clay robot was activated by inserting a magical stone (shem) into its forehead. From today's perspective, more interesting than the Golem's clay is the fact that the robot back then acted as good or bad by the standards of the man who activated it. Unfortunately, the story does not recall how this was achieved. It is even more puzzling because the Golem (and other early robots) did not have any senses to observe its surroundings or to adapt to its environment.

Fast forward: humankind has moved from legends to reality. We can build (much more elaborate) robots, and we can even equip them with all the senses a regular human possesses (sight, touch, hearing, smell and taste). However, on the question of how to make a robot good or evil by design, we are not much further along than in the Golem times. So what methods have we, humans, developed to teach a machine what is morally acceptable and what is not?

Some readers might be disappointed to find out that in programming robots we try to rely on the learning methods we use to teach ourselves. The supporters of these approaches argue that if we want to create robots compatible with humankind, we should provide them with an education comparable to the one we receive as humans. Some disillusion comes from opponents who point out that we hope robots will be better and fairer than the bar we, humans, set for them. Because, let's be honest, our education system still produces an abundance of cheating, violence and intolerance. Striving for higher moral standards in robots therefore does not seem an unreasonable request in that context. Be it reality or just our dream, these are the three ways we currently try to educate machines:

I. Mistakes as a path to success

Central Europeans have the right to be proud that one of the paths humans have picked to teach robots started in Slovakia. Marek Rosa (the Slovak founder of the gaming start-up Keen Software House) decided to devote his full focus to artificial intelligence in a company named GoodAI. Marek's approach uses a mentor to build the robot's thinking habits. Literally, the robot's own mistakes help it get better via loops of tasks served to it by the mentor. The robot tries all possible solutions to a problem, and the paths leading to the desired result are stored in the robot's permanent memory as useful concepts. The mentor's role is to feed the robot with ever more complicated tasks as it masters the simpler ones. When the robot faces conditions matching an already solved task from the past, it acts to deliver the desired result.

This approach probably follows the pattern of human learning most closely: first we learn to recognize digits, then to subtract and multiply, and only later to solve sets of equations or to describe a curve in space-time. The name GoodAI was chosen to indicate that artificial intelligence trained this way will not be thrown off balance by new circumstances. The robot simply selects the solution that minimizes the damage and, if it faces the new phenomenon more often, it improves its method over time towards the ideal resolution of the problem.
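As a very rough sketch of this mentor-driven trial and error (my own toy illustration, not GoodAI's actual code), imagine a learner that tries candidate actions on each task the mentor serves and memorizes the ones that worked as reusable concepts:

```python
import random

# Toy illustration of mentor-driven trial and error (my own sketch, not GoodAI code):
# the learner tries random actions on a task and memorizes the first one that works.

def learner(task, memory, actions, tries=100):
    """Reuse a remembered solution if the task was seen before, otherwise explore."""
    if task in memory:
        return memory[task]
    for _ in range(tries):
        action = random.choice(actions)
        if action == task:          # toy success criterion: action must match the task
            memory[task] = action   # store the working solution as a "concept"
            return action
    return None                     # no solution found within the budget

memory = {}
actions = ["push", "pull", "lift", "wait"]
curriculum = ["push", "lift", "push", "pull"]   # mentor serves progressively new tasks

for task in curriculum:
    print(task, "->", learner(task, memory, actions))
print("learned concepts:", memory)
```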

II. When sugar succumbs to the whip

A quite different approach has been chosen by Ron Arkin, an American professor of robotics at Georgia Tech. Based on the experience he gained programming robots for the American military, in the classic sugar-or-whip choice he leaves the caloric option aside. His approach builds on simulating emotions in robots. And what emotions they are. Arkin lets the robot decide and, after the decision, simulates joy or shame in the robot's system (by assigning black or red points to its solution). So if the robot hits a barrier and tries to smash through it by force, the teacher stimulates a feeling of shame in the robot's memory about the damage caused to the wall. The next time the robot hits a wall, it refrains from the solutions it felt ashamed about in the past and prefers the solutions it has been praised for. This approach matters because robots quickly learn to avoid unacceptable mistakes. In real life, these robots will be less blunt in a "surprising, never experienced" scenario than the ones trained by the first method.
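A crude way to picture this praise-and-shame bookkeeping (again my own toy sketch, not Arkin's implementation) is a score table: "shame" subtracts points from an action after a bad outcome, "praise" adds points after a good one, and the robot then prefers the best-scoring option:

```python
# Toy sketch (not Arkin's actual system): each action carries a score; "shame"
# subtracts points after a bad outcome, "praise" adds points after a good one,
# and the robot prefers the action with the highest score next time.
scores = {"smash_wall": 0.0, "go_around": 0.0, "wait": 0.0}

def feedback(action, outcome_good, amount=1.0):
    """Simulated emotion: praise (joy) or penalize (shame) the chosen action."""
    scores[action] += amount if outcome_good else -amount

def choose_action():
    """Prefer the action with the best accumulated praise-minus-shame score."""
    return max(scores, key=scores.get)

# The robot smashes the wall, gets "ashamed", and is praised for going around.
feedback("smash_wall", outcome_good=False)
feedback("go_around", outcome_good=True)
print(choose_action())   # -> "go_around"
```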

III. Read your robot a fairy tale before bedtime

The third approach relies on the way we, humans, develop moral standards in early childhood: fairy tales and stories. Mark Riedl, also from Georgia Tech, agrees with the GoodAI approach. But he argues that we do not have enough time to teach a robot every tiny bit of intelligence through a plethora of trials and failures.

Therefore, Riedl suggests that the robot "read" a great number of stories and break human thinking and acting down into cause-and-effect pairs. If, while reading the stories, the robot identifies a pattern that repeats, it stores it in memory and tries to validate or disprove this rule in the next stories it reads. Already from the legendary movie Short Circuit ("Number Five is alive", see the video) we know that robots can read enormously fast. Hence, this way of learning can progress much faster than the other methods, which involve human feedback. A robot can infer from innocent stories that, for example, when humans walk into a restaurant, they sit down and wait for the waitress to take their food order. Do you find this trivial? Well, then consider that robots would be perplexed as to why hungry humans do not storm into the restaurant kitchen and cook something for themselves, as they do at home. The advantage of the "fairy tale" approach is that it can also train complex behaviours that are very complicated to break down into the try-and-fail attempts used by Marek Rosa.
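A heavily simplified sketch of this idea (my own illustration, not Riedl's actual research system) could boil down to counting which event tends to follow which across many stories, and keeping the pairs that repeat as candidate rules:

```python
from collections import Counter
from itertools import pairwise   # Python 3.10+

# Toy sketch (not Riedl's actual system): stories are simplified into sequences
# of events; pairs of consecutive events that repeat across stories become
# candidate cause-and-effect rules.
stories = [
    ["enter_restaurant", "sit_down", "order_food", "eat", "pay"],
    ["enter_restaurant", "sit_down", "order_food", "eat", "pay", "leave"],
    ["enter_restaurant", "sit_down", "order_food", "complain", "leave"],
]

pair_counts = Counter()
for story in stories:
    pair_counts.update(pairwise(story))

# Keep only the pairs seen more than once as tentative rules.
rules = {pair: n for pair, n in pair_counts.items() if n > 1}
print(rules)   # e.g. ('enter_restaurant', 'sit_down'): 3, ('eat', 'pay'): 2, ...
```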

Together or against each other?

So, what do all three methods have in common? Training robots in moral standards cannot rely on pre-programmed routines. Even if we took the effort of rewriting all our moral standards into chains of "If X happens, then do Y", a robot educated by them would still be paralyzed when new circumstances arose. Robots trained this way would also be frozen in time with their standards, fully inapt for the dynamics of human change. Let's not forget that not so long ago women did not have the right to vote, and it was an owner's legal right to beat his slave in the street. Proper training of a robot must allow it to "learn as it goes" when new societal norms appear, the same way we pick up new customs upon arriving in a foreign culture: at the beginning we are a bit cautious and reserved, but after a few days we slowly learn not to be a bull in a china shop. A robot's training has several advantages over human education. Firstly, if one robot learns all the needed skills, all its subsequent copies get them right away from the moment of construction. What is more, state authorities can demand that all human-facing robots in a given country share common moral standards and stick to them compulsorily. Something that would so often be needed in our human lives as well. But that is a different fairy tale to read …