Did ChatGPT pass Data Science technical interview?

On last day of November 2022, bit in the shadow of the Cyber week craze, there has been released by OpenAI team for free testing the new ChatGPT. It is aimed to be an chat-bot using strong GPT 3.5 natural language model, capable of not only casual conversation but also able to answer real (even tough) expert questions, or write creatively texts, poetry or even whole stories.

As the features (and performance) of the model are pretty awesome step-up to what we have seen so-far, its launch immediately rolled the snowball of testing it in plethora of the domains. The craze seems to be actually so intense, that it is believed to be the first digital tool/service to reach 1 million of new users within 5 days of its official release. (To be fair, I think it is the first recorded one only, I am quite sure that in countries like India or China it is not unheard of gaining 1 mil users fast for something really catchy 😊)

But back to core story. The ChatGPT use-case, that was bringing the most havoc on LinkedIn and many blogs and news portals, is fact that can produce real snippets of code based on very simple specification of what the code should do. You can go really like “Show me code to predict survival rate on Titanic” and it returns in snap the Python code to fetch the data, create predictive model and run it, all in gleaming, well commented Python coding language. Or so it looks.

In effort to create my own opinion, I tried (and collected others’) attempts on coding inquiries to investigate the real quality of the code. I made a short summary of this early investigation in this this LinkedIn post. Tl;DR = it was not flaw-less code; if you try to run it, you will still often stumble upon errors, BUT … For somebody not having a clue how to attack the problem, it might be more than an inspiration.

Few days later, my dear friend (and former colleague) Nikhil Kumar Jha came with the idea to ask the ChatGPT one of the technical interview questions he remembered from the time I was hiring him into my team. He passed me the question and answer in message. And I have to say, the answer was pretty solid. That made my mind twisting. So, we quickly agreed to take the whole battery of the test that I use for technical interview for Data scientist and submit the ChatGPT “candidate” through the whole interview hassle. Rest of this blog tries to summarize how did the robot do and what are the implications of that. But before we get there: What do you think: Has the ChatGPT passed the technical round to be hired?

Technical interview to pass

Before jumping into (obviously most) juicy answer to question at the end of previous paragraph, let me give you a bit of the context about my interview as such. The market of the Data Scientist and Machine Learning engineers is full of “aspirational Data Scientists” ( = euphemism for pretenders). They rely on the fact that it is difficult to technically screen the candidate into details. Also the creativity of the hiring managers to design very own interview questions is relatively low, so if you keep on going to interview after interview, over several tens of rounds you can be lucky to brute force some o them (simply by piggybacking on the answers from failed past interviews).

To fight this, I have several sets of uniquely designed questions, that I rotate through (and secret follow-up questions ready for those answering the basic questions surprisingly fast). In general, the technical round needs to separate for me the average from great and yet genius from great. Thus, it is pretty challenging in its entirety. Candidate can earn 0 -100 points and the highest score I had in my history was 96 points. (And that only happened once; single digit number of candidates getting over 90 points from more than 300 people subjected to it). The average lady or gentleman would end up in 40 – 50 points range, the weak ones don’t make it through 35 points mark even. I don’t have a hard cut-off point, but as a rule of a thumb, I don’t hire candidate below 70 points. (And I hope to get to 85+ mark with candidates to be given offer). So now is the time to big revelation…

Did the ChatGPT get hired?

Let me unbox the most interesting piece here first and then support it with a bit of the details. So, dear real human candidates, the ChatGPT did not get hired. BUT it scored 61 points. Therefore, if OpenAI keeps on improving it version by version, it might get over the minimal threshold (soon). Even in tested November 2022 version, it would beat majority of the candidates applying for Senior data science position. Yes, you read right, it would beat them!

That is pretty eye-opening and just confirms what I have been trying to suggest for 2-3 years back already: The junior coding (and Data Science) positions are really endangered. The level of the coding skills needed for entry positions are, indeed, already within the realm of Generative AI (like ChatGPT is). So, if you plan to enter the Data Science or Software engineering career, you better aim for higher sophistication. The lower level chairs might not be for the humans any more (in next years to come).

What did robot get right and what stood the test?

Besides the (somewhat shallow) concern on passing the interview as such, more interesting for me was: On what kind of questions it can and cannot provide correct answers? In general, the bot was doing fine in broader technical questions (e.g. asking about different methods, picking among alternative algorithms or data transformation questions).

It was also doing more than fine in actual coding questions, certainly to the point that I would be willing to close one-eye on technical proficiency. Because also in real life interviews, it is not about being nitty-gritty with syntax, as long as the candidate provides right methods, sound coding patterns and gears them together. The bot was also good at answering straight forward expert question on “How to” and “Why so” for particular areas of Data Science or Engineering.

Where does the robot still fall short?

One of the surprising shortcomings was for example when prompted on how to solve the missing data problem in the data set. It provided the usual identification of it (like “n/a’, NULL, …), but it failed to answer what shall be done about it, how to replace the missing values. It also failed to answer some detailed questions (like difference between clustered and non-clustered index in SQL), funny enough it returned the same definition for both, even though prompted explicitly for their difference.

Second interesting failure was trying to swerve the discussion on most recent breakthroughs in Data Science areas. ChatGPT was just beating around the bush, not really revealing anything sensible (or citing trends from decade ago). I later realized that these GPT models still take months to train and validate, so the training data of GPT is seemingly limited to 2021 state-of affairs. (You can try to ask it why Her Majesty Queen died this year or what Nobel prize was awarded for in 2022 in Physics 😉 ).

To calm the enthusiasts, the ChatGPT also (deservedly and soothingly) failed in more complicated questions that need abstract thinking. In one of my interview questions, you need to collect the hints given in text to frame certain understanding and then use this to pivot into another level of aggregation within that domain. Hence to succeed, you need to grasp the essence of the question and then re-use the answer for second thought again. Here the robot obviously got only to the level 1 and failed to answer the second part of the question. But to be honest, that is exactly what most of the weak human candidates do when failing on this question. Thus, in a sense it is indeed at par with less skilled humans again.

How good was ChatGPT in the coding, really?

I specifically was interested in the coding questions, which form the core of technical screening for Data Science role. The tasks that candidate has to go through in our interview is mix of “show me how would you do” and “specific challenge/exercise to complete”. It also tests both usual numerical Data Science tasks as well as more NLP-ish exercises.

The bot was doing really great on “show me how would you do …” questions. It produced code that (based on descriptors) scores often close to full point score. However, it was struggling quite on specific tasks. In other words, it can do “theoretical principles”, it fails to cater for specific cases. But again, were failing, the solutions ChatGPT produced were the usual wrong solutions that the weak candidates come with. Interestingly, it was never a gibberish, pointless nonsense. It was code really running and doing something (even well commented for), just failing to do the task. Why am I saying so? The scary part about it that in all aspects the answers ChatGPT was providing, even when it was providing wrong one, were looking humanly wrong answers. If there was a Turing test for passing the interview, it would not give me suspicion that non-human is going through this interview. Yes, maybe sometimes just weaker candidate (as happens in real life so often as well), but perfectly credible human interview effort.

Conclusions of this experiment

As already mentioned, the first concern is that ChatGPT can already do as good as an average candidate on interview for Senior Data Scientist (and thus would be able to pass many Junior Data Scientist interviews fully). Thus, if you are in the industry of Data analysis (or you even plan to enter it), this experiment suggests that you better climb to the upper lads of the sophistication. As the low-level coding will be flooded by GPT-like tools soon. You can choose to ignore this omen on your own peril.

For me personally, there is also second conclusion from this experiment, namely pointing out which areas of our interview set need to be rebuilt. Because the performance of the ChatGPT in coding exercises (in version from November 2022) was well correlated with performance of human (even if less skilled) candidates. Therefore, areas in which robot could ace the interview question cleanly, signal that they are probably well described somewhere “out in wild internet” (as it had to be trained on something similar). I am not worried that candidate would be able to GPT it (yes we might replace “google it” with “GPT it” soon) live in interview. But the mere fact that GPT had enough training material to learn the answer flawlessly signals, that one can study that type of questions well in advance. And that’s the enough of concern to revisit tasks.

Hence, I went back to redrafting the interview test battery. And, of course, I will use “ChatGPT candidate” as guineapig of new version when completed. So that our interview test can stand its ground even in era of Generative AI getting mighty. Stay tuned, I might share more on the development here.

Older articles on AI topic:

AI tries to capture YET another human sense

Want to learn AI? Break shopping-window in Finland

REMOTE LEARNING now on AI steroids

5+1 interesting AI videos

Späť na domovskú stránku

FREE TEST to detect your VIBA profile

You might have heard about VIBA, the attempt to describe different Data Scientist profiles. You maybe even heard what V-I-B-A categories stand for and are wondering what profile is actually your one? Then this short test is exactly aimed at you. Via answering below listed questions, you can gain indication on what is most likely your VIBA “letter”

Instructions to FREE VIBA test

Please answer all of the 8 questions below. For each question pick one answer only. If you feel more answers might apply to you in given question, try to pick one, which is the closest to your situation at the moment. For each answer you will be assigned certain number of points stated in brackets. After answering all the questions, please sum the points earned for answers to these 8 questions. Total score achieved will guide you to what test believes is your existing Analytical letter.

1.What is the prevailing data types that are inputs for your Data Science work?

a] unstructured data OR sound/video data [12 points]

b] stream or batches of transactional data [1 p]

c] structured data of longer time periods with many aggregated Features or Proxies [4 p]

d] sensory data, readings from IoT devices or physical measurement [8p]

2. Which part of the company is most often requesting (or being user of) data science outputs that you create?

a] Online commerce, Social media or PR [1 p]

b] Innovation, Research & Development teams [12 points]

c] Traditional (human operated) Sales, Strategy or Business development [4 p]

d] Operations, Finance or IT dept. [8p]

3. How does the training of the new data science models you (and your team) generate happens most of the times?

a] through annotated examples, most likely generated by humans [12 points]

b] through data from experiments, observations or simulations of the reality [8p]

c] using short time window samples of long going process [1 p]

d] based on Features that are human selected aggregates or proxies of raw data [4 p]

4. What Data Science methods are dominantly used in Data Science tasks your team is working on? If more of the below listed are used, which of those would you keep if allowed to have only one category available?

a] Advanced Machine Learning, Random Forests, Regressions or simple FF neural networks [4 p]

b] Deep learning (often CNN, DCN, KN, LSTM…) [12 points]

c] Time series, simple ML classifiers or Graph analytical methods [1 p]

d] Rule engines generators, Genetic algorithms, or more advanced typologies of Neural Networks [8p]

5. What tools/platforms do you CERTAINLY have to have AVAILABLE for your work?

a] Keras, TensorFlow or similar, NLP or other text mining tools [10 points]

b] Google Analytics or APIs to Social media [1 p]

c] SQL and analytical packages or Opensource ML platforms. [4 p]

6. How does a typical task that you are asked to deliver in your team look like?

a] Describe and analyze user flow , conversion rates or user usage specifics [1 p]

b] Determine probability to do something OR describe segments of the users [4 p]

c] Teach systems to decide or replace human role in processes [8p]

d] Detect similarities or patterns in objects or texts [12p]

7. When doing Data Science in your team, what is the most used domain of the data?

a] users/clients preferences or online data [1 p]

b] Physical (2D or 3D) objects, art or result of some creative work [12p]

c] Purchase data, products or off-line customer data [4 p]

d] Processes and their stages, Motion or Logistics of the things [8p]

8. How long the necessary/typical time window of input data that you need to train your models or prepare your Data Science deliverable?

a] Time does not play role. Many repeatings/variations of the same object(s). [12p]

b] Short time windows, usually below 3 months [1 p]

c] (Near) Real time based, often of many different types or sources from same time window [8p]

d] Longer time windows, usually 6M+ of the analyzed matter/event [4 p]

So what VIBA letter you are?

If you scored 0 – 23 points your most likely letter is I, the Internet and Social media related tribe of Data Science. The closer to 8 points you are the more evident this is. The closer you came to 23 points, the more inclinations or overlap with other letters there might be.

If you scored 24 – 49 points your most likely letter is B, the Behavioral Analytics group of Data Science. Staying on lower bound of the interval indicates that you also probably asked to analyze online behavior of the users. The closer you came to 49 upper limit you are, the more your work might be used also to improve decision making of the processes of automate things.

If you scored 50 – 72 points your domain in Analytics is most likely A, the Automating & Autonomous space of Data Science. If you ended up just few points above the 50 lower limit, we would guess that your automation is still in area with strong Human aspect. Staying closer to upper limit of 72 points means that autonomous aspects are paramount and your models probably also rely on reality measuring or sensory inputs.

If you scored 73 – 94 points your are living your analytical life as V, in the Visual & Voice analytics & Words analytics arena of Data Science. Scores in 70’s range would indicate that your work is somehow useful or needed for the decision machines or automating things. Scores on the higher end of the interval signal pure sensory orientation, most likely living off Deep learning algorithms.

Happy about your VIBA letter? Surprised? Maybe you want to read more/again about your profile, now that you know which you are?

Or soon there will be coming) next blog discussing what should good analyst of your type do and know or how to move to make transition to VIBA letter .

Späť na domovskú stránku

CRM brainteasers or job interview tasks. Do you dare?

If you search the web or social media, you find plethora of math brainteasers. But if you want to put your grayish matter to test in CRM or Marketing there is not that many of riddles from these areas. You sit candidate for marketing position interview and lack some juicy case study to scan his/her real CRM abilities? There is a hope for you, now.

In past, you might have come across some of the beloved, older round of CRM riddles (I. round, II. round or III. round , sorry some only in Slovak) After bit of s short break, here we are with 4th round of the cunning CRM mind/benders. Do you dare to get them correct ?

4th Round Of CRM Riddles

4.1 Drier offer

You really had a weird day today. You are manager of CRM team of larger, national electricity utility for households with more than 800.000 retail clients. The VP of Marketing&Sales stopped by your table in the afternoon and passionately talked you through details of new cooperation contract with major electronic appliances chain, just signed by the board of your company. The pilot project of this new cooperation program will be aimed at offering well-discounted cloth drier to your customer base. You are asked for just a little help: to identify a proper target group for this offer. As you company has digital electricity usage meter installed for each customer, you have a 24 month-long history of electricity consumption in hourly readings from each of the customer. On top of that you have a customer profile with basic client data from contract signed between your company and the end/customer. How would you select the clients for mentioned drier offer ?

4.2 Vitamins at the petrol station

You were fed/up with bank analyst job, so you switched a job and now for more than 2 months you already as data analyst in large chain of petrol stations. Your company, operating aloyalty card program, has recently decided to extend the range of assortment offered at their petrol outlets with additional line of unregulated Vitamin products. You are asked to narrow down the selection of clients that should receive (fancy and thus costly) Vitamins introducing direct mail form the central marketing team. You are still under probabtion period, so you don’t want to spoil this and let your skills shine to superiors. How would you select the customers to be addressed?

4.3 Opening own chain of BIO restaurants

Obviously, more than 9 years in CRM team of the national Telco operator has allowed for loads of bizarre situations. But this made certainly your heart skip a beat. Top management of your Telco company has YESed to launch of new, own chain of BIO FOOD restaurants. You think they must be nuts, but after all its their business to burn the company cash. Or is it? Well would be fine, if only you haven’t been asked to generate list of existing clients that are highly probable to become clients of the soon-to-be restaurant chain. You have extracted all data and behavior insights (from Telco) that you have at disposal on clients had ever passed buy selected locations. How would you pick the correct target group ?

4.4 Interesting travelers’ behavior

For years you have pulled levers of central insight team for public transport operator in large 1.000.000+ European city (think Prague for instance). Your employer has issued local chip-enabled traveller’s ID that stores client identity and his travel season ticket. All of the vehicles operated by your company are fitted with strong chip readers located at any door of the vehicle. Thus, all clients entering and leaving vehicle are logged into your database. For each of the passenger you have at least 2 years of their travel history and these chip/running clients account for more than 85% of all transport company revenue. Propose 20 cunning client behavior parameters that you can distill from the data at your hand. How creative will you be ?

The solutions of the riddles will be published at TheMighyData blog in few days time. If you don’t want to miss their release, become a free-to-be member of TheMighyData community (who receive update on any new blog on this site).

Späť na domovskú stránku