Data scientists are often asked to evaluate experiments during their interviews, especially in take-home challenges. I've often seen candidates make a critical statistical error that leads them to use the wrong test for inference and evaluation. Let's walk through the issue and how to fix it!
#datascience #dataanalytics #datascientist #techinterview #careeradvice #causalinference #datascienceinterview #datasciencetok #statisticstutorials #datasciencetutorial #datasciencecourse
Pricing experiments are incredibly important for data scientists. However, when we offer a discount across products, randomization at the product level can still yield a biased result. Fortunately, we can recover the true effect using causal inference techniques. Code is available below.
#datascience #datascientist #causalinference #datasciencetutorial
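A minimal sketch of the bias (all numbers and the substitution mechanism below are illustrative assumptions, not the setup from the video): when a discount on one product cannibalizes sales from another, comparing discounted to non-discounted products overstates the true incremental demand.

```python
import numpy as np

# Illustrative simulation (assumed numbers): two substitutable products.
# A discount on product A lifts A's demand partly by stealing sales from B,
# so comparing discounted vs. non-discounted products overstates the true lift.
rng = np.random.default_rng(42)
n_days = 10_000

base_a = rng.poisson(100, n_days)          # baseline demand for product A
base_b = rng.poisson(100, n_days)          # baseline demand for product B

true_lift = 10                             # genuinely new demand created by the discount
stolen = 15                                # demand cannibalized from product B

# Product-level randomization: discount product A, keep product B as control
sales_a = base_a + true_lift + stolen      # treated product
sales_b = base_b - stolen                  # control product loses the cannibalized sales

naive_estimate = sales_a.mean() - sales_b.mean()                       # ~ true_lift + 2 * stolen
total_effect = (sales_a + sales_b).mean() - (base_a + base_b).mean()   # ~ true_lift

print(f"Naive product-level estimate: {naive_estimate:.1f}")
print(f"True incremental demand:      {total_effect:.1f}")
```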
Nate Silver made a bombshell claim: the probability that there is no manipulation of the polls is less than 1 in 9 trillion (!), or effectively impossible. I walk through the statistics and math to show where this figure comes from, using common data science techniques and Python code. Code is available below.
#datascience #datascientist #causalinference #uspolitics #election #election2024 #electionpolls2024 #electionpolls #datasciencetutorial
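A hedged sketch of the kind of calculation involved (the sample size, threshold, and poll count below are illustrative assumptions and will not reproduce the exact 1-in-9-trillion figure): if polls were independent draws, the chance that so many of them land within a point of the same margin is vanishingly small.

```python
from scipy.stats import norm, binom

# Illustrative numbers (assumptions, not Silver's exact inputs): suppose each
# swing-state poll samples ~800 respondents, so the sampling error on the
# candidate margin is roughly 2 * sqrt(0.25 / n) * 100 points.
n_respondents = 800
se_margin = 2 * (0.25 / n_respondents) ** 0.5 * 100   # std. error of the margin, in points

# Probability that one honest poll lands within +/- 1 point of the true margin
p_close = norm.cdf(1, scale=se_margin) - norm.cdf(-1, scale=se_margin)

# If polls are independent, the chance that (say) all 32 recent swing-state
# polls land that close to a tie is p_close ** 32 -- an astronomically small number.
n_polls = 32
prob_all_close = binom.pmf(n_polls, n_polls, p_close)  # same as p_close ** n_polls
print(f"P(single poll within 1 pt)    = {p_close:.3f}")
print(f"P(all {n_polls} polls within 1 pt) = {prob_all_close:.2e}")
```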
Confounding variables are the reason experimentation and causal inference are in such demand across the tech industry and in public policy. We often want to estimate the incremental impact of a treatment or policy, but because of confounding variables, comparing groups with and without the treatment may give us biased results ... results that may be wrong not just in magnitude but possibly even in direction!
#datascience #datasciencetutorial #datascientist #causalinference #datasciencecourse #datascienceforbeginners
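A minimal simulation of the point, using made-up numbers: a confounder that drives both treatment take-up and the outcome can make the naive comparison point in the wrong direction entirely, while adjusting for the confounder recovers the true effect.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative simulation (assumed numbers): a confounder drives both treatment
# take-up and the outcome, flipping the sign of the naive estimate.
rng = np.random.default_rng(0)
n = 50_000

confounder = rng.normal(size=n)                                     # e.g., underlying user engagement
treatment = (confounder + rng.normal(size=n) > 0.5).astype(float)   # engaged users self-select in
outcome = 2.0 * treatment - 5.0 * confounder + rng.normal(size=n)   # true treatment effect = +2

naive = sm.OLS(outcome, sm.add_constant(treatment)).fit()
adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([treatment, confounder]))).fit()

print(f"Naive estimate:    {naive.params[1]:+.2f}")     # negative -- wrong direction!
print(f"Adjusted estimate: {adjusted.params[1]:+.2f}")  # ~ +2, the true effect
```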
Linear regression is one of the most important tools for data scientists and data analysts. It helps to understand where the coefficient estimates come from so they don't seem random or like magic! Here I use linear algebra and calculus to demonstrate how we solve a simple linear regression model.
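As a quick illustration with made-up data: the calculus step (setting the derivative of the sum of squared errors to zero) yields the normal equations, and the linear algebra step solves them.

```python
import numpy as np

# Minimal sketch with made-up data: solve simple linear regression by hand
# using the normal equations, beta_hat = (X'X)^{-1} X'y, which come from
# setting the derivative of the sum of squared errors to zero.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 + 1.5 * x + rng.normal(size=200)   # true intercept 3, slope 1.5

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(f"Intercept: {beta_hat[0]:.3f}, Slope: {beta_hat[1]:.3f}")
print("Matches np.polyfit:", np.allclose(beta_hat[::-1], np.polyfit(x, y, 1)))
```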
As data scientists, we are often asked by stakeholders, "What is the impact?" Here are two methods we can use to answer that question. In this example, I use readily available stock market data to estimate both an event study model and a difference-in-differences model, evaluating the market value of the Starbucks CEO announcement. It turns out his 9-figure pay package is already a bargain!
I also go through methods of model validation and model selection.
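A rough sketch of the event-study piece (the event date, the windows, the SPY benchmark, and the yfinance data source are my assumptions for illustration, not necessarily the choices from the video): fit a market model on pre-event returns, then measure how far Starbucks' returns deviate from that model around the announcement.

```python
import statsmodels.api as sm
import yfinance as yf   # assumed data source; any daily price feed would do

# Market-model event study around the Starbucks CEO announcement (assumed date
# 2024-08-13); the window choices and the SPY benchmark are illustrative.
prices = yf.download(["SBUX", "SPY"], start="2023-08-01", end="2024-09-01")["Close"]
returns = prices.pct_change().dropna()

estimation = returns.loc[:"2024-07-31"]               # estimation window: before the event
event_window = returns.loc["2024-08-12":"2024-08-16"]

# Fit the market model SBUX_t = alpha + beta * SPY_t on the estimation window
market_model = sm.OLS(estimation["SBUX"], sm.add_constant(estimation["SPY"])).fit()

# Abnormal return = actual return minus the return the market model predicts
predicted = market_model.predict(sm.add_constant(event_window["SPY"]))
abnormal = event_window["SBUX"] - predicted
print(abnormal)
print(f"Cumulative abnormal return: {abnormal.sum():.1%}")
```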
As data scientists, we are often asked by stakeholders, "What is the impact?" of a marketing campaign or feature launch ... and often only after the treatment has already launched.
When experiments are not available, we have to turn to causal inference methods. Here I discuss one of the foundational methods: difference-in-differences.
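A minimal difference-in-differences sketch on made-up data, using the standard treated-by-post interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Minimal DiD sketch on made-up data: a treated group and a control group,
# observed before and after a launch. The interaction term treated:post is
# the DiD estimate of the treatment effect (under parallel trends).
rng = np.random.default_rng(7)
n = 20_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
# Outcome: group gap of 5, common time trend of 2, true treatment effect of 3
df["y"] = (10 + 5 * df["treated"] + 2 * df["post"]
           + 3 * df["treated"] * df["post"] + rng.normal(size=n))

did = smf.ols("y ~ treated * post", data=df).fit()
print(did.params["treated:post"])   # ~ 3, the causal effect
```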
A common mistake among data analysts and junior data scientists is confusing statistical significance with causality. In this example, I generate data and walk through how unobserved confounding variables can lead to misleading conclusions -- even statistically significant results that are completely wrong. In mainstream media, this mistake often shows up in discussions of wage gaps; for example, the college wage premium or gender wage gaps.
Code available at the link below.
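A small made-up example of the trap: the true effect below is exactly zero, yet the naive regression returns a large and highly significant estimate because an unobserved confounder drives both the treatment and the outcome.

```python
import numpy as np
import statsmodels.api as sm

# Made-up illustration: the true effect of the "treatment" on wages is ZERO,
# but an unobserved confounder drives both, so the naive regression finds a
# large, highly significant -- and completely wrong -- effect.
rng = np.random.default_rng(3)
n = 5_000

ability = rng.normal(size=n)                              # unobserved confounder
treatment = (ability + rng.normal(size=n) > 0).astype(float)
wages = 50 + 10 * ability + rng.normal(scale=5, size=n)   # treatment does NOT enter

naive = sm.OLS(wages, sm.add_constant(treatment)).fit()
print(f"Estimated effect: {naive.params[1]:.2f}")
print(f"p-value:          {naive.pvalues[1]:.2e}")   # tiny p-value, yet the true effect is 0
```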
If you want to be a data scientist or analyst, experiment evaluation is a foundational skill to learn. Here I demonstrate two ways you can code this yourself and compare the results to an online calculator.
Code for the t-test on the difference in means and for the linear regression is available at the link below.
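A minimal sketch of the two approaches on simulated data (with equal variances, the two-sample t-test and the regression on a treatment dummy are numerically equivalent):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

# Two equivalent ways to evaluate a simple A/B test on made-up data:
# (1) a two-sample t-test on the difference in means, and
# (2) OLS of the outcome on a treatment dummy -- the coefficient IS the difference in means.
rng = np.random.default_rng(11)
control = rng.normal(loc=10.0, scale=3.0, size=2_000)
treated = rng.normal(loc=10.4, scale=3.0, size=2_000)

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t-test:     diff = {treated.mean() - control.mean():.3f}, p = {p_value:.4f}")

y = np.concatenate([control, treated])
d = np.concatenate([np.zeros_like(control), np.ones_like(treated)])
ols = sm.OLS(y, sm.add_constant(d)).fit()
print(f"regression: diff = {ols.params[1]:.3f}, p = {ols.pvalues[1]:.4f}")
```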
This is a real case study interview question from a large social media company for a senior data scientist role. Case studies are often the most important module of the data science interview because the problems reflect the real-world projects you'll be expected to complete on the job. Here, I suggest three potential solutions to the problem.
There are many ways to compare the Biden and Trump economies. Inflation gets talked about quite a bit because of the spike during the Biden presidency.
But if we assume that world events and factors outside any president's control drive global changes in inflation, how does the US fare against other advanced economies -- particularly the G7?
One of the most important tools in a data scientist's causal inference toolkit is the synthetic control. This technique is often used to estimate the impact of a treatment or an event when no randomized experiment or A/B test is possible.
Here, I walk through an example project you can do at home using publicly available data to estimate the impact of a negative event — a project with highly transferable skills.
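A bare-bones sketch of the mechanics on made-up data (the donor pool, weights, and effect size are all assumptions): choose non-negative weights summing to one so a weighted average of donor units tracks the treated unit before the event, then use that synthetic unit as the post-event counterfactual.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal synthetic control sketch on made-up data: find non-negative weights
# summing to 1 over "donor" units so the weighted average tracks the treated
# unit's pre-period outcomes, then use that weighted average as the counterfactual.
rng = np.random.default_rng(5)
n_donors, n_pre, n_post = 10, 40, 12

donors_pre = rng.normal(size=(n_pre, n_donors)).cumsum(axis=0)      # donor outcomes, pre-period
true_w = np.array([0.5, 0.3, 0.2] + [0.0] * (n_donors - 3))
treated_pre = donors_pre @ true_w + rng.normal(scale=0.1, size=n_pre)

def loss(w):
    # squared pre-period tracking error of the weighted donor average
    return np.sum((treated_pre - donors_pre @ w) ** 2)

constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
bounds = [(0.0, 1.0)] * n_donors
w0 = np.full(n_donors, 1.0 / n_donors)
res = minimize(loss, w0, bounds=bounds, constraints=constraints)

# Post-period: the synthetic control is the weighted donor average; the gap
# between the treated unit and its synthetic control estimates the effect.
donors_post = donors_pre[-1] + rng.normal(size=(n_post, n_donors)).cumsum(axis=0)
treated_post = donors_post @ true_w - 2.0        # assumed true effect of -2 after the event
effect = treated_post - donors_post @ res.x
print(f"Estimated weights:        {np.round(res.x, 2)}")
print(f"Estimated average effect: {effect.mean():.2f}")
```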
Why is experimentation so powerful? If a stakeholder applies the treatment to a hand-picked group of users, there will likely be meaningful pre-existing differences between the treated and untreated groups. When we then observe differences in outcomes, it is hard to tell how much is due to the treatment and how much is due to those underlying differences.
In this video, I offer a simple example of how randomizing users into treated and control groups makes them more likely to be similar, and how sample size plays a critical role in determining just how similar we can expect them to be.
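A quick simulation of the sample-size point, with made-up numbers: under coin-flip assignment, the expected imbalance on a pre-existing covariate shrinks roughly like 1/sqrt(n).

```python
import numpy as np

# Illustrative simulation: randomly split users into treatment and control and
# measure how far apart the groups are on a pre-existing covariate (e.g., prior spend).
# As the sample size grows, the expected imbalance shrinks roughly like 1/sqrt(n).
rng = np.random.default_rng(9)

for n in [100, 1_000, 10_000, 100_000]:
    gaps = []
    for _ in range(200):                      # repeat the randomization many times
        prior_spend = rng.lognormal(mean=3.0, sigma=1.0, size=n)
        assign = rng.integers(0, 2, size=n)   # coin-flip assignment
        gap = prior_spend[assign == 1].mean() - prior_spend[assign == 0].mean()
        gaps.append(abs(gap))
    print(f"n = {n:>7}: average absolute imbalance in prior spend = {np.mean(gaps):.2f}")
```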