Data scientists are often asked to evaluate experiments during their interviews, especially in take-home challenges. I've often seen candidates make a critical statistical error that leads them to use the wrong test for inference and evaluation. Let's walk through the issue and how to fix it!
#datascience #dataanalytics #datascientist #techinterview #careeradvice #causalinference #datascienceinterview #datasciencetok #statisticstutorials #datasciencetutorial #datasciencecourse
Pricing experiments are incredibly important for data scientists. However, when we offer a discount across products, randomization at the product level can still yield a biased result. Fortunately, we can recover the true effect using causal inference techniques. Code is available below.
#datascience #datascientist #causalinference #datasciencetutorial
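A minimal sketch of the bias (all numbers and the substitution mechanism below are illustrative assumptions, not the setup from the video): when a discount on one product cannibalizes sales from another, comparing discounted to non-discounted products overstates the true incremental demand.

```python
import numpy as np

# Illustrative simulation (assumed numbers): two substitutable products.
# A discount on product A lifts A's demand partly by stealing sales from B,
# so comparing discounted vs. non-discounted products overstates the true lift.
rng = np.random.default_rng(42)
n_days = 10_000

base_a = rng.poisson(100, n_days)          # baseline demand for product A
base_b = rng.poisson(100, n_days)          # baseline demand for product B

true_lift = 10                             # genuinely new demand created by the discount
stolen = 15                                # demand cannibalized from product B

# Product-level randomization: discount product A, keep product B as control
sales_a = base_a + true_lift + stolen      # treated product
sales_b = base_b - stolen                  # control product loses the cannibalized sales

naive_estimate = sales_a.mean() - sales_b.mean()                       # ~ true_lift + 2 * stolen
total_effect = (sales_a + sales_b).mean() - (base_a + base_b).mean()   # ~ true_lift

print(f"Naive product-level estimate: {naive_estimate:.1f}")
print(f"True incremental demand:      {total_effect:.1f}")
```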
Nate Silver made a bombshell claim: the probability that there is no manipulation of the polls is less than 1 in 9 trillion (!), or effectively impossible. I walk through the statistics and math to show where this figure comes from, using common data science techniques and Python code. Code is available below.
#datascience #datascientist #causalinference #uspolitics #election #election2024 #electionpolls2024 #electionpolls #datasciencetutorial
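A hedged sketch of the kind of calculation involved (the sample size, threshold, and poll count below are illustrative assumptions and will not reproduce the exact 1-in-9-trillion figure): if polls were independent draws, the chance that so many of them land within a point of the same margin is vanishingly small.

```python
from scipy.stats import norm, binom

# Illustrative numbers (assumptions, not Silver's exact inputs): suppose each
# swing-state poll samples ~800 respondents, so the sampling error on the
# candidate margin is roughly 2 * sqrt(0.25 / n) * 100 points.
n_respondents = 800
se_margin = 2 * (0.25 / n_respondents) ** 0.5 * 100   # std. error of the margin, in points

# Probability that one honest poll lands within +/- 1 point of the true margin
p_close = norm.cdf(1, scale=se_margin) - norm.cdf(-1, scale=se_margin)

# If polls are independent, the chance that (say) all 32 recent swing-state
# polls land that close to a tie is p_close ** 32 -- an astronomically small number.
n_polls = 32
prob_all_close = binom.pmf(n_polls, n_polls, p_close)  # same as p_close ** n_polls
print(f"P(single poll within 1 pt)    = {p_close:.3f}")
print(f"P(all {n_polls} polls within 1 pt) = {prob_all_close:.2e}")
```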
Confounding variables are the reason experimentation and causal inference are in such demand across the tech industry and in public policy. We often want to estimate the incremental impact of a treatment or policy, but because of confounding variables, comparing groups with and without the treatment may give us biased results ... results that may be wrong not just in magnitude but possibly even in direction!
#datascience #datasciencetutorial #datascientist #causalinference #datasciencecourse #datascienceforbeginners
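A minimal simulation of the point, using made-up numbers: a confounder that drives both treatment take-up and the outcome can make the naive comparison point in the wrong direction entirely, while adjusting for the confounder recovers the true effect.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative simulation (assumed numbers): a confounder drives both treatment
# take-up and the outcome, flipping the sign of the naive estimate.
rng = np.random.default_rng(0)
n = 50_000

confounder = rng.normal(size=n)                                     # e.g., underlying user engagement
treatment = (confounder + rng.normal(size=n) > 0.5).astype(float)   # engaged users self-select in
outcome = 2.0 * treatment - 5.0 * confounder + rng.normal(size=n)   # true treatment effect = +2

naive = sm.OLS(outcome, sm.add_constant(treatment)).fit()
adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([treatment, confounder]))).fit()

print(f"Naive estimate:    {naive.params[1]:+.2f}")     # negative -- wrong direction!
print(f"Adjusted estimate: {adjusted.params[1]:+.2f}")  # ~ +2, the true effect
```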
Linear regression is one of the most important tools for data scientists and data analysts. It helps to understand where the coefficient estimates come from so they don't seem random or like magic! Here I use linear algebra and calculus to demonstrate how we solve a simple linear regression model.
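As a quick illustration with made-up data: the calculus step (setting the derivative of the sum of squared errors to zero) yields the normal equations, and the linear algebra step solves them.

```python
import numpy as np

# Minimal sketch with made-up data: solve simple linear regression by hand
# using the normal equations, beta_hat = (X'X)^{-1} X'y, which come from
# setting the derivative of the sum of squared errors to zero.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 + 1.5 * x + rng.normal(size=200)   # true intercept 3, slope 1.5

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(f"Intercept: {beta_hat[0]:.3f}, Slope: {beta_hat[1]:.3f}")
print("Matches np.polyfit:", np.allclose(beta_hat[::-1], np.polyfit(x, y, 1)))
```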
As data scientists, we are often asked by stakeholders, "What is the impact?" Here are two methods we can use to answer that question. In this example, I use readily available stock market data to estimate both an event study model and a difference-in-differences model, evaluating the market value of the Starbucks CEO announcement. It turns out his 9-figure pay package is already a bargain!
I also go through methods of model validation and model selection.
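A rough sketch of the event-study piece (the event date, the windows, the SPY benchmark, and the yfinance data source are my assumptions for illustration, not necessarily the choices from the video): fit a market model on pre-event returns, then measure how far Starbucks' returns deviate from that model around the announcement.

```python
import statsmodels.api as sm
import yfinance as yf   # assumed data source; any daily price feed would do

# Market-model event study around the Starbucks CEO announcement (assumed date
# 2024-08-13); the window choices and the SPY benchmark are illustrative.
prices = yf.download(["SBUX", "SPY"], start="2023-08-01", end="2024-09-01")["Close"]
returns = prices.pct_change().dropna()

estimation = returns.loc[:"2024-07-31"]               # estimation window: before the event
event_window = returns.loc["2024-08-12":"2024-08-16"]

# Fit the market model SBUX_t = alpha + beta * SPY_t on the estimation window
market_model = sm.OLS(estimation["SBUX"], sm.add_constant(estimation["SPY"])).fit()

# Abnormal return = actual return minus the return the market model predicts
predicted = market_model.predict(sm.add_constant(event_window["SPY"]))
abnormal = event_window["SBUX"] - predicted
print(abnormal)
print(f"Cumulative abnormal return: {abnormal.sum():.1%}")
```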
As data scientists, we are often asked by stakeholders, "What is the impact?" of a marketing campaign or feature launch ... and often only after the treatment has already launched.
When experiments are not available, we have to turn to causal inference methods. Here I discuss one of the foundational methods: difference-in-differences.
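A minimal difference-in-differences sketch on made-up data, using the standard treated-by-post interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Minimal DiD sketch on made-up data: a treated group and a control group,
# observed before and after a launch. The interaction term treated:post is
# the DiD estimate of the treatment effect (under parallel trends).
rng = np.random.default_rng(7)
n = 20_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
# Outcome: group gap of 5, common time trend of 2, true treatment effect of 3
df["y"] = (10 + 5 * df["treated"] + 2 * df["post"]
           + 3 * df["treated"] * df["post"] + rng.normal(size=n))

did = smf.ols("y ~ treated * post", data=df).fit()
print(did.params["treated:post"])   # ~ 3, the causal effect
```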
A common mistake among data analysts and junior data scientists is confusing statistical significance with causality. In this example, I generate data and walk through how unobserved confounding variables can lead to misleading conclusions -- even statistically significant results that are completely wrong. In mainstream media, this mistake often shows up in discussions of wage gaps; for example, the college wage premium or gender wage gaps.
Code available at the link below.
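A small made-up example of the trap: the true effect below is exactly zero, yet the naive regression returns a large and highly significant estimate because an unobserved confounder drives both the treatment and the outcome.

```python
import numpy as np
import statsmodels.api as sm

# Made-up illustration: the true effect of the "treatment" on wages is ZERO,
# but an unobserved confounder drives both, so the naive regression finds a
# large, highly significant -- and completely wrong -- effect.
rng = np.random.default_rng(3)
n = 5_000

ability = rng.normal(size=n)                              # unobserved confounder
treatment = (ability + rng.normal(size=n) > 0).astype(float)
wages = 50 + 10 * ability + rng.normal(scale=5, size=n)   # treatment does NOT enter

naive = sm.OLS(wages, sm.add_constant(treatment)).fit()
print(f"Estimated effect: {naive.params[1]:.2f}")
print(f"p-value:          {naive.pvalues[1]:.2e}")   # tiny p-value, yet the true effect is 0
```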
If you want to be a data scientist or analyst, experiment evaluation is a foundational skill to learn. Here I demonstrate two ways you can code this yourself and compare the results to an online calculator.
Code for the t-test on the difference in means and for the linear regression is available at the link below.
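A minimal sketch of the two approaches on simulated data (with equal variances, the two-sample t-test and the regression on a treatment dummy are numerically equivalent):

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

# Two equivalent ways to evaluate a simple A/B test on made-up data:
# (1) a two-sample t-test on the difference in means, and
# (2) OLS of the outcome on a treatment dummy -- the coefficient IS the difference in means.
rng = np.random.default_rng(11)
control = rng.normal(loc=10.0, scale=3.0, size=2_000)
treated = rng.normal(loc=10.4, scale=3.0, size=2_000)

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t-test:     diff = {treated.mean() - control.mean():.3f}, p = {p_value:.4f}")

y = np.concatenate([control, treated])
d = np.concatenate([np.zeros_like(control), np.ones_like(treated)])
ols = sm.OLS(y, sm.add_constant(d)).fit()
print(f"regression: diff = {ols.params[1]:.3f}, p = {ols.pvalues[1]:.4f}")
```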
This is a real case study interview question from a large social media company for a senior data scientist role. Case studies are often the most important module of the data science interview because the problems reflect the real-world projects you'll be expected to complete on the job. Here, I suggest three potential solutions to the problem.
There are many ways to compare the Biden and Trump economies. Inflation gets talked about quite a bit because of the spike during the Biden presidency.
But if we assume that world events and factors outside any president's control drive global changes in inflation, how does the US fare against other advanced economies -- particularly the G7?
One of the most important tools in a data scientist's causal inference toolkit is the synthetic control. This technique is often used to estimate the impact of a treatment or an event when no randomized experiment or A/B test is possible.
Here, I walk through an example project you can do at home using publicly available data to estimate the impact of a negative event — a project with highly transferable skills.
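A bare-bones sketch of the mechanics on made-up data (the donor pool, weights, and effect size are all assumptions): choose non-negative weights summing to one so a weighted average of donor units tracks the treated unit before the event, then use that synthetic unit as the post-event counterfactual.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal synthetic control sketch on made-up data: find non-negative weights
# summing to 1 over "donor" units so the weighted average tracks the treated
# unit's pre-period outcomes, then use that weighted average as the counterfactual.
rng = np.random.default_rng(5)
n_donors, n_pre, n_post = 10, 40, 12

donors_pre = rng.normal(size=(n_pre, n_donors)).cumsum(axis=0)      # donor outcomes, pre-period
true_w = np.array([0.5, 0.3, 0.2] + [0.0] * (n_donors - 3))
treated_pre = donors_pre @ true_w + rng.normal(scale=0.1, size=n_pre)

def loss(w):
    # squared pre-period tracking error of the weighted donor average
    return np.sum((treated_pre - donors_pre @ w) ** 2)

constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
bounds = [(0.0, 1.0)] * n_donors
w0 = np.full(n_donors, 1.0 / n_donors)
res = minimize(loss, w0, bounds=bounds, constraints=constraints)

# Post-period: the synthetic control is the weighted donor average; the gap
# between the treated unit and its synthetic control estimates the effect.
donors_post = donors_pre[-1] + rng.normal(size=(n_post, n_donors)).cumsum(axis=0)
treated_post = donors_post @ true_w - 2.0        # assumed true effect of -2 after the event
effect = treated_post - donors_post @ res.x
print(f"Estimated weights:        {np.round(res.x, 2)}")
print(f"Estimated average effect: {effect.mean():.2f}")
```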
Why is experimentation so powerful? If a stakeholder applies the treatment to a hand-picked group of users, there will likely be meaningful pre-existing differences between the treated and untreated groups. When we then observe differences in outcomes, it is hard to tell how much is due to the treatment and how much is due to those underlying differences.
In this video, I offer a simple example of how randomizing users into treated and control groups makes them more likely to be similar, and how sample size plays a critical role in determining just how similar we can expect them to be.
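A quick simulation of the sample-size point, with made-up numbers: under coin-flip assignment, the expected imbalance on a pre-existing covariate shrinks roughly like 1/sqrt(n).

```python
import numpy as np

# Illustrative simulation: randomly split users into treatment and control and
# measure how far apart the groups are on a pre-existing covariate (e.g., prior spend).
# As the sample size grows, the expected imbalance shrinks roughly like 1/sqrt(n).
rng = np.random.default_rng(9)

for n in [100, 1_000, 10_000, 100_000]:
    gaps = []
    for _ in range(200):                      # repeat the randomization many times
        prior_spend = rng.lognormal(mean=3.0, sigma=1.0, size=n)
        assign = rng.integers(0, 2, size=n)   # coin-flip assignment
        gap = prior_spend[assign == 1].mean() - prior_spend[assign == 0].mean()
        gaps.append(abs(gap))
    print(f"n = {n:>7}: average absolute imbalance in prior spend = {np.mean(gaps):.2f}")
```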