Propensity Score: How to Construct and Evaluate It

The propensity score is the chance that a person will get a particular treatment based on what is known about them at the start.

A propensity score is the probability that a person, customer, patient, or other unit receives a treatment based on observed characteristics measured before the treatment. Researchers use it in observational studies when treatment is not randomly assigned and selection bias may affect the results.

Selection bias happens when the treatment group and comparison group differ in ways that also affect the outcome. For example, in healthcare research, patients who receive a treatment may be older, sicker, or more engaged with care than patients who do not. In market research, customers who receive an offer may already be more likely to buy.

Propensity score methods help researchers make treated and untreated groups more comparable based on measured variables. They do not make observational data the same as randomized data, but they can reduce bias from observed confounders when used carefully.

Content Index

1. What is a propensity score?

2. When should you use a propensity score?

3. How do you construct a propensity score?

4. How do you evaluate a propensity score?

5. What methods use propensity scores?

6. What are the limitations of propensity scores?

7. What mistakes should you avoid?

8. Final thoughts

9. Frequently Asked Questions (FAQs)

What is a propensity score?

A propensity score is the conditional probability of receiving a treatment given observed baseline covariates.

In plain language, it estimates how likely someone was to receive the treatment based on what was known before the treatment happened. Rosenbaum and Rubin introduced the propensity score in 1983 as a way to adjust for observed covariates in observational studies.
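The definition can be written compactly. The notation below is an assumption introduced for illustration, not notation used elsewhere in this article: T is the treatment indicator (1 for treated, 0 for untreated), X is the vector of observed baseline covariates, and e(x) is the propensity score.

```latex
e(x) = \Pr(T = 1 \mid X = x), \qquad 0 \le e(x) \le 1
```

Under this notation, units with the same value of e(x) should have similar distributions of the observed covariates X regardless of whether they were treated, which is why the score can be used to build comparable groups.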

A covariate is a variable measured before treatment that may be related to treatment assignment, the outcome, or both. Examples include age, income, location, health status, prior purchase history, education level, or product usage.

A treatment can mean many things depending on the study, such as:

Receiving a medical treatment.
Seeing a marketing campaign.
Joining a loyalty program.
Using a product feature.
Receiving a discount.
Participating in a training program.

The outcome is the result the researcher wants to study, such as recovery, churn, purchase behavior, satisfaction, retention, or performance.

When should you use a propensity score?

Use a propensity score when you want to estimate a treatment effect from observational data and the treatment was not assigned at random.

A treatment effect is the estimated impact of the treatment on an outcome. In a randomized controlled trial, random assignment helps make groups comparable before treatment. In observational studies, people or units often select into treatment or are selected because of existing characteristics.

This analysis is useful when:

Random assignment is not possible or ethical.
The treatment and control groups differ at baseline.
You have enough pre-treatment covariates to model treatment assignment.
You want to reduce confounding from measured variables.
You need a transparent way to compare treated and untreated groups.

Propensity scores are common in healthcare, education, economics, social science, and applied quantitative research. They can also be useful in business research when analysts compare customers exposed to an intervention with similar customers who were not exposed.

How do you construct a propensity score?

To construct a propensity score, estimate each unit’s probability of receiving the treatment using observed pre-treatment covariates. The treatment indicator is the dependent variable in the propensity model, and the covariates are the predictors.

1. Define the treatment and comparison groups

Start by defining who received the treatment and who did not.

The treatment group includes units exposed to the intervention. The comparison group includes units not exposed to it. This definition must be clear before modeling begins.

For example:

Treatment group: customers who received a promotional email.
Comparison group: similar customers who did not receive that email.
Outcome: purchase within 30 days.

A vague treatment definition will create weak analysis. Be specific about timing, exposure, eligibility, and outcome measurement.

2. Choose pre-treatment covariates

Choose covariates measured before the treatment. These variables should be related to treatment assignment, the outcome, or both.

Good covariates may include:

Demographics.
Prior behavior.
Baseline health or performance measures.
Purchase history.
Engagement level.
Location.
Device type.
Customer segment.
Prior satisfaction scores.

Do not include variables caused by the treatment. A post-treatment variable can hide part of the treatment effect and distort the analysis.

Also avoid variables that perfectly predict treatment assignment. Propensity score methods need overlap between treatment and comparison groups. If one covariate perfectly separates the groups, there may be no fair comparison for some units.

3. Estimate the propensity score

Researchers often estimate propensity scores using logistic regression or probit regression.

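Logistic regression is a statistical model for binary outcomes, such as treated versus untreated. Probit regression is another binary-outcome model; its link function is based on the standard normal distribution, whereas logistic regression uses the log-odds (logit). In practice the two usually produce similar propensity scores.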

The model estimates a probability between 0 and 1 for each unit. A score of 0.80 means the model estimates an 80% probability that the unit would receive the treatment based on observed covariates.

The goal is not to predict treatment assignment perfectly. The goal is to create a score that helps balance measured covariates between treated and untreated groups.
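As a rough illustration of this step, the sketch below fits a logistic regression by plain gradient ascent using only the Python standard library. The data are simulated, and the function names (fit_logistic, propensity_scores) and settings (learning rate, iteration count) are hypothetical choices made for this example. In real work, an established statistics package would normally be used instead.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, t, lr=0.1, n_iter=1000):
    """Fit a logistic regression by batch gradient ascent.

    X: list of covariate rows (assumed roughly standardized)
    t: list of 0/1 treatment indicators
    Returns [intercept, coef_1, ..., coef_k].
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, ti in zip(X, t):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            err = ti - p
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        # Step along the gradient of the average log-likelihood
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    # Predicted probability of treatment for each unit
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            for row in X]

# Simulated data (hypothetical): two standardized pre-treatment covariates
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
true_p = [sigmoid(-0.5 + 1.0 * x1 + 0.5 * x2) for x1, x2 in X]
t = [1 if random.random() < p else 0 for p in true_p]

beta = fit_logistic(X, t)
ps = propensity_scores(X, beta)
```

Each value in ps is an estimated probability of treatment for one unit. The scores are an input to matching, weighting, or stratification, not the final result.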

4. Check overlap and common support

After estimating the propensity score, check whether treated and untreated groups have overlapping score distributions.

Common support means there are treated and untreated units with similar propensity scores. Without overlap, the data cannot support a credible comparison for those units.

For example, if some treated customers have propensity scores near 0.95 but no untreated customers have similar scores, the analysis cannot estimate the treatment effect well for that group.

Researchers often inspect overlap using histograms, density plots, or side-by-side score distributions. Units outside the common support may need to be excluded, but that changes the population to which the results apply.
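One simple way to flag units outside the common support is a minimum/maximum rule: keep only scores that fall inside the range covered by both groups. The sketch below assumes that rule, and the function names and scores are made up for illustration. Other rules, such as trimming at fixed score thresholds, are also used.

```python
def common_support_bounds(ps, t):
    """Return the score range covered by both treated and untreated units."""
    treated = [p for p, ti in zip(ps, t) if ti == 1]
    control = [p for p, ti in zip(ps, t) if ti == 0]
    lower = max(min(treated), min(control))
    upper = min(max(treated), max(control))
    return lower, upper

def in_common_support(ps, t):
    """Flag each unit as inside (True) or outside (False) the shared range."""
    lower, upper = common_support_bounds(ps, t)
    return [lower <= p <= upper for p in ps]

# Hypothetical scores: four untreated units followed by four treated units
ps = [0.05, 0.20, 0.35, 0.50, 0.30, 0.55, 0.80, 0.95]
t = [0, 0, 0, 0, 1, 1, 1, 1]

lower, upper = common_support_bounds(ps, t)
keep = in_common_support(ps, t)
print(lower, upper)  # 0.3 0.5
print(keep)          # only the units scoring 0.35, 0.50, and 0.30 are kept
```

In this toy example the treated units with scores of 0.55, 0.80, and 0.95 have no untreated counterparts at or above their score, so any estimate would apply only to units in the 0.30 to 0.50 range.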

How do you evaluate a propensity score?

You evaluate it by checking whether it balances baseline covariates between the treatment and comparison groups. A good propensity score process should make the groups look more similar on measured pre-treatment variables.

Check covariate balance

Covariate balance means the treated and comparison groups have similar distributions of baseline variables after matching, weighting, stratification, or adjustment.

You should check balance before and after applying the method. The analysis should show whether imbalance was reduced.

Useful balance checks include:

Means or proportions by group.
Standardized mean differences.
Variance ratios.
Distribution plots.
Balance tables.
Love plots.

A love plot is a chart that shows covariate imbalance before and after adjustment, often using standardized mean differences.

Use standardized mean differences

A standardized mean difference, or SMD, measures the difference in a covariate between groups in standard deviation units.

SMDs are widely used because, unlike p-values, they do not depend on sample size. In a large sample a trivially small difference can be statistically significant, and in a small matched sample a meaningful difference can fail to reach significance, so p-values are a poor balance diagnostic.

Many applied studies treat an absolute SMD below 0.10 as a rough sign of acceptable balance, but the threshold should not be used blindly. Researchers should still consider the study context and the importance of each covariate.
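For a continuous covariate, a common form of the SMD divides the difference in group means by the square root of the average of the two group variances. The sketch below assumes that form. The ages are invented for illustration, and smd is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def smd(treated, control):
    """Standardized mean difference for a continuous covariate.

    Uses the square root of the average of the two sample variances
    as the standardizing denominator.
    """
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical baseline ages before any matching or weighting
age_treated = [62, 58, 71, 66, 60]
age_control = [45, 52, 49, 58, 51]

age_smd = smd(age_treated, age_control)
print(round(age_smd, 2))
```

Here the treated group is much older than the comparison group, so the absolute SMD is far above 0.10. The same calculation would be repeated on the matched or weighted sample to see whether the imbalance shrank.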

Avoid relying only on AUC or c-statistics

The AUC, also called the c-statistic, measures how well a model distinguishes treated from untreated units. It is useful for judging prediction, but propensity score analysis is mainly about balance.

A model with high predictive power may separate groups too strongly, which can reduce overlap. A model with a lower AUC may still produce better covariate balance.

The question to ask is not, “Did the model predict treatment well?”
It is, “Did the method balance the groups on important measured covariates?”

What methods use propensity scores?

Common propensity score methods include matching, weighting, stratification, and covariate adjustment. Austin’s review of propensity score methods explains these approaches as ways to reduce confounding in observational studies.

Method	How it works	Best use case
Propensity score matching	Matches treated units with untreated units that have similar scores	Creating comparable pairs or groups
Propensity score weighting	Weights observations based on treatment probability	Estimating effects in a weighted sample
Propensity score stratification	Divides units into score-based groups or strata	Comparing outcomes within similar score ranges
Covariate adjustment	Adds the propensity score to an outcome model	Simple adjustment, but often less robust than matching or weighting

Propensity score matching

Propensity score matching pairs treated and untreated units with similar propensity scores. The goal is to create groups that are more comparable than the original sample.

A caliper is a maximum allowed distance between matched propensity scores. Tight calipers can improve match quality, but may exclude more units.

After matching, check covariate balance again. Matching is not successful just because every treated unit has a match. It is successful only if balance improves.
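A minimal sketch of one matching approach is shown below: greedy 1:1 nearest-neighbor matching without replacement, with a caliper on the raw propensity score. The function name, the caliper of 0.05, and the scores are assumptions for illustration. Many applications match on the logit of the score instead, and other algorithms (optimal matching, matching with replacement) exist.

```python
def greedy_match(ps, t, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns a list of (treated_index, control_index) pairs. Treated units
    whose nearest available control is farther than the caliper stay unmatched.
    """
    treated = [i for i, ti in enumerate(t) if ti == 1]
    available = {i for i, ti in enumerate(t) if ti == 0}
    pairs = []
    for i in treated:
        if not available:
            break
        # Nearest available control; ties broken by lowest index
        j = min(available, key=lambda c: (abs(ps[i] - ps[c]), c))
        if abs(ps[i] - ps[j]) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs

# Hypothetical scores: units 0-2 are treated, units 3-6 are untreated
ps = [0.30, 0.55, 0.90, 0.28, 0.33, 0.58, 0.10]
t = [1, 1, 1, 0, 0, 0, 0]

pairs = greedy_match(ps, t, caliper=0.05)
print(pairs)  # [(0, 3), (1, 5)]
```

The treated unit with a score of 0.90 is left unmatched because no untreated unit falls within the caliper. This shows how a tight caliper trades sample size for match quality.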

Propensity score weighting

Propensity score weighting uses the score to assign weights to units. One common approach is inverse probability of treatment weighting, often called IPTW.

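IPTW weights each unit by the inverse of the estimated probability of the treatment it actually received: 1 divided by the propensity score for treated units, and 1 divided by one minus the propensity score for untreated units. Units whose actual treatment status was unlikely given their covariates therefore get larger weights. This can create a weighted sample where measured covariates are more balanced across groups.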

Weighting can be powerful, but extreme weights can create unstable estimates. Researchers should inspect weight distributions and consider trimming or alternative weighting methods when needed.
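The weighting idea can be sketched in a few lines. The example below assumes the usual inverse probability weights and compares weighted outcome means that are normalized by the sum of the weights in each group. The scores, outcomes, and function names are hypothetical.

```python
def iptw_weights(ps, t):
    """Inverse probability of treatment weights.

    Treated units get 1 / e; untreated units get 1 / (1 - e),
    where e is the estimated propensity score.
    """
    return [1.0 / p if ti == 1 else 1.0 / (1.0 - p) for p, ti in zip(ps, t)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def iptw_effect(y, ps, t):
    """Difference in weighted outcome means, treated minus untreated."""
    w = iptw_weights(ps, t)
    y_t = [yi for yi, ti in zip(y, t) if ti == 1]
    w_t = [wi for wi, ti in zip(w, t) if ti == 1]
    y_c = [yi for yi, ti in zip(y, t) if ti == 0]
    w_c = [wi for wi, ti in zip(w, t) if ti == 0]
    return weighted_mean(y_t, w_t) - weighted_mean(y_c, w_c)

# Hypothetical data: three treated units, then three untreated units
ps = [0.8, 0.5, 0.2, 0.8, 0.5, 0.2]
t = [1, 1, 1, 0, 0, 0]
y = [1, 1, 0, 1, 0, 0]

weights = iptw_weights(ps, t)
effect = iptw_effect(y, ps, t)
print(max(weights))  # the largest weight is worth inspecting
print(round(effect, 3))
```

The treated unit with a score of 0.2 and the untreated unit with a score of 0.8 each receive a weight of about 5, four times the smallest weight. With scores closer to 0 or 1 the weights grow much larger, which is the instability the paragraph above warns about.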

Propensity score stratification

Propensity score stratification divides the sample into groups based on score ranges, such as quintiles.

Within each stratum, treated and untreated units should have similar propensity scores. Researchers then compare outcomes within strata and combine the results.

Stratification is easier to explain than some weighting methods, but it still requires balance checks inside each stratum.
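As a sketch of the stratification logic, the function below sorts units by score, splits them into equal-sized strata, takes the treated-minus-untreated difference in mean outcome within each stratum, and averages those differences weighted by stratum size. The function name and data are hypothetical. Skipping strata that lack one group is a simplifying assumption, and it changes the population the estimate describes.

```python
def stratified_effect(y, ps, t, n_strata=5):
    """Size-weighted average of within-stratum mean outcome differences."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    n = len(order)
    total, used = 0.0, 0
    for s in range(n_strata):
        idx = order[(s * n) // n_strata:((s + 1) * n) // n_strata]
        y_t = [y[i] for i in idx if t[i] == 1]
        y_c = [y[i] for i in idx if t[i] == 0]
        if not y_t or not y_c:
            continue  # no within-stratum comparison is possible
        diff = sum(y_t) / len(y_t) - sum(y_c) / len(y_c)
        total += diff * len(idx)
        used += len(idx)
    if used == 0:
        raise ValueError("no stratum contains both treated and untreated units")
    return total / used

# Hypothetical data: each quintile holds one untreated and one treated unit
ps = [0.10, 0.12, 0.30, 0.32, 0.50, 0.52, 0.70, 0.72, 0.90, 0.92]
t = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
y = [1, 3, 2, 4, 3, 5, 4, 6, 5, 7]

effect = stratified_effect(y, ps, t)
print(effect)  # 2.0
```

In this constructed example every treated unit scores exactly 2 points higher than the untreated unit in its stratum, so the combined estimate is 2.0. Real data would also need a balance check inside each stratum.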

Covariate adjustment using the propensity score

Covariate adjustment includes the propensity score as a predictor in an outcome model.

This method is simple, but it may not balance covariates as well as matching or weighting. It can be useful in some settings, but researchers should be cautious and still check diagnostics.

What are the limitations of propensity scores?

Propensity scores can reduce bias from measured covariates, but they cannot fix unmeasured confounding.

Unmeasured confounding happens when an important variable affects both treatment assignment and the outcome but is missing from the data. If that variable is not measured, the propensity score cannot balance it.

Other limitations include:

Poor overlap between groups.
Missing important baseline covariates.
Incorrect model specification.
Extreme weights.
Loss of sample after matching.
Results that apply only to the matched or weighted population.
Overinterpretation of observational findings as proof of causality.

A propensity score can support causal reasoning, but it does not prove causality on its own.

What mistakes should you avoid?

A common mistake is treating the propensity score model like a prediction model. The goal is balance, not prediction.

Avoid these mistakes:

Including post-treatment variables in the propensity score model.
Ignoring overlap and common support.
Reporting treatment effects without balance diagnostics.
Using p-values alone to check balance.
Relying only on AUC or c-statistics.
Matching without checking whether covariates are balanced.
Forgetting that unmeasured confounders may still bias results.
Making claims that sound stronger than the design supports.

A clean analysis should explain the covariates used, the method chosen, the balance diagnostics, the excluded observations, and the target population for the estimate.

Final thoughts

A propensity score is useful when researchers need to compare treated and untreated groups in observational data. It helps reduce imbalance in measured baseline covariates, which can make treatment effect estimates more credible.

The strongest analyses do not stop after estimating the score. They check overlap, evaluate covariate balance, choose an appropriate method, and explain the limits of the design.

Propensity score analysis works best when it is planned carefully, reported transparently, and interpreted with caution. It is a helpful tool, but it is not a substitute for good research design.

Frequently Asked Questions (FAQs)

What is a propensity score in simple terms?

A propensity score is the estimated probability that someone receives a treatment based on observed characteristics. It helps researchers compare treated and untreated groups that may differ before treatment.

Is propensity score matching the same as propensity score analysis?

No. Propensity score matching is one method within propensity score analysis. Other methods include weighting, stratification, and covariate adjustment using the estimated propensity score.

What variables should be included in a propensity score model?

A propensity score model should include pre-treatment covariates related to treatment assignment, the outcome, or both. It should not include variables caused by the treatment or variables measured after treatment.

Can propensity scores prove causation?

No. Propensity scores can reduce bias from measured confounders, but they cannot remove bias from unmeasured variables. They support causal inference only when assumptions, design, and diagnostics are reasonable.

How do you know if a propensity score worked?

A propensity score method worked if treated and untreated groups are better balanced on important baseline covariates after matching, weighting, stratification, or adjustment than they were before. Standardized mean differences are commonly used to check this.

Why is common support important?

Common support is important because treatment effects cannot be estimated reliably when treated units have no comparable untreated units. Without overlap, the analysis depends on unsupported comparisons.
