An Intro to Predictive Modeling for Salesforce Peeps

I wanted to write about prediction because it’s seemingly everywhere: in marketing emails, in real estate, and even in criminal justice (blah).

An understanding of prediction and modeling can be pretty beneficial in business, however. Think lead scoring for new prospective clients or donors, employee attrition, or revenue forecasting. By understanding the capabilities of prediction, it might help us predict what data to collect in order to make more informed decisions.

Salesforce has released some A.I. products over the last few years under various names: Einstein, Tableau CRM, Wave Analytics. We can also create models ourselves, using different programming languages such as R and Python, or even in Excel.

First, a quote.

“All models are wrong, but some are useful”

George E. P. Box

Human nature can be unpredictable. We are also biased and have blind spots. I recently watched the documentary Coded Bias on Netflix, which showed how bad models can affect real people. So handle prediction with care.

So how does prediction work?

A predictive model tells us, based on existing criteria, what outcome is expected. For instance, what is someone’s income based on their race, gender, and profession?

Think of this as a mathematical formula similar to the ones we learn in high school for a straight line or a parabola. The inputs are x and the outputs are y.

y=mx+b

y=x^2

y=1/x

To build a predictive model, we use mathematical methods to find a formula that provides an output that is as close as possible to the real output for the greatest number of records (e.g. people, schools, whatever).

We can use both categorical (e.g. picklists) and numeric values in prediction.

So let’s say we want to predict a donor’s 2021 donation amount. We can use a method called regression.

In order to build a model to predict someone’s next donation amount, we need a “training data set” – a report of donors who have made 2021 Donations. We choose fields we think may be relevant (called independent variables, e.g. Age, Gender, Last Donation Amount, and Average Donation Size) and a “target” dependent variable (a field called 2021 Donation Amount.)

Then, we use a tool (Excel, R, Python, or a calculator with a huge memory :P) to create the model. Our goal is to get the most accurate model, so we may remove variables that don’t add that much value.

Our final output might look like this:

ŷ (this is the predicted value of 2021 Donation Amount) = 10 (this is the y-intercept) + 1.4*Age + 1.5*GenderFemale(true=1 or false=0) + 1.02*LastDonationAmount + 0.01*AverageDonationSize

So to predict the next donation amount of a donor who is:

  • Age = 45
  • Gender = Male
  • LastDonationAmount=$200
  • AverageDonationSize=$250

We would calculate the predicted value here:

ŷ = 10 + (1.4*45) + (1.5*0)+(1.02*200)+(.01*250)

ŷ = $279.50

NOTE: This is not a real model, and I am also oversimplifying.

From this result, we can see that the predicted donation amount for 2021 is higher than the donor’s last donation amount and average donation. Perhaps, in our fictional world, donors are very generous in 2021.

Conclusion and Disclaimer

I am passionate about educating people on how different technology works and hopefully saving them some money. However, I am a student and I am writing about what I am learning in class at the current time, and I am not an expert.

If you’re interested in learning more, check out online courses on modeling with R or Python, and then apply these skills to your Salesforce data. Courses will be able to explain some of the road bumps to look out for (e.g. multicollinearity, non-normality) and go more into detail about how to figure out if your model is actually a reliable one.

The example above is what is called linear regression. Logistic regression is another type of model…a way to predict the likelihood of an event. A famous example is whether or not someone died on the Titanic, which, spoiler alert: given a number of variables, is predicted best gender (“women and children” first) and ticket level (a proxy for wealth). I don’t know which method would be quicker for me: watching the 2 1/2 hour Kate Winslet and Leonardo DiCaprio movie or trying to build a model in R, but I digress…

Happy modeling!