logos of various colleges and universities

If I apply to six schools, what are my chances of getting into at least one? Many websites provide calculators that estimate your chances of getting into a specific school given your SAT scores and GPA. That is super-useful, but it does not tell you the chance of getting into at least one, or at least two, of the schools you applied to. …
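As a rough illustration of the question above: if each school's decision is treated as independent (a strong assumption that is not made in the text), the chance of at least one acceptance is one minus the product of the rejection probabilities, and the chance of at least two follows from the distribution of the number of acceptances. The function names and the per-school probabilities below are made up for illustration.

```python
from math import prod


def chance_at_least_one(probs):
    """P(at least one acceptance), ASSUMING independent decisions."""
    return 1 - prod(1 - p for p in probs)


def chance_at_least_k(probs, k):
    """P(at least k acceptances), ASSUMING independent decisions."""
    # dist[j] = probability of exactly j acceptances among the schools seen so far
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1 - p)   # rejected by this school
            new[j + 1] += q * p     # accepted by this school
        dist = new
    return sum(dist[k:])


# Hypothetical admission chances for six schools (not real data)
probs = [0.10, 0.20, 0.25, 0.40, 0.60, 0.80]
print(f"at least one: {chance_at_least_one(probs):.3f}")
print(f"at least two: {chance_at_least_k(probs, 2):.3f}")
```

In practice admission decisions are likely correlated (the same application goes to every school), so figures computed this way are probably best read as optimistic.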

Unlike joins, relationships preserve the native granularity of data, reducing the need for LOD expressions.

Last summer, Tableau introduced a new way of combining data. It is called relationships. The old way of combining data using joins is still available, and I imagine that many of us might stick with the familiar joins for a while. However, relationships have much to recommend them and this post will show some of their ins and outs. Consider the three tables below:

Image by author.

Using joins, Tableau would combine these tables into one flat file like this:

What if you had to predict how many passengers would survive the Titanic shipwreck? Are methods optimized for classification still appropriate?

If you are reading this, then you have probably tried to predict who will survive the Titanic shipwreck. This Kaggle competition is a canonical example of machine learning, and a rite of passage for any aspiring data scientist. What if, instead of predicting who will survive, you only had to predict how many will survive? Or what if you had to predict the average age of survivors, or the sum of the fares that the survivors paid?

How should we aggregate classification predictions?

There are many applications where classification predictions need to be aggregated. For example, a customer churn model may generate probabilities that a customer will…
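To make the aggregation question concrete, here is a minimal sketch comparing two ways of estimating how many positives a group contains: counting thresholded class labels versus summing predicted probabilities. The function names, the probabilities, and the 0.5 threshold are illustrative assumptions, not values from the text.

```python
def count_from_labels(probs, threshold=0.5):
    """Classify each case first, then count the predicted positives."""
    return sum(1 for p in probs if p >= threshold)


def expected_count(probs):
    """Sum the probabilities; if they are well calibrated, this is the expected count."""
    return sum(probs)


# Hypothetical churn probabilities for five customers (not real data)
churn_probs = [0.4, 0.4, 0.4, 0.4, 0.4]

print(count_from_labels(churn_probs))          # no customer crosses the threshold
print(round(expected_count(churn_probs), 2))   # yet about two are expected to churn
```

The example is deliberately extreme: every individual prediction is "will not churn", while the probabilities taken together imply that roughly two of the five customers will.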

Photo by Marco Secchi on Unsplash

During the 2009 recession, many firms suspended 401(k) contributions. The tendency to do so during this recession has been relatively modest so far. TowersWatson reported that only 12% of employers in their survey suspended 401(k) contributions, and 23% are considering the move. A new development is that retirement contributions are now targeted by higher education institutions including Duke, Northwestern, Georgetown, Johns Hopkins, Chicago, USC, Michigan, BU, Washington University, Alabama, Pace, Drexel, American and many others. The sector faces significant challenges: uncertain enrollment, the high cost of safely opening campuses, and unfavorable demographics. Perhaps it is not surprising that the relatively…

And how traffic data impacts excess deaths, one proxy for COVID-19 mortality

Over 35,000 people die in car crashes annually in the U.S. Another three million are injured. The closing of the U.S. economy in mid-March of 2020 dramatically reduced traffic across the United States. I created a visualization to document the accompanying reductions in traffic collisions in nine U.S. cities. The purpose was to assess the magnitude of these reductions, how they vary across cities, over time, and over geography within the cities.

Traffic collisions declined dramatically during stay-at-home orders. The declines are concentrated in city centers.

What does the viz show?

Collisions drop a lot. Across nine cities, the average reduction in collisions for the period from mid-March to late June in 2020 compared to the same period…

Train, validate and test partitions for out-of-time performance take planning and thought

The purpose of supervised machine learning is to classify unlabeled data. We want algorithms to tell us whether a borrower will default, whether a customer will make a purchase, or whether an image contains a cat, a dog, a malignant tumor or a benign polyp. The algorithms “learn” how to make these classifications using labeled data, i.e. data where we know whether the borrower actually defaulted, whether the customer made a purchase, and what a blob of pixels actually shows. Normally, researchers take the labeled data and split it three ways: training, validation and testing/hold-out (the terminology sometimes differs). They train hundreds of models on train data, and…
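One common way to set up the three-way split with out-of-time evaluation in mind is to partition by date rather than at random, so that validation and test records come strictly after the training period. The sketch below assumes each record carries an observation date; the field name `as_of`, the cutoff dates, and the toy records are hypothetical, and this is not necessarily the scheme the article goes on to describe.

```python
from datetime import date


def out_of_time_split(rows, train_end, valid_end, key="as_of"):
    """Split records chronologically into train, validation and test sets."""
    train = [r for r in rows if r[key] <= train_end]
    valid = [r for r in rows if train_end < r[key] <= valid_end]
    test = [r for r in rows if r[key] > valid_end]
    return train, valid, test


# Hypothetical monthly loan records for 2019 (not real data)
rows = [{"as_of": date(2019, m, 1), "defaulted": m % 4 == 0} for m in range(1, 13)]

# Train on January to August, validate on September and October, test on the rest
train, valid, test = out_of_time_split(rows, date(2019, 8, 1), date(2019, 10, 1))
print(len(train), len(valid), len(test))
```

Because the cutoffs are explicit arguments, the same function can be rerun with later dates as new labeled data arrives.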

Tomas Dvorak

I am a Professor of Economics at Union College in Schenectady, NY. I spent my last sabbatical on the data science team at a local health insurer.
