Browsing all 13 articles


Bad Bayes: an example of why you need hold-out testing

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit. The example is designed to imitate a common situation found in predictive analytic natural language...
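
The article's dataset is not reproduced in this teaser, so here is only a minimal Python sketch of the failure mode it describes, on synthetic data: many rare, pure-noise features let Naive Bayes look nearly perfect on its own training data, while a held-out test set reveals chance-level performance.

```python
# Sketch: many rare binary "word"-like features, labels that are pure noise.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2016)
n_rows, n_features = 200, 1000
X = (rng.random((n_rows, n_features)) < 0.02).astype(int)  # sparse noise features
y = rng.integers(0, 2, size=n_rows)                        # labels independent of X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = BernoulliNB().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # typically near 1.0
print("test accuracy: ", model.score(X_test, y_test))    # typically near 0.5
```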


Random Test/Train Split is not Always Enough

Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is...
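
As one hedged illustration of the limitation (hypothetical synthetic data, not the book's example): when several rows describe the same customer, a random row-level split puts the same customer on both sides of the split, and scores look far better than a split that holds out whole customers.

```python
# Sketch: rows cluster by customer; a random row split leaks customer
# identity between train and test, while a group-aware split does not.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
n_customers, rows_per = 50, 10
groups = np.repeat(np.arange(n_customers), rows_per)
centers = rng.normal(size=(n_customers, 5))
X = centers[groups] + 0.1 * rng.normal(size=(groups.size, 5))  # rows cluster by customer
y = rng.normal(size=n_customers)[groups]                       # per-customer outcome

model = RandomForestRegressor(n_estimators=100, random_state=0)
print("random row split R^2:  ",
      cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean())
print("customers held out R^2:",
      cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups).mean())
```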


A bit more on testing

If you liked Nina Zumel’s article on the limitations of random test/train splits, you might want to check out her recent article on predictive analytics product evaluation hosted by our friends at...


On Nested Models

We have recently been working on, and presenting on, nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or...
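
As a rough Python sketch of the situation (hypothetical models and data; out-of-fold prediction is one standard mitigation, not necessarily the article's full treatment):

```python
# Sketch: model B consumes model A's prediction as an input feature.
# Training B on A's in-sample predictions overstates A's reliability,
# so we feed B out-of-fold predictions instead.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(size=500)

model_a = RandomForestRegressor(n_estimators=100, random_state=0)

# Leaky version: the stage-two feature comes from a model that saw these rows.
a_in_sample = model_a.fit(X, y).predict(X)

# Safer version: each row's stage-one prediction comes from folds that
# excluded that row.
a_out_of_fold = cross_val_predict(model_a, X, y, cv=5)

X_b = np.column_stack([X, a_out_of_fold])
model_b = LinearRegression().fit(X_b, y)
```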


vtreat cross frames

John Mount, Nina Zumel, 2016-05-05. As a follow-on to “On Nested Models”, we work R examples demonstrating “cross validated training frames” (or “cross frames”) in vtreat. Consider the...
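
The article itself works the examples in R; purely as a loose sketch of the same idea, the Python port of vtreat (assuming the vtreat package on PyPI) returns such a cross frame from fit_transform() on training data:

```python
# Sketch, assuming the Python port of vtreat (pip install vtreat).
# fit_transform() on training data returns a "cross frame": each row's
# derived variables come from treatment plans fit on other rows.
import numpy as np
import pandas as pd
import vtreat

rng = np.random.default_rng(0)
d = pd.DataFrame({
    "x_cat": rng.choice(["a", "b", "c"], size=100),
    "y": rng.normal(size=100),
})

transform = vtreat.NumericOutcomeTreatment(outcome_name="y")
cross_frame = transform.fit_transform(d, d["y"])  # cross-validated training frame
# For genuinely new data, apply transform.transform(new_data) as usual.
```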


Laplace noising versus simulated out of sample methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design...
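
A minimal sketch of the idea on synthetic data (the coding details and noise scale here are illustrative assumptions, not Bilenko's exact recipe):

```python
# Sketch: encode a high-cardinality category by the conditional mean of
# the outcome (an "impact/count code"), then add Laplace noise so the
# code does not simply memorize the training rows it was built from.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
d = pd.DataFrame({
    "zip": rng.choice([f"z{i}" for i in range(50)], size=1000),
    "y": rng.normal(size=1000),
})

grand_mean = d["y"].mean()
code = d.groupby("zip")["y"].mean() - grand_mean     # per-level impact code
scale = 0.1                                          # noise scale: a tuning choice
noised_code = code + rng.laplace(loc=0.0, scale=scale, size=code.size)

d["zip_impact"] = d["zip"].map(noised_code)
```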


A Theory of Nested Cross Simulation

[Reader’s Note. Some of our articles are applied, and some are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through...


Sharing Modeling Pipelines in R

Reusable modeling pipelines are a practical idea that gets re-developed many times in many contexts. wrapr supplies a particularly powerful pipeline notation, and a pipe-stage re-use system (notes...
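
wrapr's notation is R-specific and not shown in this excerpt; purely as a loose Python analogue of the reusable-pipeline idea (not wrapr itself), sklearn's Pipeline bundles modeling stages into one shareable artifact:

```python
# Loose analogue only: a reusable, serializable sequence of modeling stages.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
# After pipeline.fit(X, y), the whole treatment-plus-model object can be
# saved and shared as a single artifact, e.g.:
#   import joblib; joblib.dump(pipeline, "pipeline.joblib")
```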


When Cross-Validation is More Powerful than Regularization

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso...
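
A minimal sketch of that definition on synthetic data: ridge regression penalizes squared coefficient magnitude, so the fitted coefficients of the regularized model come out smaller.

```python
# Sketch: with few rows and many columns, an unregularized linear model
# overfits; ridge shrinks the coefficients toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))            # few rows, many columns: overfit-prone
y = X[:, 0] + rng.normal(size=50)

unregularized = LinearRegression().fit(X, y)
regularized = Ridge(alpha=10.0).fit(X, y)
print("coef magnitude, unregularized:", np.abs(unregularized.coef_).sum())
print("coef magnitude, ridge:        ", np.abs(regularized.coef_).sum())
```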



PyData Los Angeles 2019 talk: Preparing Messy Real World Data for Supervised...

Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk we describe how to use vtreat, a package available in R and in...


Python Data Science Tip: Don’t use Default Cross Validation Settings

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross validation settings. These settings can default to a...
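
A concrete instance of the trap (scikit-learn behavior at the time of writing; worth re-checking against your version): KFold defaults to shuffle=False, so on data whose rows are sorted by class the unshuffled folds are badly misleading.

```python
# Sketch: iris rows come sorted by class, so unshuffled contiguous folds
# put whole classes outside the training data. Ask for shuffling explicitly.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)        # 150 rows, sorted by class
model = LogisticRegression(max_iter=1000)

unshuffled = cross_val_score(model, X, y, cv=KFold(n_splits=3))
shuffled = cross_val_score(model, X, y,
                           cv=KFold(n_splits=3, shuffle=True, random_state=0))
print("unshuffled folds:", unshuffled)   # each test fold is a class never trained on
print("shuffled folds:  ", shuffled)
```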


Cross-Methods are a Leak/Variance Trade-Off

We have a new Win Vector data science article to share: “Cross-Methods are a Leak/Variance Trade-Off”, by John Mount and Nina Zumel (Win Vector LLC), March 10, 2020. We work some exciting...


Use the Same Cross-Plan Between Steps

Students have asked me if it is better to use the same cross-validation plan in each step of an analysis or to use different ones. Our answer is: unless you are coordinating the many plans in some way...
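
One way to coordinate them, sketched in Python on hypothetical data: materialize a single plan as an explicit list of splits and pass those identical splits to every step.

```python
# Sketch: build one cross-validation plan and reuse the identical splits in
# every step (here: out-of-fold feature generation, then scoring).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X[:, 0] + rng.normal(size=300)

plan = list(KFold(n_splits=5, shuffle=True, random_state=42).split(X, y))

# Step 1: out-of-fold predictions from the first model, under the plan.
stage_one = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=plan)

# Step 2: evaluate the second model under the *same* plan, not a fresh split.
X2 = np.column_stack([X, stage_one])
print("per-fold R^2:", cross_val_score(LinearRegression(), X2, y, cv=plan))
```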
