The main idea is very simple: we analyze the first few seconds of a clip and build a histogram of the color distribution for each frame. Then we average these histograms into a single averaged color distribution histogram, and find the frame whose histogram is closest to that average (I am using RMSE to estimate "closeness"). We select a frame close to the beginning of the video, which makes the selection process faster (fewer frames to examine) and less likely to include spoilers. The selected picture is similar in color distribution to the overall video theme, making it more likely to show a typical frame.
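To make the steps concrete, here is a minimal sketch of the selection logic in plain Python. It is not the actual source code: the function names, the 16-bin-per-channel histogram, and the representation of a frame as a list of `(r, g, b)` tuples are all assumptions made for illustration, and decoding the first few seconds of the clip into frames is left out.

```python
import math

def color_histogram(pixels, bins=16):
    """Normalized per-channel histogram of one frame.

    `pixels` is a list of (r, g, b) tuples with values 0-255 (an assumed
    frame format). The result holds 3 * bins values; each channel sums to 1.
    """
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + value * bins // 256] += 1.0
    total = len(pixels)
    return [count / total for count in hist]

def rmse(a, b):
    """Root mean square error between two equal-length histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def pick_representative_frame(frames, bins=16):
    """Return (index, errors): the frame closest to the average histogram."""
    hists = [color_histogram(frame, bins) for frame in frames]
    # Average the histograms bin by bin.
    mean_hist = [sum(column) / len(hists) for column in zip(*hists)]
    errors = [rmse(hist, mean_hist) for hist in hists]
    best = min(range(len(frames)), key=errors.__getitem__)
    return best, errors

# Toy example: three tiny "frames" of four pixels each.
black = [(0, 0, 0)] * 4
white = [(255, 255, 255)] * 4
mixed = [(0, 0, 0)] * 2 + [(255, 255, 255)] * 2
best, errors = pick_representative_frame([black, white, mixed])
# best == 2: the half-black, half-white frame matches the average exactly.
```

In the toy example the average of the three histograms is half black and half white, so the mixed frame wins with an error of zero. On real video the frames would come from a decoder, and the bin count is a knob worth experimenting with.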
I ran it on a few hundred video clips, and it shows pretty good results. Of course, these results are not representative, since I've selected the most interesting ones, but generally I think it is very usable. You can grab the source code and try it yourself.