This paper is Michael Oltmans' 2007 PhD thesis from the Massachusetts Institute of Technology. Unlike the other papers that have been presented in this blog, this approach is vision based: it uses a sliding window called a "bulls-eye" that is split, like a darts target, into concentric rings with radial divisions. The outer rings' sections are larger than the inner rings' (on a logarithmic scale), reflecting that points near the middle are in the "focus" / center of the fovea and more important than points that fall in the outer rings.
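The log-polar binning can be sketched as follows; the ring and wedge counts and the 37.5-pixel radius (half of the 75-pixel normalized window) are illustrative assumptions, not Oltmans' exact parameters:

```python
import math

def bullseye_bin(dx, dy, n_rings=4, n_wedges=8, r_max=37.5):
    """Map a point (given relative to the bulls-eye center) to a
    (ring, wedge) bin of the log-polar histogram.

    Ring boundaries grow logarithmically with radius, so bins near the
    center are finer than bins near the rim.  n_rings, n_wedges, and
    r_max are illustrative values, not the thesis' exact parameters.
    """
    r = math.hypot(dx, dy)
    if r >= r_max:
        return None  # point falls outside the bulls-eye window
    # Logarithmic ring index: radii below 1 pixel land in ring 0.
    ring = 0 if r < 1 else min(n_rings - 1,
                               int(n_rings * math.log(r) / math.log(r_max)))
    # Wedge index from the polar angle, normalized to [0, 2*pi).
    theta = math.atan2(dy, dx) % (2 * math.pi)
    wedge = int(n_wedges * theta / (2 * math.pi)) % n_wedges
    return ring, wedge
```

Counting how many stroke points land in each (ring, wedge) bin gives the histogram that the distance metric below compares.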
Oltmans makes the shapes captured by the bulls-eye rotation invariant by aligning each shape so that its stroke direction points to the right along the x-axis. Because the estimated direction can be off by 180 degrees, Oltmans also flips the stroke direction and matches against that alignment as well. To prevent the stroke's points from falling on either side of the bulls-eye's horizontal boundary and damaging the comparison, the bulls-eye is rotated so that a single wedge (bin) straddles the horizontal axis. The stroke direction itself is found using a sliding window of 19 points, over which an orthogonal distance regression fits the best line. Shapes in the bulls-eye are normalized for scale invariance so that they are 75 pixels wide and tall, and the strokes within the shape are resampled so that consecutive points are at least one pixel apart.
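A simplified sketch of the alignment step, assuming the direction is taken from the principal axis of the whole stroke (the same line an orthogonal distance regression fits) rather than Oltmans' 19-point sliding window:

```python
import math

def align_to_x_axis(points):
    """Rotate a stroke so its dominant direction lies along the x-axis.

    A simplified stand-in for the thesis' 19-point orthogonal distance
    regression: here the direction comes from the principal axis of the
    whole stroke's covariance, which is the line that minimizes the
    orthogonal point-to-line distances.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Covariance terms of the centered points.
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Angle of the principal axis.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    # Rotate by -theta so that axis maps onto the x-axis.
    c, s = math.cos(-theta), math.sin(-theta)
    return [((x - cx) * c - (y - cy) * s,
             (x - cx) * s + (y - cy) * c) for x, y in points]
```

The 180-degree ambiguity remains after this step, which is why the recognizer also tries the flipped alignment.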
Oltmans' recognition scheme is temporally invariant (stroke order doesn't matter), since the entire figure is brought into focus at once, preserving its context regardless of the order in which its strokes were drawn.
The bulls-eye uses a histogram-based scoring method to compare the part of the sketch it has in focus against template bulls-eyes of other shapes. For this comparison, he uses the following chi-squared-style distance metric:
- Sum( (Qi - Pi)^2 / (Qi + Pi) )
- where Qi is the number of points in bin i of the input bulls-eye and Pi is the number of points in bin i of the template bulls-eye.
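In code, the metric looks like this; representing each bulls-eye as a flat list of per-bin counts is an assumption for illustration:

```python
def bullseye_distance(q, p):
    """Chi-squared-style histogram distance between two bulls-eyes.

    q and p are flat lists of per-bin point counts (input vs. template).
    Bins that are empty in both histograms are skipped to avoid
    dividing by zero; identical histograms score 0.
    """
    return sum((qi - pi) ** 2 / (qi + pi)
               for qi, pi in zip(q, p) if qi + pi > 0)
```

Lower scores mean a closer match, so the template with the smallest distance wins.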
To determine which regions should be brought into focus, i.e. which contexts/shapes to run recognition on, Oltmans moves along each stroke and adds a candidate context every 10 points. This process is repeated with larger and larger windows to find shapes drawn at different sizes. The system then classifies some candidate regions as wires, which don't match any other kind of shape. The remaining candidate regions are either combined or split into candidate regions that the bulls-eyes process for final classification. The initial set of clusters is found using an expectation-maximization clusterer. The clusterer is given a vector of each candidate region's top-left and bottom-right corners, weighted by the squared scores of each candidate region. It outputs the mean and standard deviation of the four coordinates, which define a mean candidate region used as a new region for shape prediction. These new regions are split if they are too large, as judged by the standard deviations from the clusterer. Among any remaining overlapping regions, the highest-weighted cluster is greedily chosen.
Finally, predictions are made on the remaining clusters; within each set of overlapping clusters, the highest-scoring cluster is kept and the rest are thrown away. This proceeds until all regions have been classified.
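The greedy selection among overlapping candidates can be sketched as follows; the bounding-box overlap test and the (weight, box) representation are assumptions for illustration, not the thesis' exact formulation:

```python
def resolve_overlaps(regions):
    """Greedily keep the highest-weighted region among overlapping ones.

    regions: list of (weight, (x1, y1, x2, y2)) with x1 < x2 and y1 < y2.
    Candidates are visited from highest weight to lowest; a candidate is
    kept only if it does not overlap any already-kept region.
    """
    def overlaps(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        # Two boxes intersect iff they overlap on both axes.
        return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

    kept = []
    for weight, box in sorted(regions, reverse=True):
        if not any(overlaps(box, k) for _, k in kept):
            kept.append((weight, box))
    return kept
```

This is the same greedy non-maximum-suppression pattern used in many detection pipelines: sort by score, then suppress anything that collides with a winner.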
Reference:
Michael Oltmans, "Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches," PhD thesis, Massachusetts Institute of Technology, 2007.