Saturday, April 20, 2013

A Visual Approach to Sketched Symbol Recognition


In their paper, Tom Y. Ouyang and Randall Davis present a visual recognizer built on what they call the "Image Deformation Model" (IDM), which compares each point in a given image, within a 3x3 local window, to a corresponding window in a template image. Each point has values for a set of 5 features:

  1. Horizontal Orientation of Stroke at Point
    • 0 degrees
  2. "Forward Slash" Diagonal Orientation of Stroke at Point 
    • (Note: "Forward Slash" is a term that I have assigned)
    • 45 degrees
  3. Vertical Orientation of Stroke at Point
    • 90 degrees
  4. "Back Slash" Diagonal Orientation of Stroke at Point
    • (Note: "Back Slash" is a term that I have assigned)
    • 135 degrees
  5. Endpoint
    • 1.0 if the point is an endpoint of a stroke
    • 0.0 if the point is not an endpoint of a stroke
The points that are given these feature values are the output of a normalization process in which the strokes are resampled, scaled, and translated for temporal, scale, and position invariance, respectively. Ouyang and Davis do their scaling after translating the center of mass to the origin (0,0); the points are then scaled so that they have unit standard deviation from the center along both axes.
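
To make the preprocessing concrete, here is a minimal sketch of that normalization (my own reconstruction in Python with NumPy, not the authors' code): resample the stroke to a fixed number of points, move the center of mass to the origin, and divide by the per-axis standard deviation.

```python
import numpy as np

def normalize_points(points, n_samples=64):
    """Resample, translate, and scale a stroke's points (my reconstruction).

    points: (N, 2) array of raw (x, y) samples from one or more strokes.
    """
    pts = np.asarray(points, dtype=float)

    # Resample to a fixed number of points, evenly spaced along arc length,
    # for temporal (sampling-rate) invariance.
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, cum[-1], n_samples)
    resampled = np.column_stack([
        np.interp(targets, cum, pts[:, 0]),
        np.interp(targets, cum, pts[:, 1]),
    ])

    # Translate the center of mass to the origin for position invariance.
    resampled -= resampled.mean(axis=0)

    # Scale each axis to unit standard deviation for scale invariance.
    std = resampled.std(axis=0)
    std[std == 0] = 1.0  # guard against degenerate (perfectly flat) strokes
    return resampled / std
```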

The actual IDM distance (D^2) between the input image and a template is computed by the following equation (a rough code sketch follows below):
  • D^2 = sum over all points (x, y) of [ min over (dx, dy) of || I1(x+dx, y+dy) - I2(x, y) ||^2 ]
    • where dx and dy represent pixel shifts
    • where Ii(x, y) represents the 3x3x5 feature values in image Ii from the 3x3 patch centered at point (x, y)
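
A brute-force sketch of how I read this equation, assuming the five feature values have already been rendered into an H x W x 5 array per symbol and that the allowed shift is one pixel in each direction (`max_shift` is my assumption, not a value from the paper):

```python
import numpy as np

def idm_distance(img1, img2, max_shift=1):
    """IDM distance between two feature images (a sketch of my understanding).

    img1, img2: (H, W, 5) arrays of the five per-point feature values.
    For each location (x, y), the 3x3x5 patch of img2 centered there is
    compared against patches of img1 shifted by (dx, dy); the smallest
    squared difference over the allowed shifts is accumulated.
    """
    H, W, _ = img2.shape
    pad = 1 + max_shift  # room for the 3x3 patch plus the pixel shift
    p1 = np.pad(img1, ((pad, pad), (pad, pad), (0, 0)))
    p2 = np.pad(img2, ((pad, pad), (pad, pad), (0, 0)))

    total = 0.0
    for y in range(H):
        for x in range(W):
            # 3x3x5 patch of the template image centered at (x, y).
            patch2 = p2[y + pad - 1:y + pad + 2, x + pad - 1:x + pad + 2, :]
            best = np.inf
            for dy in range(-max_shift, max_shift + 1):
                for dx in range(-max_shift, max_shift + 1):
                    patch1 = p1[y + dy + pad - 1:y + dy + pad + 2,
                                x + dx + pad - 1:x + dx + pad + 2, :]
                    best = min(best, float(np.sum((patch1 - patch2) ** 2)))
            total += best
    return total
```
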
Ouyang and Davis perform several optimizations to reduce computation time, including coarse candidate pruning, where only the N nearest neighbors under a coarse metric are passed on to the IDM comparison. After taking the first K principal components of the image, the system uses those K components to compute the following coarse metric (sketched in code below):
  • Dhat^2 = sum from k = 1 to K of ( v1(k) - v2(k) )^2
    • where vi(k) is the k-th principal component of the i-th image.
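
A rough sketch of this pruning step, assuming the feature images have been flattened into vectors; `fit_pca` and `coarse_prune` are my own names, not the paper's:

```python
import numpy as np

def fit_pca(train_vectors, k):
    """Top-k principal directions of the flattened feature images (NumPy only)."""
    X = np.asarray(train_vectors, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions (eigenvectors of the covariance).
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def coarse_prune(input_vec, template_vecs, mean, components, n_candidates):
    """Keep the N templates nearest to the input in the k-D PCA space."""
    v_in = (input_vec - mean) @ components.T
    v_tmpl = (np.asarray(template_vecs) - mean) @ components.T
    d_hat = np.sum((v_tmpl - v_in) ** 2, axis=1)  # the coarse metric Dhat^2
    return np.argsort(d_hat)[:n_candidates]      # indices passed on to the IDM
```
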
The second optimization is a hierarchical clustering optimization that uses a branch-and-bound technique, applying agglomerative hierarchical clustering to each class's training examples. The examples in a class are grouped by complete-link distance, repeatedly merging the two nearest clusters until only one cluster remains for the class. The resulting hierarchical tree is used to choose each cluster's center as its representative template, and the maximum distance from that center to the cluster's other members is used as the "cluster radius," rc. This radius is used to decide whether an entire cluster of training examples can be skipped: if the distance to the cluster center, dc, minus the cluster radius, rc, is larger than the best distance the IDM comparison has found so far, the entire cluster is ignored, lowering computation time.
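
The pruning test itself is simple once each cluster stores a representative center and a radius. A hedged sketch of that branch-and-bound matching loop (my own structure, reusing the `idm_distance` sketch above):

```python
def prune_and_match(input_img, clusters, idm_distance):
    """Branch-and-bound template matching over precomputed clusters (a sketch).

    clusters: list of (center_template, radius, member_templates), where
    radius is the maximum distance from the center to any member.
    """
    best_dist, best_template = float("inf"), None
    for center, radius, members in clusters:
        d_center = idm_distance(input_img, center)
        # If even the closest possible member (d_center - radius) cannot beat
        # the best match found so far, skip the whole cluster.
        if d_center - radius > best_dist:
            continue
        for template in members:
            d = idm_distance(input_img, template)
            if d < best_dist:
                best_dist, best_template = d, template
    return best_template, best_dist
```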

Rotational invariance is obtained by comparing 32 rotated versions of the input sketch to the templates that pass the hierarchical clustering optimization. 

The researchers tested their IDM-based recognition on 3 datasets and achieved the following accuracy rates and rankings relative to other recognizers:
  1. Pen Digits, 99.2%, 2
  2. HHReco, 98.2%, 1
  3. Circuit Diagrams, 96.2%, 1
For the Pen Digits dataset, the algorithm ran in 8.1 ms, the second-fastest time; the fastest was 2.4 ms and the third fastest was 40.8 ms.

The descriptions of the optimizations and the scaling were particularly interesting for this recognition scheme. It's also funny how well their distance-based recognition performs with only 5 features per point, 4 of which are orientations.

Reference:
Tom Y. Ouyang and Randall Davis, "A Visual Approach to Sketched Symbol Recognition", Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09), pp 1463 - 1468, Pasadena, California, USA, July, 2009

Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches


This paper is Michael Oltmans' thesis, written in 2007 at the Massachusetts Institute of Technology. Unlike the other papers that have been presented in this blog, this approach is a vision-based approach that uses a sliding window called a "bulls-eye," which is split like a darts target into concentric rings with radial divisions. The outer rings' sections are larger than the inner rings' (on a logarithmic scale), reflecting that points near the middle are in the "focus"/center of the fovea and more important than points that fall in the outer rings.

Oltmans makes the shapes found by the bulls-eye rotation invariant by aligning each shape so that the stroke direction points to the right along the x-axis. In case the determined direction of the stroke is wrong, Oltmans also flips the stroke direction and aligns that with the x-axis. To prevent the stroke's points from falling on either side of the bulls-eye's horizontal bins and damaging the comparison, the bulls-eye is rotated so that a single wedge (bin) straddles the horizontal axis. The stroke direction itself is found using a sliding window of 19 points, on which an orthogonal distance regression finds the best-fit line. Shapes in the bulls-eye are normalized for scale invariance so that they are 75 pixels wide and tall. The strokes within the shape are also resampled so that consecutive points are at least one pixel apart.

Oltmans' recognition scheme is temporally invariant (stroke order doesn't matter) since the entire figure is brought into focus at once, preserving the context of the figure.

The bulls-eye uses a histogram-based scoring method to compare the part of the sketch it has in focus to template bulls-eyes of other shapes. For this comparison, he uses the following distance metric (sketched in code below):

  • Sum( (Qi - Pi)^2 / (Qi + Pi) ) 
    • where Qi and Pi are the number of points falling in bin i of the input and template bulls-eyes, respectively.
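
In code this is essentially a chi-square-style histogram distance. A minimal sketch (my own; bins that are empty in both histograms are skipped to avoid dividing by zero):

```python
import numpy as np

def bullseye_distance(q, p):
    """Chi-square-like distance between two bulls-eye histograms (a sketch).

    q, p: arrays of per-bin point counts for the input and template bulls-eyes.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    denom = q + p
    mask = denom > 0  # ignore bins that are empty in both histograms
    return float(np.sum((q[mask] - p[mask]) ** 2 / denom[mask]))
```
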
Oltmans used his bulls-eye method to find shapes in circuit diagrams, including the symbols for resistors, batteries, grounds, and current sources (see the figure below).
[Figure: circuit symbols recognized by the bulls-eye method]


To determine which figures should be focused on, i.e. which contexts/shapes to perform recognition on, Oltmans moves along each stroke and adds a candidate context every 10 points. This process is repeated for larger and larger windows to find shapes drawn at different sizes (a toy sketch of this candidate generation follows below). The system then classifies some candidate regions as wires that don't match any other kind of shape. The other candidate regions are either combined or split into candidate regions that are processed by the bulls-eye for final classification. The initial set of clusters is found using an expectation-maximization clusterer. The clusterer is given a vector of each candidate region's top-left and bottom-right corners, weighted by the squared scores of the candidate regions. The clusterer outputs the mean and standard deviation of the four coordinates, which are used to find a mean candidate region to serve as a new region for shape prediction. These new regions are then split if they are too large, as determined by the standard deviations from the clusterer. Any remaining overlapping regions are resolved by greedily choosing the highest-weighted cluster.
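
A toy sketch of the candidate-generation step only; the window sizes and the data layout here are my own assumptions, not Oltmans' values:

```python
def candidate_regions(strokes, window_sizes=(50, 100, 150), step=10):
    """Generate square candidate regions along the strokes (a toy sketch).

    strokes: list of strokes, each a list of (x, y) points.
    Every `step` points, a candidate window of each size is centered on the
    current point; larger windows catch shapes drawn at larger scales.
    """
    regions = []
    for stroke in strokes:
        for i in range(0, len(stroke), step):
            cx, cy = stroke[i]
            for size in window_sizes:
                half = size / 2.0
                # Each region is stored as (top-left, bottom-right) corners.
                regions.append(((cx - half, cy - half), (cx + half, cy + half)))
    return regions
```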

Finally, predictions are made on the remaining clusters; within any set of overlapping clusters, the highest-scoring cluster is kept and the rest are thrown away. This proceeds until all regions have been classified.

 

Reference:
Michael Oltmans, "Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches", Massachusetts Institute of Technology, 2007

Tuesday, April 9, 2013

ShortStraw: a simple and effective corner finder for polylines


In this paper, Wolin et al. discuss their ShortStraw corner finding algorithm.

Their algorithm is based on "straws": segments connecting the endpoints of a small window of points centered on each point. Where the stroke turns sharply, the straw cuts across the corner and is noticeably shorter, so short straws indicate corners.

Formally:

  •  a straw at point i = distance(point i-w, point i+w) 
    • where w = 3, an empirically determined constant for best results

After resampling (which normalizes the straw lengths), their algorithm uses a bottom-up pass to build an initial set of corners, followed by a top-down pass to refine or add to that set (see the code sketch after the outline below).

Bottom-up stage: 

  • Compute the straws for all points
  • Find the median straw (distance)
  • Set threshold t = median * 0.95
  • Check for Additional Corners:
    • for each (straw in straws)
      • if (straw < threshold)
        • corners.add(corresponding point for straw)

Top-down stage:

  • Additive Co-linear check:
    • for each( pair of points in corners)
      • if( pair of points does not form a line [i.e. r <= 0.95])
        • Check for Additional Corners between the pair of points 
          • with relaxed threshold, t
  • Reductive Co-linear check:
    • for each( triplet of points in corners)
      • if( triplet of points forms a line [i.e. r > 0.95] )
        • remove middle point from triplet of points
where r = distance(points,a,b) / path-distance(points,a,b)
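
Here is the promised sketch of the two stages in Python. It is my own simplification: the additive check between corner pairs is omitted, and the corner chosen within each run of short straws is the run's local minimum.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def path_dist(points, i, j):
    return sum(dist(points[k], points[k + 1]) for k in range(i, j))

def shortstraw_corners(points, w=3, median_factor=0.95, line_threshold=0.95):
    """Bottom-up corner detection plus the reductive collinear check (a sketch).

    points: resampled stroke points as (x, y) tuples.
    """
    n = len(points) - w
    if n <= w:
        return [0, len(points) - 1]
    straws = {i: dist(points[i - w], points[i + w]) for i in range(w, n)}
    values = sorted(straws.values())
    threshold = values[len(values) // 2] * median_factor

    # Bottom-up: within each run of short straws, keep the local minimum.
    corners, i = [0], w
    while i < n:
        if straws[i] < threshold:
            local_min, j = i, i
            while j < n and straws[j] < threshold:
                if straws[j] < straws[local_min]:
                    local_min = j
                j += 1
            corners.append(local_min)
            i = j
        else:
            i += 1
    corners.append(len(points) - 1)

    # Top-down (reductive): drop a corner whose neighbors already form a line.
    changed = True
    while changed:
        changed = False
        for k in range(1, len(corners) - 1):
            a, b = corners[k - 1], corners[k + 1]
            r = dist(points[a], points[b]) / (path_dist(points, a, b) or 1.0)
            if r > line_threshold:
                del corners[k]
                changed = True
                break
    return corners
```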

Results
The ShortStraw corner finder, when run on 244 different polyline shapes, identified 97.9% of actual corners as corners ("Correct Corners Accuracy"). More importantly, it achieved an "All-or-Nothing Accuracy" of 74.1%, meaning that for 74.1% of the strokes ShortStraw found every corner and did not misclassify any non-corner as a corner.

This is impressive considering that the Sezgin and Kim recognizers had Correct Corner Accuracies of 82.4% and 79.0%, respectively, and All-or-Nothing Accuracies of 27.8% and 29.7%.

Even better, the algorithm is simple and competitive in terms of the computation time and resources required to run it.

It's encouraging to see simple ideas outperform the more advanced methods. Simplicity can be beautiful at times.

A. Wolin, B. Eoff, and T. Hammond. 2008. ShortStraw: a simple and effective corner finder for polylines. In Proceedings of the Fifth Eurographics conference on Sketch-Based Interfaces and Modeling (SBM'08), Christine Alvarado and Marie-Paule Cani (Eds.). Eurographics Association, Aire-la-Ville, Switzerland, Switzerland, 33-40. DOI=10.2312/SBM/SBM08/033-040 http://dx.doi.org/10.2312/SBM/SBM08/033-040

Saturday, April 6, 2013

A Domain-Independent System for Sketch Recognition

In this paper, Yu and Cai discuss their system, which was designed with the following goals:

  • Naturalness
    • It must allow users to draw in a natural way just as how they do on paper.
  • Consistency
    • It should produce a consistent recognition result,
      • i.e. if a user wants to sketch a line segment, no matter if it is drawn with a single stroke, or more than one stroke, or a compound stroke which contains other primitive elements, the system should interpret it as a line segment.
  • Hierarchical Composition 
    • It should understand the hierarchical relations among the recognized graphic objects. 
      • For example, a stroke which is composed of several primitive shapes may serve as a constituent part of some more complex object.
  • Beautification
    • It should try to guess what the user intends to draw and turn the imprecise strokes into regular objects. 
      • Although domain-specific knowledge may be very helpful here, some simple decisions can still be made according to low-level geometric features only.
  • Editing Support
    • Besides a fairly high recognition rate, it should provide an efficient and convenient user interface for drawing modification as well.
  • Modularity
    • It can be easily integrated as an input module into other more complex systems.
Their system was divided into two stages:
  1. Imprecise Stroke Approximation
    • Vertex Detection and Primitive Recognition
      • Procedure: For every stroke, check the stroke to see if it is a primitive. Stop if it is; otherwise cut the stroke at the point of highest curvature (a corner) and perform the primitive check on the two resulting strokes (recursively; see the code sketch after this list).
    • Primitive Detection: Lines, Curves, Circles, Ellipses, Squares, Spirals (referred to as 'helixes' in the paper)
    • Discussed Topics
      • Line Segment Approximation
        • Computes the least-squares best-fit line for the direction curve (or for the point coordinates) of the stroke and checks whether the number of deviating points is "small" relative to a threshold. If it is, guess line segment and verify with feature area verification; otherwise, delay classification.
        • "Feature Area Verification" - checks the ratios between feature and standard area; Serves as a constraint to local criteria for line segment recognition on strokes that are based on direction and curvature information.
      • Curve Approximation
        • Do circle and ellipse approximation
          • Circle: check if direction curve can be approximated with a line with a slope of 2pi / n. If so, guess circle with center as the bounding box center and mean distance from center as the radius. Verify with feature area verification.
            • Overtracing is handled by breaking each 2pi revolution of the direction graph into separate circles and combining them by taking the mean features of the circles.
          • Ellipse: "similar to circle" with extra constraint : area is symmetrical about major and minor ellipses.
            • A dubious claim, according to Dr. Hammond
          • Arc Stroke: Direction curve is approximated by line with slope *less than* 2pi / n.
      • Self-Intersecting Stroke Handling
        • If stroke can't be broken down into primitives and it has intersecting lines, make two copies of the stroke and divide them using two strategies:
          1. At self intersection points
          2. At points of maximum curvature
        • Choose the one with the "better" recognition 
  2. Post Processing
    • Eliminates elements caused by noise in the data, beautifies the remaining elements, and performs domain-independent recognition.
    • Steps: 
      • Simple Relation Retrieval
      • Cleanup
      • Basic Object Recognition
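
A hedged sketch of the recursive split-and-check control flow from stage 1; the `is_primitive` and `curvature` callables are placeholders standing in for the paper's actual primitive and curvature tests:

```python
def recognize_stroke(points, is_primitive, curvature, min_points=10):
    """Recursive vertex detection / primitive recognition (control flow only).

    is_primitive(points) -> label or None; curvature(points, i) -> float.
    Both are hypothetical hooks, not the paper's implementations.
    """
    label = is_primitive(points)
    if label is not None or len(points) < min_points:
        return [(label, points)]

    # Cut at the point of highest curvature (a likely corner) and recurse.
    interior = range(1, len(points) - 1)
    cut = max(interior, key=lambda i: abs(curvature(points, i)))
    return (recognize_stroke(points[:cut + 1], is_primitive, curvature, min_points)
            + recognize_stroke(points[cut:], is_primitive, curvature, min_points))
```
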
Their user interface also allows editing: the user enters a "modification state" and draws editing command symbols, which include "delete", "undo", and "redo".

The system was evaluated with only 10 students (totally :p) who thought that the system "was easy to use".
The students entered a set of primitive and hybrid shapes.

"Correct Rate":
  • Primitive Shapes: 98%
    • Arcs: 94%
  • Hybrid Shapes: 70%
    • Smooth connections in hybrid shapes were hard to detect.

The paper's evaluation was, qualitatively and quantitatively, extremely weak. Simply saying that the system was easy to use is ambiguous. How was it easy to use? What parts did they like? Did they ask whether there was anything the users didn't like? Perhaps using a Likert-scale questionnaire would give a more solid grasp of user satisfaction. How long did they use it for? Was the evaluation so short that they couldn't find anything wrong with the system?

They only tested on 10 students, and no mention was made of how many shapes each user had to enter. Their "correct rate" metric is a little vague and not a standard measure of recognition performance.

The paper also erroneously calls spirals "helixes" and there were many spelling and grammatical errors that distracted from the paper.
Ex: "We has devised a function..." (Can they has chzburgers?)
Ex: "There were totally 10 students..." (Did a surfer write this paper?)
Ex: "the curvature curve..." (This happened a lot in the paper and it is really redundant.)

Don't get me wrong, they had some smart ideas. Their method of recognizing overtraced circles was very clear and concise, and their method of breaking strokes down at high-curvature vertices makes perfect sense.

Bo Yu and Shijie Cai. 2003. A domain-independent system for sketch recognition. InProceedings of the 1st international conference on Computer graphics and interactive techniques in Australasia and South East Asia (GRAPHITE '03). ACM, New York, NY, USA, 141-146. DOI=10.1145/604471.604499 http://doi.acm.org/10.1145/604471.604499

Friday, April 5, 2013

Mechanix : Sketch Recognition Algorithms for Comparing Complex and Unpredictable Shapes


In this paper, Field et al. discuss their software platform, "Mechanix," which is designed to help teachers publish and grade course material that students submit online in the form of sketches.

The main subject domain that their recognition system is used for is in the recognition of free body diagrams, primarily trusses, for engineering courses at Texas A&M (whoop!).

Their recognizer uses a single example provided by a professor, teacher, or TA as the training data against which the students' input (truss drawings) is compared for grading. Rather than give a confidence or break down the results, the recognizer simply returns a correct or not-correct response.

Body Identification
The free-body input is also checked to see whether the bodies are indeed closed shapes. This is done by traversing the stroke segments' endpoints and finding the closest endpoints, checking whether those endpoints are within 9% of the total path length of each other. After all endpoints are checked, the start and end points are checked to see if they are close enough. If all conditions pass, the shape is a closed shape.
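
A rough sketch of that closure test. It assumes the segments are already in drawing order; the paper searches for each endpoint's nearest neighbor, which I simplify here to consecutive endpoints:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def is_closed_shape(segments, tolerance=0.09):
    """Rough closed-shape test in the spirit of the paper (my sketch).

    segments: ordered list of stroke segments, each a list of (x, y) points.
    Consecutive segment endpoints, and finally the overall start and end,
    must be within `tolerance` of the total path length of each other.
    """
    total = sum(dist(p, q) for seg in segments for p, q in zip(seg, seg[1:]))
    gap = tolerance * total

    # Each segment's end must be close to the next segment's start...
    for cur, nxt in zip(segments, segments[1:]):
        if dist(cur[-1], nxt[0]) > gap:
            return False
    # ...and the overall start and end must also meet.
    return dist(segments[0][0], segments[-1][-1]) <= gap
```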

Body Comparison - Tanimoto + Hausdorff
Field et al.'s algorithm works by preprocessing the stroke for temporal invariance, rotational invariance, scale invariance, and position invariance similar to the preprocessing in $1.

Then, the Tanimoto coefficient, Hausdorff distance, and modified Hausdorff distance are computed and combined to give a "matching" or "not matching" response, accepting a match if the combined confidence is greater than the acceptance threshold of 65%. (A code sketch of these metrics follows the formulas below.)

Hausdorff Distance:
  • Hausdorff Distance(A, B) = max( max(Da), max(Db) )
    • where Da = min distance from each point a in A to the points of B
    •            Db = min distance from each point b in B to the points of A
      • where a and b are points in point vectors Pa and Pb from the input (A) and training (B) points, respectively
  • Modified Hausdorff(A, B) = (Va + Vb) / 2
    • where Va = avg(Da)
    •            Vb = avg(Db)

Tanimoto Coefficient:
  • Normally: T(A, B) = numPts(A & B) / ( numPts(A) + numPts(B) - numPts(A & B) )
  • Mechanix: T(A, B) = ( numPts(A & B) + numPts(B & A) ) / ( numPts(A) + numPts(B) )
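
A sketch of these metrics on preprocessed point sets stored as NumPy arrays. The `radius` used to approximate the point-set overlap ("A & B") in the Tanimoto computation is my own assumption, not a value from the paper:

```python
import numpy as np

def hausdorff_metrics(A, B):
    """Hausdorff and modified Hausdorff distances between point sets (a sketch).

    A, B: (n, 2) and (m, 2) arrays of preprocessed points.
    """
    # Pairwise distances: d[i, j] = distance from A[i] to B[j].
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    Da = d.min(axis=1)  # for each point of A, distance to its nearest B point
    Db = d.min(axis=0)  # for each point of B, distance to its nearest A point
    hausdorff = max(Da.max(), Db.max())
    modified = (Da.mean() + Db.mean()) / 2.0
    return hausdorff, modified

def tanimoto_mechanix(A, B, radius=2.0):
    """Mechanix-style Tanimoto coefficient (my sketch; "A & B" is approximated
    as the points of A within `radius` of some point of B, and vice versa)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    a_near_b = np.count_nonzero(d.min(axis=1) <= radius)
    b_near_a = np.count_nonzero(d.min(axis=0) <= radius)
    return (a_near_b + b_near_a) / (len(A) + len(B))
```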

Truss Identification
Trusses are identified by finding intersections that share edges, with three distinct paths connecting them.

Field et al. achieved a recall of 0.991 and a precision of 0.691.
While the precision seems low, the high recall is what is really important, since the recognizer needs to verify the correctness of the sketches and recognize correct trusses as such.

The high recognition rates for general symbols with Tanimoto and Hausdorff are impressive, and it's great to see a strong practical application of sketch recognition in education. Knowing what's important for the domain was a strong point of the paper.

Martin Field, Stephanie Valentine, Julie Linsey, and Tracy Hammond. 2011. Sketch recognition algorithms for comparing complex and unpredictable shapes. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three (IJCAI'11), Toby Walsh (Ed.). AAAI Press, 2436-2441. DOI=10.5591/978-1-57735-516-8/IJCAI11-406 http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-406

Sunday, March 24, 2013

The Word-Gesture Keyboard: Reimagining Keyboard Interaction


In this paper, Zhai and Kristensson discuss their Word-Gesture Keyboard which allows users to drag their finger in a single stroke across a touch screen keyboard to input a word. Their keyboard uses a probabilistic model to provide a possible-word menu from which the user can select their desired input.
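
To illustrate the general idea only (this is a toy stand-in, not Zhai and Kristensson's model), a gesture can be scored against each word's "ideal" path through its key centers and the closest words offered in the menu. The `KEY_CENTERS` coordinates, the lexicon, and the mean point-to-point distance used as the score are all hypothetical choices of mine:

```python
import numpy as np

# Hypothetical key-center coordinates for a few keys (not a real layout).
KEY_CENTERS = {"c": (30, 20), "a": (10, 10), "t": (50, 0), "r": (40, 0)}

def resample(path, n=32):
    """Resample a path to n evenly spaced points along its arc length."""
    path = np.asarray(path, dtype=float)
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, cum[-1], n)
    return np.column_stack([np.interp(t, cum, path[:, 0]),
                            np.interp(t, cum, path[:, 1])])

def word_template(word):
    """Ideal gesture for a word: the polyline through its key centers."""
    return resample([KEY_CENTERS[ch] for ch in word])

def rank_words(gesture, lexicon):
    """Rank candidate words by mean distance between the user's gesture and
    each word's ideal path (a toy stand-in for the real probabilistic model)."""
    g = resample(gesture)
    scores = {w: float(np.mean(np.linalg.norm(g - word_template(w), axis=1)))
              for w in lexicon}
    return sorted(scores, key=scores.get)
```

For example, `rank_words(user_path, ["cat", "rat"])` would put "cat" first if the swipe hugs the c-a-t path more closely than the r-a-t path.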

Zhai and Kristensson first describe the requirements for the widespread acceptance of a new text input method:

  1. High Speed of Input
  2. Minimal Cognitive Load
  3. Support for the Development of Efficiency

From there, the researchers discuss the feasibility of a gesture-based text input method. They note that English, like nearly all natural languages, features a high degree of regularity and redundancy, and that the average word length is only about 4 to 5 characters. Longer, more complex words can still be entered through the ordinary striking (tapping) method, so all words can be entered by users regardless of their particular skill or accuracy with the keyboard. Essentially, SHARK (shorthand-aided rapid keyboarding) is an improvement of the existing keyboard. The probabilistic matching against words in the English lexicon also reduces the impact of inaccurate input caused by the fat-finger problem.

Since the user doesn't have to lift their finger, the input generally takes less time than tapping.
Their keyboard input method does not require any training, which is a must for any new text input method. Since most input consists of a small set of commonly used words, these common words' input paths are easily remembered through procedural/muscle memory, which further improves input speed. This practice allows users to progressively become more efficient, a positive mark of a "progressive user interface."

Their keyboard also features common system commands...

  • to change the case of the word before the caret to further decrease time to input and minimize user frustration.
  • to copy and paste
  • to switch languages (lexicon search and different character input sets)

Zhai and Kristensson performed several experiments to test...

  • recall rates without visual feedback (looking at the keyboard) across multiple sessions
    • resulting in about 13 words remembered for the 1st session to 45 to 57 words remembered for the 4th session
  • entry rates (speed) for initial performance with varying amounts of time to practice on QWERTY and ATOMIC keyboards
    • resulting in 7 to 13 wpm for ATOMIC and 15-25 wpm for QWERTY
  • maximum performance for small set of text inputs recorded in the form of a competition
    • with an average error-free performance of 57.5 wpm and top performance of 99 wpm

A subjective evaluation was also conducted with responses and comments gathered from their official release with the Apple App Store for the iPhone. Responses were 81.6% completely positive, 12.5% somewhat positive, and 5.9% completely negative.

Given the limited input space of touch screens on mobile devices, it is impossible to use the standard two-handed typing for which keyboards like DVORAK were designed. As such, it makes sense to explore less traditional methods of keyboard input and to attempt to ease input on these devices, especially in light of their widespread use. Zhai and Kristensson's work does just that, with the grounded scientific rigor and forethought that such work deserves.


Shumin Zhai and Per Ola Kristensson. 2012. The word-gesture keyboard: reimagining keyboard interaction. Commun. ACM 55, 9 (September 2012), 91-101. DOI=10.1145/2330667.2330689 http://doi.acm.org/10.1145/2330667.2330689

A Few Useful Things to Know About Machine Learning


In this article, Pedro Domingos attempts to reveal the "Folk Knowledge" and "Black Art" commonly used in machine learning.

First Domingos provides some key insights about machine learning:

  • Machine learning allows the program to generalize rules from examples. This saves programmers time by not having to explicitly program rules. Cost savings are increased for larger data sets.
  • Using machine learning techniques successfully requires information not easily found in formal information sources. i.e. "Black Art" is abundant in the field.

The insights entailed in this "black art" are revealed by Domingos in 12 lessons:

1st Lesson: Learning = Representation + Evaluation + Optimization:

  1. Representation
    • Set of Classifiers - Hypothesis Space
      • The set of classifiers the learner is able to produce; a classifier outside this space cannot be learned
  2. Evaluation
    • Evaluation Function / Scoring Function
      • Distinguishes good classifiers from bad ones
  3. Optimization
    • The optimizer searches among the representable classifiers for the highest-scoring one; the search method trades off efficiency against the quality of the classifier found
2nd Lesson: It's Generalization that Counts
Training data should not be used as test data, as doing so 'contaminates' the test and gives an unrealistic, overoptimistic view of the classifier's performance. Optimizing the classifier for the training data can lead to overfitting/overspecialization. Instead, the classifier should generalize so that it is effective on a larger, more diverse set of data. We don't have access to the function the learning algorithm is really trying to optimize; we instead approximate it from the data.


3rd Lesson: Data Alone Is Not Enough
It is often impossible to know and store every possible instance of all the classes. Without any additional information about the data, random guessing is the best we can do; this is summed up by Wolpert's "no free lunch" theorem, which says that no learner can beat random guessing over all possible functions to be learned. Instead, the learner/classifier must make some assumptions about what the data looks like. Knowledge acts as a lever for predictive power: learners combine knowledge with data to "grow" programs.

4th Lesson: Overfitting Has Many Faces
Overfitting is a generalization error that can be decomposed into bias and variance. Bias is the learner's tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Different types of classifiers behave differently with respect to bias and variance, and there is often a tradeoff between the two. We can counteract overfitting with cross-validation or by adding a regularization term that prefers classifiers with less structure and complexity. Significance tests and controlling the false discovery rate can combat the problem of multiple testing. Similarity-based reasoning and common patterns break down for data sets with high dimensionality.
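
The two countermeasures named here, cross-validation and regularization, are easy to demonstrate. A minimal sketch, assuming scikit-learn is installed and using synthetic data purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data just for illustration: 100 noisy samples of a linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=100)

# Ridge adds a regularization term (alpha) that penalizes complex weight vectors;
# cross-validation scores the model on held-out folds rather than training data.
for alpha in (0.01, 1.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    print(f"alpha={alpha:>6}: mean held-out R^2 = {scores.mean():.3f}")
```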

5th Lesson: Intuition Fails in High Dimensions
Humans are unable to see patterns in data with high dimensionality, which is part of the reason why machine learning is so useful; however, the same difficulty makes it harder to craft features for the classifier and to successfully make generalizations and assumptions about the data.

6th Lesson: Theoretical Guarantees Are Not What They Seem
The number of examples needed to generalize well enough to make solid claims about a classifier's effectiveness is bounded; however, in induction these guarantees are probabilistic and require many more examples than in deduction. Furthermore, these bounds do not tell us how to construct a good hypothesis space. Theoretical guarantees are a source of insight for algorithm design but may not be indicative of the classifier's predictive power and effectiveness in practice.

7th Lesson: Feature Engineering is the Key
Understanding the domain, taking the time to gather good data in a large enough volume, and adequately analyzing it is often the most time-consuming and important part of machine learning. Automating feature generation is possible, but the results are often poorer in quality than hand-crafted features.

8th Lesson: More Data Beats a Cleverer Algorithm
The gains from using a more complex and supposedly more accurate classifier are often minimal, and larger gains can usually be had for the same effort by getting more data. Start simple, and add complexity and refinement only if needed. Note, though, that fixed-size learners can only take advantage of so much data.

9th Lesson: Learn Many Models, Not Just One
Often a combination of classifiers through bagging, boosting, or stacking can give better results. Model ensembles are now commonly used.

10th Lesson: Simplicity Does Not Imply Accuracy
Rather, simpler hypotheses are preferred for the sake of simplicity rather than accuracy.

11th Lesson: Representable Does Not Imply Learnable
Rather, the form, depth, breadth, and distribution of the data may make the function difficult, if not impossible, to learn. The relationship between representation and function can vary.

12th Lesson: Correlation Does Not Imply Causation
Rather, correlation *may* indicate that there is a set of causal links. This relationship should be explored experimentally to discover the link before a conclusion is made about causality.

This article is a great introduction to the pitfalls of machine learning. However, it does seem that some of the discussion is slightly too abstract and pedagogical to be operationally useful when the discussed problems are encountered in the field. However, as a primer, the article works well.

Pedro Domingos. 2012. A few useful things to know about machine learning. Commun. ACM 55, 10 (October 2012), 78-87. DOI=10.1145/2347736.2347755 http://doi.acm.org/10.1145/2347736.2347755