Fig. 1 Flexibility of letterform perception in humans. (A) Humans are good at parsing unfamiliar CAPTCHAs. (B) The same character shape can be rendered in a wide variety of appearances, and people can detect the “A” in these images regardless. (C) Common sense and context affect letterform perception: (i) m vs u and n. (ii) the same line segments are interpreted as N or S depending on occluder positions. (iii) perception of the shapes aids the recognition of “b,i,s,o,n” and “b,i,k,e”. [Bison logo with permission from Seamus Leonard, http://www.steadynow.com]
Scientists have published a paper explaining how they built a neural network that outperforms “traditional” deep-learning neural networks on a text-recognition challenge, while being 300 times more data-efficient and showing excellent generalization.
Fig. 2 Structure of the RCN. (A) A hierarchy generates the contours of an object, and a Conditional Random Field (CRF) generates its surface appearance. (B) Two subnetworks at the same level of the contour hierarchy keep separate lateral connections by making parent-specific copies of child features and connecting them with parent-specific laterals; nodes within the green rectangle are copies of the feature marked “e”. (C) A three-level RCN representing the contours of a square. Features at Level 2 represent the four corners, and each corner is represented as a conjunction of four line-segment features. (D) A four-level network representing an “A”.
Fig. 3 Samples from RCN. (A) Samples from a corner feature with and without lateral connections. (B) Samples from character “A” for different deformability settings, determined by pooling and lateral perturb-factors, in a 3-level hierarchy similar to Fig. 2D, where the lowest level features are edges. Column 2 shows a balanced setting where deformability is distributed between the levels to produce local deformations and global translations. The other columns show some extreme configurations. (C) Contour to surface-CRF interaction for a cube. Green factors: foreground-to-background edges, blue: within-object edges. (D) Different surface-appearance samples for the cubical shape in C. [See section 3 of (33) for CRF parameters.]
Fig. 4 Inference and learning. (A) (i) Forward pass, including lateral propagation, produces hypotheses about the multiple letters present in the input image. PreProc is a bank of Gabor-like filters that convert from pixels to edge likelihoods [section 4.2 of (33)]. (ii) Backward pass and lateral propagation creates the segmentation mask for a selected forward-pass hypothesis, here the letter “A” [section 4.4 of (33)]. (iii) A false hypothesis “V” is hallucinated at the intersection of “A” and “K”; false hypotheses are resolved via parsing [section 4.7 of (33)]. (iv) Multiple hypotheses can be activated to produce a joint explanation that involves explaining away and occlusion reasoning. (B) Learning features at the second feature level. Colored circles represent feature activations. The dotted circle is a proposed feature [see text and section 5 of (33)]. (C) Learning of laterals from contour adjacency (see text).
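To make the PreProc step in Fig. 4A more concrete, here is a minimal pure-Python sketch of a small bank of oriented Gabor-like filters applied to a grayscale image. It is not the preprocessing from the paper: the function names (`gabor_kernel`, `edge_responses`), the kernel size, the wavelength, sigma and the number of orientations are all illustrative assumptions, and the output is a raw absolute filter response rather than a calibrated edge likelihood.

```python
import math

def gabor_kernel(size, theta, wavelength=4.0, sigma=2.0):
    """Build a size x size odd-symmetric Gabor-like kernel (assumed parameters).

    theta is the direction across which intensity changes are detected,
    so theta = 0 responds to vertical edges.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Project the pixel offset onto the filter's direction.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.sin(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def edge_responses(image, n_orientations=4, size=5):
    """Return one response map per orientation (valid region only).

    image is a list of rows of floats; each map holds the absolute
    filter response, a crude stand-in for an edge likelihood.
    """
    height, width = len(image), len(image[0])
    half = size // 2
    maps = []
    for k in range(n_orientations):
        kernel = gabor_kernel(size, math.pi * k / n_orientations)
        out = []
        for r in range(half, height - half):
            out_row = []
            for c in range(half, width - half):
                acc = 0.0
                for dy in range(-half, half + 1):
                    for dx in range(-half, half + 1):
                        acc += kernel[dy + half][dx + half] * image[r + dy][c + dx]
                out_row.append(abs(acc))
            out.append(out_row)
        maps.append(out)
    return maps

# Toy input: a 9x9 image that is dark on the left and bright on the right.
step_image = [[1.0 if c >= 5 else 0.0 for c in range(9)] for _ in range(9)]
responses = edge_responses(step_image)
```

On this toy image, the map for orientation 0 responds strongly near the vertical step, while the map for the perpendicular orientation stays near zero. That per-orientation separation is what the later stages of a contour model rely on.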
Fig. 5 Parsing CAPTCHAs with RCN. (A) Representative reCAPTCHA parses showing top two solutions, their segmentations, and labels by two different Amazon Mechanical Turk workers. (B) Word accuracy rates of RCN and CNN on the control CAPTCHA data set. CNN is brittle and RCN is robust when character-spacing is changed. (C) Accuracies for different CAPTCHA styles. (D) Representative BotDetect parses and segmentations (indicated by the different colors).
Fig. 6 MNIST classification results for training with few examples. (A) MNIST classification accuracy for RCN, CNN, and CPM. (B) Classification accuracy on corrupted MNIST tests. Legends show the total number of training examples. (C) MNIST classification accuracy for different RCN configurations.
Fig. 7 Generation, occlusion reasoning, and scene-text parsing with RCN. Examples of reconstructions (A) and reconstruction error (B) from RCN, VAE and DRAW on corrupted MNIST. Legends show the number of training examples. (C) Occlusion reasoning. The third column shows edges remaining after RCN explains away the edges of the first detected object. Ground-truth masks reflect the occlusion relationships between the square and the digit. The portions of the digit that are in front of the square are indicated by brown color and the portions that are behind the square are indicated by orange color. The last column shows the predicted occlusion mask. (D) One-shot generation from Omniglot. In each column, row 1 shows the training example and the remaining rows show generated samples. (E) Examples of ICDAR images successfully parsed by RCN. The yellow outlines show segmentations.
Fig. 8 Application of RCN to parsing scenes with objects. Shown are the detections and instance segmentations obtained when RCN was applied to a scene parsing task with multiple real-world objects in cluttered scenes on random backgrounds. Our experiments suggest that RCN could be generalized beyond text parsing [see section 8.12 of (33) and Discussion].
The paper was published in yesterday's issue of the journal Science and can be read and/or downloaded from the links below:
Don't know what a CAPTCHA is?
CAPTCHA is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”: a cognitive challenge test used as an anti-spam tool, pioneered at Carnegie Mellon University. Because the test is administered by a computer, in contrast to the standard Turing test, which is administered by a human, it is more accurately described as a reverse Turing test.
A common type of CAPTCHA requires the user to identify the letters in a distorted image, sometimes with the addition of an obscured sequence of letters or digits that appears on the screen.
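The challenge-and-response flow behind a text CAPTCHA can be sketched in a few lines. This is only an illustration of the protocol, not the code of any real CAPTCHA service: the names `new_challenge` and `verify`, the alphabet, and the default length are assumptions, and the step that renders the string as a distorted image is left out entirely.

```python
import secrets
import string

# Assumed alphabet: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits

def new_challenge(length=6):
    """Generate the secret text the server would render as a distorted image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def verify(expected, response):
    """Accept the response if it matches, ignoring case and surrounding spaces."""
    return expected.upper() == response.strip().upper()

challenge = new_challenge()
# A human who reads the image correctly passes, even when typing in lowercase.
passed = verify(challenge, challenge.lower())
# A wrong guess is rejected.
rejected = verify(challenge, challenge + "X")
```

The server keeps `challenge` private and shows the user only the distorted rendering. A system such as RCN attacks exactly that rendering step by recovering the string from the image.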