Every day we recognize thousands of objects without paying a moment's attention to how or why we can do this. Visual object recognition refers to our ability to attribute meaning to what we see (AtoZ), and its study centres on how we come to understand what we are looking at (LN). This paper concentrates on how humans recognize objects through the integration and interpretation of visual features, and on how we know that this is how our minds process the objects we see. Object recognition is the end point of visual processing and a vital step before one can interact with and reason about the world (Peissig & Tarr, 2007). Various principles and models have been proposed to explain how people group simple elements into the objects of everyday life. The discussion will start by considering the Gestalt laws of perceptual organization, which describe a crucial early stage in object recognition. After that, more general theories of object recognition will be discussed.
We group visual features into whole objects effortlessly, but how do people connect these features so easily? Gestalt psychologists proposed several principles to explain how people group features together and decide which are relevant and which are not, so that attention can be given to those that matter (AtoZ). Gestaltists believed that one needs to work out where an object is before deciding what it is (Eysenck & Keane, 2010). Four of these laws will be discussed to give a brief description of how they influence perceptual organization: proximity, similarity, good continuation, and closure.
According to the law of proximity, elements are grouped together into an object if they are close together and represent a figure (AtoZ). The law of similarity is the tendency to group elements that are similar in colour, shape, or texture as though they belong to the same object (AtoZ). The law of good continuation leads us to perceive elements that follow a continuous, smooth edge as forming a single figure (AtoZ). The law of closure describes the tendency to fill in gaps spontaneously so as to create a coherent, meaningful shape (AtoZ). Evidence supporting this approach was reported by Pomerantz (1981). Observers viewed four-item visual arrays and attempted to identify the odd item out as quickly as possible. Pomerantz found that observers identified the item more quickly when the array was more complex but also more easily organized.
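The law of proximity can be given a computational gloss: if visual elements are treated as points, grouping by proximity amounts to clustering points whose pairwise distance falls below some threshold. The sketch below is purely illustrative and is not a model proposed in the literature; the coordinates and the threshold value are invented for demonstration.

```python
from math import dist

def group_by_proximity(points, threshold):
    """Cluster 2-D points: any two points closer than `threshold`
    end up in the same group (simple single-link clustering)."""
    groups = []
    for p in points:
        # Find existing groups that contain a point near p.
        near = [g for g in groups if any(dist(p, q) < threshold for q in g)]
        merged = [p]
        for g in near:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups

# Two visibly separate clusters of dots, plus one isolated dot.
dots = [(0, 0), (0, 1), (1, 0),      # tight cluster A
        (10, 10), (10, 11),          # tight cluster B
        (30, 30)]                    # isolated element
print(len(group_by_proximity(dots, threshold=2.0)))  # 3 perceived groups
```

The point of the toy example is only that nearness alone, with no knowledge of what the dots depict, already yields the three "objects" an observer would report.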
In addition to these laws, figure–ground segregation was central to the Gestalt approach (Eysenck & Keane, 2010). When we perceive information we organize it into a meaningful figure (relevant information) against a less meaningful ground (irrelevant information) (AtoZ). How this is accomplished remains a topic of research, although detecting concavities may be particularly important to the process. Nevertheless, figure–ground segregation is a basic law of perceptual organization (AtoZ). We do know that figure and ground are not fixed: they can be reversed, but this requires more conscious effort (AtoZ). Weisstein and Wong (1986) gave participants a rapid discrimination task: deciding whether a line was vertical or slightly tilted. Observers performed three times better when the line was presented on what they perceived as the figure rather than on the ground (Eysenck & Keane, 2010). Gestalt psychologists therefore proposed that object recognition proceeds as follows: visual stimulation, focal attention, figure–ground segregation, and then object recognition (LN).
However, even though the Gestalt laws have correctly predicted our perception in many cases, and thereby remain important, the approach has several shortcomings. Nearly all the evidence for the principles was based on two-dimensional line drawings, so it is important to establish whether the findings apply to more realistic scenes. Furthermore, the Gestaltists proposed that the grouping principles operate in a purely bottom-up fashion to produce perceptual organization. Later research showed that top-down processes are often involved, with figure–ground segregation being influenced by past experience and by attentional processes (Eysenck & Keane, 2010).
Even though humans do it effortlessly, object recognition is difficult to explain because objects can be seen from many different positions. Other theories of object recognition have therefore described objects in terms of their component forms and how those forms are put together (Harris, 2014). Two such theorists are Marr and Biederman.
According to Marr’s computational theory, the image on the retina can be represented in various ways in the nervous system, and the theory distinguishes three such representations (Harris, 2014). Marr and Nishihara (1978) were the first to introduce the idea of part-based structural representations, based on three-dimensional volumes and their spatial relations (Peissig & Tarr, 2007). They suggested that the segments of objects come to be mentally represented as generalized cones or cylinders, and objects as hierarchically organized structural models relating the spatial positions of the segments to one another (Peissig & Tarr, 2007). Marr’s theory outlines a number of processes necessary for object recognition. In the primal sketch, edges and other components are detected in early vision, and areas of similar intensity in the retinal image are grouped together (Harris, 2014). Then the depths and relative positions of object surfaces are mapped out in the two-and-a-half-dimensional (2.5D) sketch. Finally, this representation is further processed to produce a viewpoint-independent three-dimensional model of the structure of objects. To do so, concave regions are used to segment objects into generalized cones and cylinders. Each component has a principal axis of elongation, which is essential for constructing the three-dimensional model, with each model representing a different spatial scale (Harris, 2014). According to Peissig and Tarr (2007), Marr and Nishihara’s theory was entirely computational, with no attempt at empirical validation.
Biederman (1987) extended Marr’s approach by providing another structural account of object recognition, broken down into a number of stages. His Recognition-by-Components model represents objects with three-dimensional shapes known as ‘geons’ (geometrical ions) (Harris, 2014). Examples of geons include blocks, cylinders, wedges, arcs, and spheres (Eysenck & Keane, 2010). As in Marr’s computational model, objects are represented by a collection of geons together with a description of their orientations and spatial relationships. Biederman suggested that the human visual system converts its input to a line-drawing-like representation at an early stage of processing (Harris, 2014). Edges and concavities are used to detect non-accidental properties, which both define an object and are visible from many different angles. Objects are then decomposed into a visual alphabet of 36 geons, which can combine to define most of the objects we see (Eysenck & Keane, 2010). Evidence supporting the significance of geons was reported by Biederman and Cooper (1993), who used repetition priming to test the theory. They found that task performance was significantly worse when a geon in an image had changed than when it had not. However, there are several limitations to Biederman’s theory. One of the major limitations is that it focuses solely on bottom-up processes triggered directly by the stimulus input, thereby excluding the top-down processes that arise from expectations and knowledge (Eysenck & Keane, 2010).
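A toy sketch can show why a structural description of this kind is viewpoint-independent: if an object is stored as a set of geons plus their spatial relations, recognition reduces to matching that description, which does not change when the whole object is rotated. The geon names, relations, and object definitions below are invented for illustration and drastically simplify Biederman's actual proposal.

```python
# Each stored model is a frozenset of (geon, geon, relation) triples --
# a crude stand-in for a Recognition-by-Components structural description.
# These geon labels and relations are hypothetical, not Biederman's catalogue.
MODELS = {
    "mug":      frozenset({("cylinder", "arc", "side-attached")}),
    "suitcase": frozenset({("block", "arc", "top-attached")}),
    "lamp":     frozenset({("cylinder", "cone", "above")}),
}

def recognize(description):
    """Return the stored model whose geon description matches the input.
    The description carries no viewpoint information, so the same answer
    comes back from any angle at which the geons remain visible."""
    for name, model in MODELS.items():
        if model == frozenset(description):
            return name
    return None

print(recognize([("cylinder", "arc", "side-attached")]))  # mug
```

Note how the matching step never consults an angle or an image: that is the sense in which structural models "produce representations of objects regardless of the viewpoint of the observer".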
Marr and Biederman both proposed structural models, which can be problematic because structural models strive to be viewpoint-independent: they produce representations of objects regardless of the observer’s viewpoint (Eysenck & Keane, 2010). However, modern theories have emphasized that objects are recognized differently depending on the observer’s view (Peissig & Tarr, 2007). Gauthier and Tarr (2002) found that participants learning to identify ‘greebles’ became generally faster as their expertise developed, yet their performance remained strongly viewpoint-dependent throughout the experiment. This is consistent with Tarr and Bulthoff’s (1995) finding that object recognition is viewpoint-dependent (Eysenck & Keane, 2010).
This brings us to the debate between viewpoint independence and viewpoint dependence: does the angle from which we view an object affect our ability to recognize it? Viewpoint independence means that we can identify an object from any viewpoint, based on a structural description of its parts and their relations, as long as its key features are visible. One viewpoint-independent model is Marr and Nishihara’s computational model, which holds that every object has a central axis and that we identify the object through this axis. Similarly, the Recognition-by-Components model states that every object is made of geometric shapes, and that we identify objects through the combination of these geons. Viewpoint independence has its weaknesses. Models taking this perspective have difficulty discriminating between objects of the same class. An object can also be decomposed in multiple ways, and it is difficult to find clear edges and junctions in natural objects such as rocks. Marr’s model has also been criticized because objects such as rocks or trees have no symmetry, and therefore no key axis by which to identify them.
Viewpoint dependence relies on the assumption that all objects known to an individual are stored in his or her mind as a number of small, discrete prototypical forms, with prototype and generic views stored for every object (Eysenck & Keane, 2010). This means that if an object is not facing the same direction as our mental image of it, we mentally rotate the object. However, sometimes the geometric structure of the object is not explicit, in which case neither the structure nor the object can be identified. The theory also becomes problematic if the individual has no prior experience with the object, or if the object is too far removed from the prototype (Eysenck & Keane, 2010).
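The multiple-views idea can be caricatured as template matching: a few canonical views of each object are stored, and a new view is compared against all of them, with recognition cost growing with the angular distance to the nearest stored view, as in mental rotation. Everything below, including the stored view angles and the cost measure, is a hypothetical illustration rather than any published model.

```python
def nearest_view_cost(input_angle, stored_angles):
    """Angular distance in degrees from the input view to the closest
    stored view; a crude proxy for recognition time under mental rotation."""
    return min(min(abs(input_angle - a) % 360,
                   360 - abs(input_angle - a) % 360)
               for a in stored_angles)

# Hypothetical stored prototype views of a single object.
stored = [0, 90, 180]   # front, side, and back views

print(nearest_view_cost(10, stored))    # 10 -> close to a stored view, fast
print(nearest_view_cost(135, stored))   # 45 -> farther away, slower
```

The sketch captures the signature prediction of viewpoint-dependent accounts: performance degrades smoothly as the input view departs from the nearest familiar view, which is what the greeble studies reported.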
One way to illustrate how complex object recognition is, is to describe briefly how deficits in object recognition present themselves. Agnosia, a rare disorder that can result from a stroke or traumatic brain injury, is defined as a profound deficit in recognizing objects by sight (Humphreys & Riddoch, 1987). The diagnosis excludes patients whose problems can be explained by visual acuity, colour vision, or any other low-level deficit. It can also be distinguished from a general problem with semantic knowledge or naming by showing that patients can describe objects from memory, or recognize them through other senses such as touch (LN). Agnosia is typically assessed with line drawings. Historically, a distinction was made between two types of agnosia: apperceptive and associative. Apperceptive agnosics have problems perceiving visual features and cannot copy line drawings. Associative agnosics can copy line drawings, which suggests that their vision is intact but has become disconnected from semantic knowledge of the objects (Humphreys & Riddoch, 1987). More recently it has been found that this distinction is too simple, as agnosics show a range of abilities. Humphreys and Riddoch (1987) reported an account of an integrative agnosic who performed especially poorly when objects overlapped, which may indicate that the disorder stems from an inability to group features into a whole object. Even by investigating deficits in object recognition, then, it is difficult to explain why these specific deficits occur.
One more important aspect of object recognition has not yet been discussed: scene context. Natural scenes are usually made up of many different objects, so researchers find it surprising that people can recognize natural scenes very quickly (Biederman et al., 1974; Potter & Levy, 1969). It is possible that this type of recognition depends on global features rather than on perceiving the individual components of the scene. Our memory for scenes is also affected by top-down, or conceptual, factors. For example, Intraub (1997) described boundary extension: when reproducing a scene from memory, observers tend to extend its boundaries and depict the scene as zoomed out compared with the original. The reasons for this are not completely understood, but it is likely that what we remember is the context of the objects and the scene. It is clear that object and scene recognition interact; in particular, we interpret what we see according to the context. Palmer (1975) showed participants a picture of a scene (e.g., a kitchen), followed by a briefly presented object. The object was recognized more often when it was consistent with the scene (e.g., a loaf of bread). This is evidence that scene context can affect object perception, and that object recognition is at least partly susceptible to top-down influences.
This paper has touched on only a few of the principles and models in the field of object recognition; more recent theories and approaches have been proposed from the 1990s onwards. However, according to Peissig and Tarr (2007), regardless of the progress made since the models discussed here, the field still has a long way to go toward a comprehensive account of visual object recognition. The principles and models discussed in this paper can account only for basic-level categorization, not for subtle distinctions (LN). It is still not entirely known how humans perceive and recognize objects, but significant theoretical and empirical progress has been achieved, and that progress is the basis of what we do know.