Multi-View Object Segmentation in Space and Time

In this paper, we address the problem of object segmentation in multiple views or videos when two or more viewpoints of the same scene are available. We propose a new approach that propagates segmentation coherence information in both space and time, allowing evidence in one image to be shared over the complete set. To this aim, the segmentation is cast as a single efficient labeling problem over space and time, solved with graph cuts. In contrast to most existing multi-view segmentation methods, which rely on some form of dense reconstruction, ours only requires a sparse 3D sampling to propagate information between viewpoints. The approach is thoroughly evaluated on standard multi-view datasets, as well as on videos. With static views, results compete with state-of-the-art methods but are achieved with significantly fewer viewpoints. With multiple videos, we report results that demonstrate the benefit of segmentation propagation through temporal cues.


Introduction
Segmenting objects of interest in images is a key prerequisite for many applications of computer vision, e.g., matting and compositing in post-production, image indexing, video compression, and 3D reconstruction. Segmentation from multiple images of the same object has gained interest in recent years as a means to remove the need for shape or appearance priors, or for user interaction [20], in monocular approaches. This paper addresses the task of unsupervised multiple-image segmentation of a single physical object, possibly moving, as seen from two or more calibrated cameras, which we refer to as multi-view object segmentation (MVOS); see Fig. 1 for a first example. As noted by [25], this is an intrinsically challenging problem, especially when the number of views is small and the viewpoints far apart. Indeed, it then becomes difficult to rely on shared appearance models of the object between views, while parts of the background seen from several viewpoints will present similar aspects. In that respect, the MVOS problem significantly differs from the object cosegmentation problem [21,14], which assumes shared appearance models for the foreground but different backgrounds. (This work was sponsored by the OSEO-funded Quaero Programme, and partially sponsored by the European Commission FP7 project React.)
In most applications where viewpoints see a single scene and object, calibration is available or computable using off-the-shelf tools such as Bundler [23]. This includes static camera setups such as performance capture studios [11], static camera networks used for surveillance, and even crowd-sourced data of a single shape such as a monument [23]. This has also been shown to hold for sparse setups, with as few as four handheld cameras shooting video sequences of a moving object [12]. Because this geometric information is available, a key to solving MVOS is making good use of it to spatially propagate evidence across viewpoints.
We propose a new iterative formulation (§4) of multi-view object segmentation that uses a joint graph cut linking pixels through space and time. This formulation is inspired by the efficient tools developed by the cosegmentation community to correlate segmentations of different views [13,24]; it differs in that our framework introduces the graph coupling at the geometric rather than the photometric level. The method brings several key contributions, validated in §6. First, it is noticeably efficient in convergence and computational requirements, using only sparse inter-view links. Second, the graph structure intrinsically produces conservative and inclusive segmentations of the object of interest. Third, it can handle few viewpoints, much further apart than most state-of-the-art approaches require, a situation that naturally arises in practice and for which none of the previous works reports results below 8 viewpoints. Fourth, the framework straightforwardly extends to the use of temporal links for multiple video sequences, propagating momentarily reliable segmentation evidence across time in multi-view setups. To the best of our knowledge, this is the first approach to leverage temporal cues for multiple video segmentation, with significant future applications.

Related Work

Multi-View Object Segmentation
Zeng et al. [29] coined the problem, and proposed an initial rudimentary silhouette-based algorithm for building segmentations consistent with a single 3D object. Many methods follow this initial trend by building explicit 3D object reconstructions and alternating with image segmentations of the views based on foreground/background appearance models [7,18,11,19]. Different object representations and cues are used, most often silhouette-based and volumetric [7], depth-based [10,11], or stereo-based [16], and a range of techniques are used to regularize occupancy, enforcing smoothness criteria with graph cuts [7,11], or global joint optimization of both [15]. A significant portion of existing works require user guidance and interaction [15,28]. While generally a successful strategy, there is undeniable motivation to take dense 3D reconstruction out of the loop when processing a small number of viewpoints: image-based 3D models only achieve acceptable quality for a dozen views or more. Some of the most successful MVOS approaches to date [16] strongly rely on a large number of viewpoints and small baselines. Our goal is to achieve equivalent quality with only a few, possibly widespread viewpoints. Our focus is therefore on how to propagate information between views and across time for consistent pixel labeling, not on precise 3D modeling.
Propagating geometric consistency information from one view to another has proven surprisingly difficult. Indeed, the simple 3D definition of geometric consistency given above often leads to a complex counterpart in images, with regions carved away when no compound occupancy from other views is observed along a pixel's epipolar lines, e.g. [17,22]. In [6] a graph cut/superpixel framework is used with constraints derived from epipolar geometry, jointly with soft stereo and depth binning. This requires semi-circular setups with a short baseline (as in [16]) and specific heuristics to sparsify the superpixel interaction matrix, with unclear complexity outcome.
We draw inspiration from a method that uses only a sparse 3D occupancy sampling of the scene [9], which proves to be a successful and efficient alternative to 3D reconstruction. While 3D samples embody spatial consistency between views, a specific construct is nonetheless still required to properly model information transfer between images and across time. In this paper we investigate graph representations for that purpose.

Cosegmentation Approaches
Cosegmentation was first coined in the work of Rother et al. [21] as the simultaneous binary segmentation of image parts in an image pair, and by extension in more images [3,14,25]. The key assumption of these methods is the observation of a common foreground region, or of objects sharing appearance properties, against a background with higher variability across images. As noted by [25], cosegmentation increasingly refers to diverse scenarios, ranging from user-guided segmentation to segmentation of classes of objects rather than instances. MVOS differs by only considering geometric cues for inter-view propagation of segmentations, and focuses on single object instances. Interestingly, some cosegmentation methods [13] have created tools to link segmentations across views based on appearance, formulating segmentation as a joint graph cut on the views. Similarly, we introduce a graph structure specifically designed to propagate geometric cues for MVOS, rather than photometric cues.

Monocular Video Segmentation
Recent work examines the use of temporal cues for monocular video segmentation. Such cues may be used to propagate manually specified segmentation information [26,2,27], or in a completely automated fashion [8,5]. Cues are propagated either deterministically, e.g., based on optic flow [2], probabilistically by weighting different flow or link hypotheses [5,27], or by learning low-level variation statistics [8]. Interestingly, some approaches construct a graph over the full 2D+t volume to link segmentations in time [26]; we propose to unify intra-view, inter-view, and temporal links in a single graph-based framework. To the best of our knowledge, our method is the first to propose such a unification and a temporal treatment of the MVOS problem.

Overview
We adopt the same definition of foreground as in [17]: an object of interest should satisfy two constraints, namely be fully visible in all considered views, and have a general appearance different from the background's. To this end, we cast the MVOS problem as a joint labeling problem among the n input views, and t time steps if available, governed by the single MRF energy discussed in §4. First, in order to ensure inter-view propagation of segmentation information, we build on the idea that sparse 3D points (or samples) randomly picked in the region of interest (the common field of view of all the cameras) provide sufficient links between images [9]. Each sample creates links in the graph between itself and the pixels at its projections, whose strength reflects the object coherence probability of the sample. Second, to ensure efficient intra-frame propagation, we compute a superpixel oversegmentation of each image, and define two neighborhood sets on each superpixel in the graph, based on image-space and texture-space proximity. Resorting to superpixels also allows us to benefit from richer region characterizations, reducing color-space ambiguity. Third, the resulting MRF energy is minimized using s-t mincut [4], and the resulting segmented regions are used to re-estimate per-view foreground/background appearance models, which are in turn used to update the 3D sample object coherence probabilities. We present the details of each stage of the algorithm below.

Figure 2. Overview: superpixels are computed using SLIC [1]. Links between superpixels, shown as white lines, are estimated using superpixel descriptors. The iterative process alternates between a graph cut on superpixels and color model updates. At convergence, the segmentation on pixels is computed.

Formulation
We are given a set of input images $\mathcal{I}^t = \{I_1^t, \dots, I_n^t\}$ at instant $t$. For each image $i$ at $t$ we have the set $\mathcal{P}_i^t$ of its superpixels $p$. We use the superscript $t$ for time on all terms, generally keeping it implicit for concision unless terms from different instants are involved. Segmenting the object in all the views consists in finding for every superpixel $p \in \mathcal{P}_i^t$ its label $x_p \in \{f, b\}$, the foreground and background labels. We denote $\mathcal{S}^t$ the set of 3D samples used to model dependencies between the views at instant $t$. These points are uniformly sampled in the common visibility volume.
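As a concrete illustration of the sampling stage, the following sketch draws candidate 3D points uniformly in a bounding box and keeps those whose projections fall inside every view. This is a minimal numpy sketch, not the paper's code; the function name, the 3x4 projection matrices, and the bounding-box argument are all illustrative assumptions.

```python
import numpy as np

def sample_common_visibility(projections, image_sizes, bbox, n_samples, rng=None):
    """Uniformly draw 3D points in a bounding box and keep those that
    project inside every view (the common visibility volume).

    projections: list of 3x4 camera matrices; image_sizes: list of (w, h);
    bbox: (min_xyz, max_xyz). All names are illustrative.
    """
    rng = np.random.default_rng(rng)
    lo, hi = np.asarray(bbox[0], float), np.asarray(bbox[1], float)
    pts = rng.uniform(lo, hi, size=(n_samples, 3))
    homog = np.hstack([pts, np.ones((n_samples, 1))])        # (n, 4)
    keep = np.ones(n_samples, dtype=bool)
    for P, (w, h) in zip(projections, image_sizes):
        uvw = homog @ np.asarray(P, float).T                 # (n, 3)
        in_front = uvw[:, 2] > 0
        # guard the division for points behind the camera
        uv = uvw[:, :2] / np.where(in_front, uvw[:, 2], 1.0)[:, None]
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        keep &= in_front & inside
    return pts[keep]
```

In practice one would draw many more candidates than needed and keep a fixed budget of surviving samples.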

MRF Energy Principles
Given the superpixel decomposition and 3D samples (shown in Fig. 3), we wish the MRF energy to reward a given labeling of all superpixels according to the following principles, each leading to the MRF energy terms described in the next subsections.

Individual appearance. The appearance of a superpixel should comply with image-wide foreground or background models, depending on its label.

Appearance continuity. Neighboring superpixels likely have the same label if they have similar appearance.

Appearance similarity. Two superpixels with similar color/texture are more likely to be part of the same object and thus more likely to have the same label. These superpixels may not be neighbors, due to occluding objects, etc.

Multi-view coherence. 3D samples are considered object-consistent if they project to foreground regions with high likelihood.

Projection constraint. Assuming sufficient 3D sampling of the scene, a superpixel should be foreground if it sees at least one object-consistent sample in the scene. Conversely, a superpixel should be background if it sees no object-consistent 3D sample.

Time consistency. In the case of video data, superpixels in a sequence likely have the same label when they share similar appearance and are temporally linked through an observed flow field (e.g., optic flow, SIFT flow).

Intra-view appearance terms
We use the classic unary data and binary spatial smoothness terms on superpixels, to which we add non-local appearance similarity terms on superpixel pairs for broader information propagation and a finer appearance criterion.
Individual appearance term. We denote $E_c$ the unary data term related to each superpixel's appearance. We characterize appearance by the sum of pixel-wise log-probabilities of being predicted by an image-wide foreground or background appearance distribution:

$$E_c(x_p) = -\sum_{r \in R_p} \log P(I_i^r \mid H_i^{x_p}),$$

with $R_p$ the set of pixels contained in superpixel $p$. To model appearance we use a combination of color and texture histograms; $I_i^r$ is an 11-dimensional vector that includes both color and texture information. Appearance histograms are shared across all frames of a given viewpoint for video sequences. Texture is defined as the gradient magnitude response at 4 scales and the Laplacian at 2 scales. As an initialization step, k-means is run separately on color and texture values. This clustering is used to create the texture and color vocabularies on which the foreground and background histograms ($H_i^F$ and $H_i^B$) are computed.

Appearance continuity term. This binary term, denoted $E_n$, discourages the assignment of different labels to neighboring superpixels that exhibit similar appearance. It takes the form of a contrast-sensitive Potts model [4]. To model this similarity we use the previously defined texture and color vocabularies to create superpixel descriptors, which consist of histograms over the vocabulary. The appearance descriptor of a given superpixel $p$ is noted $A_p$. Let $N_n^{i,t}$ be the set of adjacent superpixel pairs in view $i$ at time $t$. For $(p,q) \in N_n^{i,t}$, the proposed $E_n$ is inversely proportional to the distance between the two superpixel descriptors:

$$E_n(x_p, x_q) = [x_p \neq x_q]\, \exp\!\left(-\frac{d(A_p, A_q)}{\langle d(A_p, A_q) \rangle}\right),$$

where $d(\cdot,\cdot)$ is the $\chi^2$ distance between the superpixel descriptors and $\langle d(A_p, A_q) \rangle$ denotes the expectation over all neighboring superpixel pairs.
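To make the continuity term concrete, here is a minimal sketch of the $\chi^2$ distance between superpixel histograms and of a contrast-sensitive weight for neighboring pairs. The exponential falloff normalized by the mean pair distance is our assumption of the exact functional form; the text only states that the weight is inversely related to the descriptor distance.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def continuity_weights(descriptors, pairs):
    """Contrast-sensitive weights for adjacent superpixel pairs.

    Assumes the common exponential form w = exp(-d / <d>), where <d> is
    the mean chi2 distance over all neighboring pairs (our assumption).
    descriptors maps a superpixel id to its histogram.
    """
    dists = np.array([chi2_distance(descriptors[p], descriptors[q])
                      for p, q in pairs])
    mean_d = dists.mean() if len(dists) else 1.0
    return np.exp(-dists / max(mean_d, 1e-10))
```

Identical descriptors get the maximal weight 1, so cutting between them is maximally penalized.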
Appearance similarity term. To favor consistent labels and efficient propagation among similar superpixels, we introduce a second binary term $E_a$ of the same form as $E_n$ but defined non-locally. Retrieving for each superpixel its k nearest neighbors under the $\chi^2$ distance, we define the set $N_a^{i,t}$ of similar superpixel pairs, and for each of these pairs:

$$E_a(x_p, x_q) = [x_p \neq x_q]\, \exp\!\left(-\frac{d(A_p, A_q)}{\langle d(A_p, A_q) \rangle}\right).$$
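A sketch of how the non-local neighborhood could be built, pairing each superpixel with its k nearest neighbors under the $\chi^2$ distance. Pure numpy with illustrative names; the paper does not specify the retrieval procedure.

```python
import numpy as np

def similarity_pairs(descriptors, k):
    """Build the non-local neighborhood: each superpixel is paired with
    its k nearest neighbors under the chi-squared distance.

    descriptors: (n, bins) array of normalized histograms.
    Returns a sorted list of unordered index pairs (i, j), i < j.
    """
    d = np.asarray(descriptors, float)
    n = len(d)
    # pairwise chi2 distances, vectorized over all (i, j)
    num = (d[:, None, :] - d[None, :, :]) ** 2
    den = d[:, None, :] + d[None, :, :] + 1e-10
    dist = 0.5 * np.sum(num / den, axis=2)
    np.fill_diagonal(dist, np.inf)          # exclude self-matches
    pairs = set()
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:
            pairs.add((min(i, int(j)), max(i, int(j))))
    return sorted(pairs)
```

The set semantics deduplicate symmetric matches, so each unordered pair appears once.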

Inter-view geometric consistency terms
To propagate inter-view information, we use a graph structure connecting a 3D sample to the pixels it projects on. While this leads to a structure similar to [13], the latter builds inter-pixel hard links that are always active, based on common histogram binning of pixels. A key difference we have to cope with is that the geometric consistency of samples may change over iterations because of the evolving segmentations. We thus evaluate before each iteration an "objectness" probability measuring consistency with the current segmentation, and use it to reweight the propagation strength of the sample, through a per-sample unary term as follows.
Sample objectness term. Let $P_s^f$ be the coherence probability of a sample $s \in \mathcal{S}^t$. $P_s^f$ is computed as a conservative probability of common foreground coherence based on the views' histogram sets, as in [9]. We associate a unary term and a label $x_s$ to sample $s$, allowing the cut algorithm the flexibility of deciding on the fly whether to include $s$ in the object segmentation, based on all MRF terms:

$$E_s(x_s) = \begin{cases} -\log P_s^f & \text{if } x_s = f,\\ -\log(1 - P_s^f) & \text{if } x_s = b.\end{cases}$$

Sample-pixel junction term. To ensure projection consistency, we connect each sample $s$ to the superpixels $p$ it projects onto in all views, which defines a neighborhood $N_s$. We define a simple binary term $E_j$ as follows:

$$E_j(x_s, x_p) = \begin{cases} \infty & \text{if } x_s = f \text{ and } x_p = b,\\ 0 & \text{otherwise.}\end{cases}$$

The key property of this energy is that, as shown in Fig. 4, no cut of the corresponding graph may simultaneously assign a superpixel $p$ to the background and a sample $s$ that projects on $p$ to the foreground. It thus enforces the following desirable projection consistency property: labeling a superpixel $p$ as background is only possible if all the samples $s$ projecting on it are also labeled background.

Figure 4. Relation between samples and superpixels. If a sample $s$ is labeled as foreground then the superpixels at its projection positions cannot be labeled as background; this would correspond to an impossible cut, as illustrated here.
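The impossible-cut property can be checked on a toy example by brute force: with an infinite junction penalty, no minimum-energy labeling can set a sample to foreground while a superpixel it projects onto is background. The costs below are arbitrary illustrative numbers, not values from the paper.

```python
import itertools

INF = float("inf")

def junction_energy(x_s, x_p):
    """E_j: infinite penalty when the sample is foreground but a
    superpixel it projects onto is background; zero otherwise."""
    return INF if (x_s == "f" and x_p == "b") else 0.0

def best_labeling(unary_s, unary_p):
    """Exhaustively minimize a toy energy: one sample linked to two
    superpixels. unary_s and unary_p map labels to (illustrative) costs."""
    best, best_e = None, INF
    for x_s, x_p1, x_p2 in itertools.product("fb", repeat=3):
        e = (unary_s[x_s] + unary_p[0][x_p1] + unary_p[1][x_p2]
             + junction_energy(x_s, x_p1) + junction_energy(x_s, x_p2))
        if e < best_e:
            best, best_e = (x_s, x_p1, x_p2), e
    return best

# Even if background is individually cheaper for one superpixel, a
# strongly foreground sample forces it to foreground:
print(best_labeling({"f": 0.0, "b": 10.0},
                    [{"f": 1.0, "b": 0.5}, {"f": 0.2, "b": 5.0}]))
# prints ('f', 'f', 'f')
```

In the actual graph, the same behavior is obtained by giving the sample-to-superpixel edge an infinite (in practice, very large) capacity in the direction that would realize the forbidden cut.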
The converse property, inclusion of segmentations in the sample's projected set, cannot be ensured: a superpixel can be labeled foreground even though it sees no foreground sample. Preventing this would require enforcing that a foreground superpixel $p$ sees at least one foreground sample $s$, which can only be expressed with higher-order MRF terms. We opt to keep a first-order MRF by modeling this behavior through an iteratively reweighted unary term, computed as follows.
Sample projection term. The desired behavior can be achieved by associating to each superpixel $p$ a sample reprojection term $P(x_p \mid V_p)$. Its purpose is to discourage the foreground labeling of $p$ when no sample was labeled foreground in the 3D region $V_p$ seen by the superpixel, and conversely to encourage foreground labeling as soon as a sample $s$ in $V_p$ is foreground. This leads to a simple unary term:

$$E_v(x_p) = -\log P(x_p \mid V_p).$$

Time consistency terms
In the case of video segmentation, the idea is to benefit from information available at different instants and to propagate consistent foreground/background labeling across the frames of the same viewpoint. A set $N_f^i$ of related superpixels between frames can be estimated by matching interest points or using optical flow. The propagation is done through an energy term $E_f$ that enforces consistent labeling of linked superpixels $(p^t, q^{t+1}) \in N_f^i$:

$$E_f(x_{p^t}, x_{q^{t+1}}) = \begin{cases} \theta_f & \text{if } x_{p^t} \neq x_{q^{t+1}},\\ 0 & \text{otherwise.}\end{cases} \qquad (7)$$

In this equation, $\theta_f$ depends on the considered links: in the case of SIFT-based links, $\theta_f$ is inversely proportional to the descriptor distance between the matched points, so that a good match constrains the two linked superpixels to have the same label. In the case of optical flow, it is proportional to the estimated flow quality.
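A hedged sketch of how temporal links might be instantiated with optical flow: superpixel centers in frame t are advected by the flow and matched to the nearest center in frame t+1, with a link strength that decays with the match distance. The decay is our stand-in for $\theta_f$; the paper does not give its exact form, and all names are illustrative.

```python
import numpy as np

def temporal_links(centers_t, centers_t1, flow, max_dist=10.0):
    """Link superpixels across frames by advecting superpixel centers
    with a flow field.

    centers_t, centers_t1: (n, 2) arrays of superpixel centers (x, y);
    flow: callable mapping a center to its 2D displacement.
    Returns (i, j, theta) triples where theta decays with the match
    distance, standing in for the link strength of Eq. (7).
    """
    links = []
    for i, c in enumerate(np.asarray(centers_t, float)):
        target = c + np.asarray(flow(c), float)
        dists = np.linalg.norm(np.asarray(centers_t1, float) - target, axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:          # reject implausible matches
            links.append((i, j, float(np.exp(-dists[j] / max_dist))))
    return links
```

A perfect match yields the maximal strength 1; matches beyond `max_dist` produce no link at all.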

MRF energy and graph construction
Let $X$ be the conjunction of all possible sample and superpixel labels. Our MRF energy can thus be written with three groups of terms: the intra-view group, the inter-view group with its own multi-view unary and binary terms, and finally the time consistency group with only binary terms between successive instants $t$ and $t+1$; $\lambda_1, \lambda_2, \lambda_3$ are relative weighting constants. Finding a multi-view segmentation for our set of images, given the set of histograms $H_i^B$ and $H_i^F$ and the probabilities $P_s^f$, consists in finding the labeling $X$ minimizing:

$$E(X) = \sum_{p} E_c(x_p) + \lambda_1 \sum_{(p,q) \in N_n \cup N_a} E_{n,a}(x_p, x_q) + \lambda_2 \Big( \sum_{s} E_s(x_s) + \sum_{p} E_v(x_p) + \sum_{(s,p)} E_j(x_s, x_p) \Big) + \lambda_3 \sum_{(p^t, q^{t+1}) \in N_f} E_f(x_{p^t}, x_{q^{t+1}}),$$

where the sums run over all views and instants. The submodularity constraint being satisfied in our model, we can build an s-t graph $G$ whose min-cut provides the solution to our energy minimization problem. This graph contains the two terminal nodes (source and sink), one node for each superpixel, and one node for each 3D sample $s$. Edges are added between superpixels and samples according to the energy terms previously defined. Fig. 3 shows the resulting graph.
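The grouped energy can be evaluated for a candidate labeling with a small helper; the grouping and the placement of $\lambda_1, \lambda_2, \lambda_3$ follow our reconstruction above, and the term containers and names are illustrative.

```python
def total_energy(x, terms, lambdas):
    """Evaluate the joint MRF energy for a labeling x (dict node -> 'f'/'b').

    terms is a dict of lists mirroring the three groups:
      'Ec': [(p, fn)], 'En'/'Ea': [(p, q, fn)],            # intra-view
      'Es'/'Ev': [(node, fn)], 'Ej': [(s, p, fn)],         # inter-view
      'Ef': [(p, q, fn)]                                   # temporal
    Each fn maps labels to a cost. The weighting mirrors the paper's
    lambda_1..3 but its exact placement is our reconstruction.
    """
    l1, l2, l3 = lambdas
    e = sum(fn(x[p]) for p, fn in terms.get("Ec", []))
    e += l1 * sum(fn(x[p], x[q]) for p, q, fn in
                  terms.get("En", []) + terms.get("Ea", []))
    e += l2 * (sum(fn(x[n]) for n, fn in
                   terms.get("Es", []) + terms.get("Ev", []))
               + sum(fn(x[s], x[p]) for s, p, fn in terms.get("Ej", [])))
    e += l3 * sum(fn(x[p], x[q]) for p, q, fn in terms.get("Ef", []))
    return e
```

Such an evaluator is useful to sanity-check the graph construction: the cut value returned by the min-cut solver must equal the energy of the labeling it induces.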

Computational approach
Similar to most state-of-the-art segmentation methods, we adopt an iterative scheme alternating between the graph cut optimization above and an update of the color models. The common visibility constraint can be used to initialize the color models as in [17]. Fig. 5 gives an overview of the whole method. The extraction, description, and linking of superpixels is done once, at initialization time. In the iterative process, the unary terms (objectness, superpixel sample projection, and silhouette labeling probabilities) are computed using the appearance models of the previous iteration. The algorithm converges when no superpixel is relabeled from one iteration to the next. The superpixel labeling at convergence is used to estimate foreground/background appearance models, which are then used in a standard graph cut segmentation at the pixel level, with unary terms based on appearance and smoothing binary terms based on color dissimilarity.
In the case of video segmentation, the same scheme is applied over a sliding window of 5-10 frames. In this situation additional cues can be used, such as considering non-moving regions as background.
Initialization (for each sequence instant)
1. Divide the images into superpixels.
2. Compute descriptors for all the superpixels.
3. Link similar superpixels.
4. Link superpixels from successive frames.
5. Randomly draw 3D sample positions.
6. Initialize background/foreground appearance models.
Iterated steps
7. Compute the unary terms of the energy from the models.
8. Minimize the energy with s-t mincut.
9. Update the color models from the graph cut results.
Finalization
10. Final segmentation: standard graph cut segmentation at the pixel level using models derived from the superpixel segmentation.

Figure 5. Algorithm overview.
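The alternation of the iterated steps reduces to a simple control loop; the sketch below captures only the control flow, with the model update and the graph cut injected as callables (placeholders, not the paper's implementations).

```python
def iterate_until_convergence(init_labels, update_models, graph_cut,
                              max_iters=10):
    """Alternation at the heart of the method: recompute the appearance
    models from the current labeling, solve the graph cut, and repeat.
    Stops when no superpixel changes label. update_models and graph_cut
    are injected callables; this is a control-flow sketch only.
    """
    labels = list(init_labels)
    for _ in range(max_iters):
        models = update_models(labels)
        new_labels = graph_cut(models)
        if new_labels == labels:       # convergence: no relabeling
            break
        labels = new_labels
    return labels
```

The cap on iterations mirrors the reported behavior of convergence within about ten iterations.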

Experimental protocol
We implemented our approach using publicly available software for superpixel segmentation (SLIC [1]) and Kolmogorov's s-t mincut implementation [4]. We use superpixel sizes of 30-50 pixels to ensure oversegmentation, obtaining around 2000 superpixels per image. For the appearance models, we run k-means on texture and color values, quantizing texture and color into respectively 60 and 150 "words". The region of interest is computed by keeping only the 3D samples in the common visibility domain, i.e. those which project inside all views. We randomly generate 100k 3D samples for all tests. The only free parameters of the method, $\lambda_1$, $\lambda_2$ and $\lambda_3$, were respectively set to 2.0, 4.0, and 0.05 for all datasets; no particular sensitivity was observed to these settings. The algorithm is initialized very weakly, by setting $H_i^F$ to the statistics of the projected region of the common visibility domain of all views, which is quite large on all datasets and only eliminates about 25% of pixels on the outer regions of the image. The background histograms $H_i^B$ are set to the statistics of the known background (outside the projection of the visibility domain). Computation time depends on the number of viewpoints and the number of frames. In a static case with 10 viewpoints, each iteration of the algorithm takes less than 10s with our C++ implementation and convergence is reached in fewer than 10 iterations. Tests were run on a 2.3 GHz Intel i7 PC with 4GB of memory.
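The vocabulary construction can be sketched with a plain k-means over feature vectors; this is a minimal numpy version with illustrative defaults, not the implementation used in the experiments.

```python
import numpy as np

def build_vocabulary(features, n_words, iters=20, rng=0):
    """Minimal k-means to quantize per-pixel color/texture features into
    'words' (the paper uses 150 color words and 60 texture words).

    features: (n, dim) array; returns (centers, assignments).
    """
    feats = np.asarray(features, float)
    r = np.random.default_rng(rng)
    # initialize centers on distinct input points
    centers = feats[r.choice(len(feats), n_words, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest center
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each center to the mean of its assigned features
        for k in range(n_words):
            pts = feats[assign == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return centers, assign
```

Descriptors are then histograms of these word assignments over the pixels of each superpixel.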

Qualitative results
To validate our approach, we ran our implementation on a dozen challenging datasets. Note that few MVOS datasets from the existing literature are made publicly available, making comparisons difficult. We obtained datasets from two state-of-the-art approaches: COUCH, BEAR, CAR, and CHAIR1 from [16], which we use for qualitative and quantitative evaluation, and BUSTE from [17] and PLANT, which we use for qualitative evaluation.
Figures 6 to 8 show the results of our method on the various datasets. We show the graph cut result on superpixels at convergence and the final segmentation at the pixel level. We illustrate the resilience of the algorithm, in particular with low numbers of viewpoints, on all the datasets. Very good results are obtained with only 3 widespread viewpoints (e.g., Fig. 1). This corresponds to a scenario where approaches that need numerous viewpoints, e.g. [16], are likely to fail.

In complex scenarios, such as in Fig. 8, approaches relying only on color [9] fail to segment the foreground objects, whereas our approach benefits from a more complex appearance model. Fig. 7 shows that what is considered the foreground object depends on the viewpoints. For the first example with 8 viewpoints, the table is seen by all the views and is identified as part of the foreground. When more viewpoints are added, the table is no longer entirely seen by all the cameras, and it is therefore segmented as background.

Figure 8. Results on the PLANT dataset (3 views) with a qualitative comparison with [9]. Our method benefits from a richer appearance model and from intra-image consistency constraints.
Using all the views, many cameras see the black elements in the background very close to the black base. They are then cut out from the foreground and only the statue is left.

Quantitative and Comparative results
To illustrate the strength of the approach and for the purpose of comparison, we use the same protocol as [16], computing accuracy as the proportion of correctly labeled pixels (Fig. 9). We evaluate here the sensitivity of our approach to the number of viewpoints and the quality of the segmentation result compared to state of the art approaches [9,16,25], by randomly picking 10 viewpoint subsets for a given tested number of viewpoints and averaging results.
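The evaluation metric itself is straightforward; a minimal sketch, assuming binary label maps of identical shape:

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Accuracy as in the protocol of [16]: the fraction of pixels whose
    predicted foreground/background label matches the ground truth."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())
```

In the protocol above, this score is averaged over 10 random viewpoint subsets for each tested number of viewpoints.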
Clearly Fig. 9 shows that our approach exhibits very little sensitivity to the number of viewpoints and achieves excellent segmentation results even with only 3 widespread viewpoints. Let us emphasize the excellent performance of the algorithm on CAR and CHAIR1 datasets, despite the very low number of viewpoints used and the challenging nature of color ambiguities in the datasets.
The differences in segmentation precision between approaches are mainly due to difficult color ambiguities in the scene, such as shadows that appear consistent with the hypotheses of both geometric and photometric cosegmentation methods. It should be noted that in [16], depth information and plane detection help significantly, especially through the identification of the ground plane, which eliminates some ambiguities at the price of requiring more viewpoints to obtain stereo.

Video segmentation results
In the case of video sequences, our framework has the ability to propagate multi-view segmentation evidence over time. It also enables temporal evidence from a given viewpoint, e.g., a static background or a moving foreground, to be propagated to other viewpoints. These cues can help resolve local segmentation ambiguities in a few views in time or space. In order to demonstrate these principles, we evaluated the approach on two datasets, DANCERS and HALF-PIPE, from [11] and [12] respectively (more are available as supplemental material). The first consists of 8 cameras in an indoor setup, whereas the second is captured with 4 handheld cameras in a challenging outdoor environment. Fig. 10 shows segmentation results with and without temporal consistency. Results on the DANCERS sequence (first row) show how temporal evidence helps resolve background ambiguities. This is achieved by taking advantage of pixels with static values when building the background model. With the HALF-PIPE video dataset (Fig. 10, second and third rows), we experiment with the propagation in time and space of user inputs. In this dataset, the complex nature of the environment, the handheld cameras in general motion, the non-static backgrounds, and the few, widespread viewpoints make segmentation very challenging. As shown in Fig. 10, specifying ambiguous foreground/background regions with two strokes in a single view (second row, left image) is sufficient to obtain visually satisfying results. This demonstrates that cues in an image can benefit other images with different viewpoints and at different times.

Conclusion
We have presented a new approach to the MVOS problem based on iterated joint graph cuts. To our knowledge, this is the first unified solution that deals with intra-view, inter-view, and temporal cues for multi-view image and video segmentation within a single consistent MRF model. The approach is shown to cope with a low number of widespread viewpoints, often achieving state-of-the-art quality with only three wide-baseline views. The algorithm has been demonstrated on very challenging datasets, including MVOS segmentation with videos from four moving handheld cameras. We believe the framework is a solid basis to explore more complex multi-view motion models, which we suspect may further improve segmentation quality in the video MVOS context.

Figure 9. Quantitative evaluation of our approach with a static scene. The graph on the left shows performance with respect to the number of images. The table on the right presents comparisons with state-of-the-art approaches (number of views, accuracy). Notice that our approach achieves equivalent segmentation results with significantly fewer images than other approaches.