1 Introduction
Visual similarity plays an important role in visual recognition tasks such as object detection and scene understanding [11, 17]. A visual similarity function returns a score of how likely two instances (e.g., images or videos) are to share similar semantic concepts (e.g., persons, cars). With this perspective we propose the Group Membership Prediction (GMP) problem, where the goal is to determine how likely a collection of distinct items is to share the same semantic property. Fig. 1 depicts the idea of the GMP problem for two visual recognition tasks, person re-identification and kinship verification. In person re-identification (ReID) we are given a collection of images of persons captured from multiple views (cameras) and the goal is to detect whether or not they belong to the same person. In applications such as kinship verification, the underlying semantic property is more general, and the goal is to predict whether or not a collection of images share a familial relationship. GMP poses significant challenges on account of large variations in data, including lighting conditions, poses and camera views.
We introduce a novel parametric probability model for predicting group membership. Our key insight is that although the visual appearances can significantly vary, they share a set of latent variables common to all views. As depicted in Fig. 2, we can hypothesize “body parts” as shared latent variables for all the pedestrian images, while for kinship verification “facial landmarks” could be considered as the shared latent variables. Our model postulates that conditioned on the location of each shared latent variable (body part or facial landmark) the visual appearance at that location is conditionally independent for different views. This property leads to a natural way of measuring image similarities through comparison of visual similarities of the same shared latent variables across different views.
This postulate leads us to a joint parametric probability model that consists of view-specific and view-shared random variables. View-specific variables account for visual characteristics within a view, while view-shared variables account for the integration of multi-view information. The group membership likelihood factorizes into a tensor product consisting of data-independent and data-dependent factors. We learn the data-independent parameters (weights) discriminatively using bilinear classifiers. Finally, we marginalize these data tensors over all dimensions using the learned weights to obtain the group membership scores. Our experimental results on multi-camera person ReID and kinship verification demonstrate the good prediction performance and computational efficiency of our method.
1.1 Related Work
The GMP problem is closely related to multi-view learning (MVL). Indeed, our perspective of shared variables has been used before in the context of MVL [12, 29, 30]. Nevertheless, the goal of MVL, specifically in visual recognition, is different from ours. Namely, the objective of MVL is to leverage multiple sources of data (e.g., texts, images, videos) corresponding to the same underlying object (e.g., persons, events) to improve recognition performance [3, 14, 21, 30]. In contrast, our goal is to predict group membership among the multiple sources.
Person ReID is essentially a GMP problem, where each camera view can be taken as one of the instances. In the literature, however, most existing works treat this problem as an independent two-view classification task, mainly focusing on cleverly designing local features [10, 20, 26, 31, 36] or learning better metrics [15, 16, 18, 19, 25, 37]. Recently, Figueira et al. [12] proposed a semi-supervised learning method to fuse multi-view features for ReID so that the features agree on the classification results. Das et al. [5] considered group membership prediction in ReID by maximizing the summation of pairwise similarity scores using binary integer programming during testing. Unlike [5], we formulate the group membership problem as a learning problem, rather than a post-processing step to improve the matching rate.
Kinship verification is another GMP problem, where each family role (e.g., father, mother, son, daughter) can be considered as an instance. Similar to person ReID, existing works mainly focus on learning better features [6, 9] and better distance metrics [23] for pairwise classification [22]. Recently, Qin et al. [28] proposed a bilinear model to handle so-called tri-subject kinship verification problems. Fang et al. [8] proposed a sparse group lasso based feature selection method to determine whether a query person is from a specific family. Unlike [8, 28], our method targets a more general and challenging problem: it can be used to predict group membership for an arbitrary number of images with a fixed structure of family roles, such as father-son, father-mother-daughter, or grandfather-father-son-grandson.
2 Our Method
2.1 Problem Setting
Let be a group of persons from different views, where denotes the person and denotes its label (identity or family). Let be the image for the person with images in total. The goal of our method is to predict the following probability as group membership:
(1) 
Note that our problem setting is naturally applicable to the multiple instance cases. For example, during learning we allow multiple images to be associated with a person () in person ReID and kinship verification, as in the CUHK Campus [35] and Family101 [8] datasets.
While we have motivated our approach in the context of shared latent variables (body parts or facial landmarks), this information is unavailable during the training or testing phases. Furthermore, estimating locations of body parts and facial landmarks is known to be extremely challenging [2, 38]. Fortunately, in the context of the applications and problems that we are concerned with, the images are approximately aligned. In these images, foreground objects are centralized and well cropped. Currently most benchmark datasets are composed of such approximately aligned images, namely, the same body parts or facial landmarks appear roughly at similar locations. In such cases, pixel locations provide a good approximation of where body parts and facial landmarks are, and we utilize this property to bypass the detection challenge, while accounting for spatial misalignments with spatial kernels. Note that the issue of visual ambiguity of the shared variables still remains in our problem.
2.2 Parametric GMP Model
We introduce two latent variables to model the relationship between the class labels and data samples . The graphical representation of our parametric probability model is shown in Fig. 3(a), where denotes the viewspecific latent variable for view , denotes the viewshared latent variable, and denotes the number of images from view . Based on this model, we can factorize our group membership score as follows:
(2)  
where .
Model interpretation. To show the intuition of our parametric probability model, we consider the person ReID example in Fig. 2(a) in more detail. In the ReID problem the view-specific latent variables can be thought of as visual appearances of body parts of different persons, and the view-shared latent variable can be considered as these body parts which are shared among all the persons.
Then using Bayes rule we can expand Eq. 2. In particular, for the twoview ReID problem we see that the group membership score of the image pair and as . Since visual appearances in (or ) are posited to be independent given image (or ) and the parts , we can predict whether or not is equal to () by marginalizing the similarities of corresponding visual features of each individual part in both images ( and ) with some dataindependent weights ( and ). Similarly for the kinship example in Fig. 2(b) we can infer the group membership score by marginalizing the corresponding landmark similarities.
We take these data-independent weights as the model parameters for prediction, which are learned discriminatively.
2.3 Discriminative Learning of Model Parameters
2.3.1 Co-occurrence Tensor Representation
As discussed in Section 2.1, images are approximately aligned in the related applications. Specifically, in person ReID benchmarks the head is always located at the top of images, torso in the middle, and legs at the bottom. This typical structure has been exploited in designing discriminative features [10]. Therefore, with approximately aligned images we can bypass the problem of shared variable detection and directly utilize pixel locations as surrogates for locations of body parts or facial landmarks. Note that we can still allow small spatial misalignments by designing kernels to account for spatial distortions.
Recently, Zhang et al. [32] proposed an interesting feature representation to handle visual ambiguity and spatial distortion in images for person ReID. The basic idea in their method is to capture visual ambiguity using visual words, and match them at similar locations using a distance transform to handle spatial distortion. This results in a visual word co-occurrence matrix for a pair of images.
Inspired by [32], we propose a visual word co-occurrence tensor representation, built from multiple views, to represent the group of data samples. The Gaussian kernel proposed in [32] is computationally cumbersome. Instead, we design a truncated exponential function as the spatial kernel, with an arbitrary distance function inside, to improve flexibility and computational efficiency.
Let be a pixel location where the corresponding pixel in image is encoded using visual word , and be the pixel location with index . Then we define in Eq. 2 as follows:
(3)  
where denotes a distance function, denotes a predefined window size parameter for view , and is a predefined spatial scale parameter. Then if we take viewspecific and viewshared latent variables as the dimensions in the tensor to represent the group of data, the entry at index can be calculated as .
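As a rough illustration of the spatial kernel described above, the following sketch builds a visual-word-by-location feature matrix for a single image using a truncated exponential of the chessboard distance. It is not a reproduction of Eq. 3: the function names, the parameter names `window` and `sigma`, and the choice to aggregate pixel responses with a maximum are all assumptions made for this example.

```python
import math

def chessboard(p, q):
    """Chessboard (Chebyshev) distance between two pixel locations."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def truncated_exp_kernel(d, window, sigma):
    """exp(-d / sigma) inside the window, zero outside (assumed form)."""
    return math.exp(-d / sigma) if d <= window else 0.0

def feature_matrix(word_map, num_words, locations, window=2, sigma=1.0):
    """Build a num_words x len(locations) matrix for one image.

    word_map[r][c] is the visual-word index of pixel (r, c). Entry [z][h]
    is the strongest kernel response of any pixel encoded as word z near
    location h (taking the maximum is an assumption of this sketch).
    """
    feat = [[0.0] * len(locations) for _ in range(num_words)]
    for h, loc in enumerate(locations):
        for r, row in enumerate(word_map):
            for c, z in enumerate(row):
                k = truncated_exp_kernel(chessboard((r, c), loc), window, sigma)
                feat[z][h] = max(feat[z][h], k)
    return feat

# Toy 3x3 image quantized into 2 visual words, sampled at 2 locations.
word_map = [[0, 0, 1],
            [0, 1, 1],
            [1, 1, 1]]
locations = [(0, 0), (2, 2)]
F = feature_matrix(word_map, num_words=2, locations=locations,
                   window=1, sigma=1.0)
```

Truncating the kernel means only pixels inside the window need to be visited, which is where the computational saving over an untruncated kernel would come from; the loop above scans the full image only to keep the sketch short.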
2.3.2 General Learning Formulation
Here we introduce additional notations to simplify our exposition. Rather than directly representing a group of data samples as a tensor, we convert it into a matrix with dimensions and , respectively, where and denote the numbers of visual words for view and pixel locations in images. Further, we denote and as our model parameters in the form of vectors. Then our group membership score in Eq. 2 can be rewritten as a decision function as follows:
(4) 
where denotes the matrix transpose operator. If , we expect that all the members in the group have the same class label (and do not otherwise).
Let be a set of training data groups from views, where if all the class labels in group are the same (and otherwise). Due to the specific form in Eq. 4, we propose learning bilinear classifiers ( and ) for GMP inspired by [27], which used bilinear classifiers in a different context (binary classification):
(5) 
where denotes the loss function (hinge loss), are predefined regularization parameters, and denotes the norm of a vector.
Note that here we relax the probability constraint on and to real numbers so that Eq. 5 can be efficiently solved using alternating optimization. In each iteration, we fix one parameter ( or ) and use a standard support vector machine (SVM) solver to find the other parameter so that the objective value decreases monotonically, thus guaranteeing a locally optimal solution.
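The alternating scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `w`, `u`, `score`, and `train_bilinear` are hypothetical, the decision function is assumed to take the bilinear form w^T X u for a group matrix X, and a plain subgradient descent on the regularized hinge loss stands in for a standard SVM solver such as LIBLINEAR. Unlike an exact solver, this stand-in does not guarantee a monotone decrease of the objective.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(X, u):
    """X u for a matrix stored as a list of rows."""
    return [dot(row, u) for row in X]

def vecmat(w, X):
    """X^T w for a matrix stored as a list of rows."""
    return [sum(w[i] * X[i][j] for i in range(len(w)))
            for j in range(len(X[0]))]

def score(w, X, u):
    """Assumed bilinear decision value w^T X u."""
    return dot(w, matvec(X, u))

def linear_svm(feats, labels, init, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on lam/2 * ||v||^2 + mean hinge loss."""
    v = list(init)
    n = len(feats)
    for _ in range(epochs):
        grad = [lam * vi for vi in v]
        for f, y in zip(feats, labels):
            if y * dot(v, f) < 1.0:          # margin violated
                for i in range(len(v)):
                    grad[i] -= y * f[i] / n
        v = [vi - lr * g for vi, g in zip(v, grad)]
    return v

def train_bilinear(groups, labels, rounds=5):
    """Alternate between the two parameter vectors, fixing one each time."""
    w = [1.0] * len(groups[0])
    u = [1.0] * len(groups[0][0])
    for _ in range(rounds):
        # With u fixed, the model is linear in w over features X u.
        w = linear_svm([matvec(X, u) for X in groups], labels, w)
        # With w fixed, the model is linear in u over features X^T w.
        u = linear_svm([vecmat(w, X) for X in groups], labels, u)
    return w, u

# Toy data: one positive and one negative group matrix (hypothetical).
groups = [[[1.0, 0.0], [0.0, 0.0]],
          [[0.0, 0.0], [0.0, 1.0]]]
labels = [1, -1]
w, u = train_bilinear(groups, labels)
```

The key point is that each subproblem is an ordinary linear classification problem, so any off-the-shelf linear SVM solver could replace `linear_svm` here.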
2.3.3 Pairwise Decomposition Approximation
With sufficient training data, we can train a bilinear classifier directly using Eq. 5. This training method, however, does not scale well with the number of views due to the high dimensional tensor representation, leading to serious computational and overfitting issues.
To overcome these issues, we propose an approximate pairwise decomposition method, as illustrated in Fig. 3(b), to reduce the parameter space. This is based on the conditional independence assumption in multi-view learning [1]. Accordingly, we can rewrite our group membership score in Eq. 2 as follows:
(6)  
where indicates how importantly the pair of views and contribute to GMP. In this way, the number of parameters that need to be learned in our method is significantly reduced from to .
Let be the pairwise visual word matrix between views and , where . Also let , and . Then based on Eq. 6, we can rewrite Eq. 4 as follows:
(7) 
where denotes the entry in for the view pair.
To learn our model parameters in Eq. 7, we propose two learning methods as follows, namely, multi-view training and double-view training:
Multi-view training:  
(8)  
Double-view training:  
(9) 
where if in group the labels of the two persons holds; otherwise, 0. Here, denotes an element-wise operator. Both training problems can be solved using alternating optimization with a standard SVM solver, and convergence to a local optimum is still guaranteed. For two-view scenarios, the two training methods are essentially identical. In general, both scale quadratically with the number of views; linear scalability is also possible if we organize all the views as cycle graphs. The difference between the two training methods lies in the loss functions: multi-view training measures the group (multi-view) loss, while double-view training measures the pair-view loss. Our algorithm is summarized in Alg. 1.
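To illustrate the scaling argument above, the sketch below enumerates view pairs for a complete graph and for a cycle graph, and combines per-pair scores into one group score. It assumes the group score is a weighted sum of pairwise scores; the function names and the toy weights are hypothetical and chosen only for this example.

```python
from itertools import combinations

def complete_pairs(num_views):
    """All unordered view pairs: M * (M - 1) / 2 of them."""
    return list(combinations(range(num_views), 2))

def cycle_pairs(num_views):
    """Pairs along a cycle over the views: M of them for M >= 3."""
    if num_views == 2:
        return [(0, 1)]
    return [tuple(sorted((i, (i + 1) % num_views)))
            for i in range(num_views)]

def group_score(pair_scores, pair_weights):
    """Weighted sum of pairwise scores (assumed form of the decomposition)."""
    return sum(pair_weights[p] * s for p, s in pair_scores.items())

# Three views: each pair contributes its own score with its own weight.
pair_scores = {(0, 1): 1.5, (0, 2): -0.5, (1, 2): 2.0}
pair_weights = {(0, 1): 0.5, (0, 2): 0.25, (1, 2): 0.25}
total = group_score(pair_scores, pair_weights)
```

With four views the complete graph yields six pairs while the cycle yields four, which is the quadratic versus linear growth noted in the text.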
3 Experiments
We evaluate our method on person ReID and kinship verification tasks against state-of-the-art methods on benchmark datasets. Standard training/testing protocols are used in all experiments. For each compared method, we either cite the original results from the papers (denoted by in the tables) or compute them from released code. Our results are reported as the average over 3 trials.
For each experiment, we choose the same or similar low-level features as the other methods (see the details in each subsection) for fair comparison. We densely sample the images to generate a low-level local feature per pixel. Then we use K-Means to build the visual vocabularies with about randomly selected features per view. Further, every local feature is quantized into one of these visual words based on Euclidean distance. Note that more complicated feature selection methods may be employed to yield better performance, but we do not fine-tune this component for the sake of computational efficiency and generalization ability.
We employ the chessboard distance for Eq. 3 and LIBLINEAR [7] as our SVM solver with hinge loss. We randomly generate about training samples to learn model parameters ’s. The regularization parameters are determined by cross-validation.
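The quantization step in this setup, assigning each local feature to its nearest visual word under Euclidean distance, can be sketched as follows. The vocabulary here is a hand-written toy stand-in for K-Means cluster centers, and the function names are hypothetical.

```python
def nearest_word(feature, vocabulary):
    """Index of the visual word closest to `feature` in Euclidean distance."""
    best_idx, best_dist = 0, float("inf")
    for idx, center in enumerate(vocabulary):
        # Squared distance gives the same ordering as the distance itself.
        dist = sum((f - c) ** 2 for f, c in zip(feature, center))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def quantize(features, vocabulary):
    """Map every local feature to a visual-word index."""
    return [nearest_word(f, vocabulary) for f in features]

# Toy vocabulary of three 2-D "cluster centers" (assumed, not learned).
vocabulary = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.2, 0.1], [0.9, 1.2], [4.0, 6.0]]
words = quantize(features, vocabulary)
```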
3.1 Person Reidentification
For performance measure we adopt the standard Cumulative Match Characteristic (CMC) curve, which displays the recognition rate as a function of rank. The recognition rate at rank is the proportion of queries correctly matched to a corresponding gallery entity at rank or better.
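For concreteness, the CMC computation defined above can be sketched as follows, assuming each query has exactly one correct gallery entry and that higher scores mean better matches. The function and argument names are hypothetical.

```python
def cmc_curve(scores, true_gallery, max_rank):
    """Cumulative Match Characteristic curve.

    scores[q][g] is the similarity between query q and gallery entry g;
    true_gallery[q] is the index of the correct gallery entry for query q.
    Returns the recognition rate at ranks 1..max_rank.
    """
    hits = [0] * max_rank
    for q, row in enumerate(scores):
        target = row[true_gallery[q]]
        # Rank of the true match: 1 + number of entries scored strictly higher.
        rank = 1 + sum(1 for s in row if s > target)
        for r in range(rank - 1, max_rank):
            hits[r] += 1
    return [h / len(scores) for h in hits]

# Three queries against a three-entry gallery.
scores = [[0.9, 0.1, 0.3],
          [0.2, 0.4, 0.8],
          [0.5, 0.6, 0.7]]
true_gallery = [0, 1, 2]
cmc = cmc_curve(scores, true_gallery, max_rank=3)
```

In this toy case the first and third queries are matched at rank 1 and the second at rank 2, so the curve rises from 2/3 at rank 1 to 1 at rank 2.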
For tasks with multiple camera views, we follow [5] to compare results under two camera views. Consider the results from multiple views as a high-dimensional tensor, with one dimension per view. To predict pairwise matches from multi-view results (e.g., identifying matches between camera view 1 and view 2 from the predicted results for views 1, 2, and 3 jointly), we can either sum over or take the maximum over the extra dimensions. Cross-validation is used to choose the better option for each dataset.
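The two reduction options just described can be sketched for the three-view case as follows. The nested-list layout `scores[a][b][c]` and the function name are assumptions made for this example.

```python
def pairwise_from_three_views(scores, reduce="sum"):
    """Collapse the third view of a joint score tensor.

    scores[a][b][c] is the joint score for candidate a in view 1,
    b in view 2 and c in view 3. Returns a matrix indexed by (a, b).
    """
    if reduce not in ("sum", "max"):
        raise ValueError("reduce must be 'sum' or 'max'")
    op = sum if reduce == "sum" else max
    return [[op(cell) for cell in row] for row in scores]

# Two candidates per view.
scores = [[[1, 2], [3, 0]],
          [[0, 5], [1, 1]]]
by_sum = pairwise_from_three_views(scores, "sum")
by_max = pairwise_from_three_views(scores, "max")
```

Summing rewards pairs that are consistently supported across the extra view, while the maximum keeps only the single best supporting candidate, which may explain why the better choice can differ between datasets.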
3.1.1 Two Camera Views
Rank  1  5  10  15  20  25 

VIPeR  
SCNCD [31]  20.7  47.2  60.6  68.8  75.1  79.1 
SCNCD [31]  37.8  68.5  81.2  87.0  90.4  92.7 
LADF [19]  29.3  61.0  76.0  83.4  88.1  90.9 
Midlevel filters [36]  29.1  52.3  65.9  73.9  79.9  84.3 
Midlevel+LADF [36]  43.4  73.0  84.9  90.9  93.7  95.5 
VWCooC [32]  30.70  62.98  75.95  81.01     
Ours  33.5  59.5  72.8  81.3  88.0  89.6 
CUHK01  
Singleshot LAFT [18]  25.8  55.0  66.7  73.8  79.0  83.0 
Multishot LAFT [18]  31.4  58.0  68.3  74.0  79.0  83.0 
Midlevel filters [36]  34.3  55.1  65.0  71.0  74.9  78.0 
VWCooC [32]  44.03  70.47  79.12  84.77     
Ours  60.39  82.92  90.43  93.42  94.55  95.78 
Person ReID between two views is the simplest scenario. We test our method on the VIPeR [13] and CUHK Campus [35] datasets. We extract a 672-dim Color+SIFT vector from each 55 pixel patch in images as low-level features, using the code downloaded from https://github.com/Robert0812/salience_match. We follow the experimental setting in [35] for both datasets.
Our comparison results are listed in Table 1. As we see, on VIPeR “Midlevel+LADF” from [36] is the current best method, which utilizes more discriminative mid-level filters as features together with a powerful classifier, and “SCNCD” from [31] is the second best, which utilizes only foreground features. Our results are comparable to both of them. Moreover, our method significantly outperforms the original versions of these methods when either the powerful classifier or the foreground information is not involved. On CUHK01, our method performs the best. At rank-1, it outperforms [32, 36] by 16.36 and 26.09 percentage points, respectively. Compared with [32], the improvement mainly comes from the multiple instance setting of our method.
The CMC curve comparison on VIPeR and CUHK01 is shown in Fig. 4. As we see, our curve is very similar to that of LADF. This is mainly because LADF is a second-order (quadratic) decision function based on metric learning, which shares some commonality with our classifiers.
We also demonstrate the impact of different numbers of pixel locations (view-shared space) and visual words (view-specific space) on performance using VIPeR in Fig. 5. We sample the pixel locations with a step of 1 to 5 pixels along the x- and y-axes in images (a larger step leads to fewer samples), while using different numbers of visual words. Visual words capture the variations in appearance, and with more visual words more similar patterns can be differentiated (e.g., pink and red). Matching between pixel locations gives us the statistical information of visual words, and more samples make the statistics more robust. Together they contribute to good performance.
3.1.2 Three Camera Views
Now we consider three camera views, and test our method on the WARD dataset [24]. Following [5], we denote the camera views as view 1, 2 and 3. However, for pairwise view matching, [5] did not specify which view serves as probe or gallery. Here, we define the view with the smaller number of data samples as the gallery set and the view with the larger number as the probe set. We randomly select 35 people for training, and use the rest for testing.
We first resize each image to the same pixels, and take every pixel patch in the HSV color space to generate our lowlevel features by concatenating entries into a vector. The reason for choosing this feature is because in [5] the features were built in the HSV color space as well. Different from [5], we take the whole image to generate features without foreground segmentation.
The results are shown in Fig. 6. As we see, our method performs similarly to or better than NCR [5], and the curves of multi-view training and double-view training for our method behave very similarly. We list the area under curve (AUC) scores in Table 2. On average, our method is better than NCR on FT by 0.6%, improving from 94.3% to 94.9%.
View pair  12  13  23  Ave. 

FT  93.3  91.0  94.9  93.1 
NCR on ICT  90.4  84.8  91.1  88.7 
NCR on FT  95.4  91.9  95.6  94.3 
Ours: Multiview  94.4  92.1  98.1  94.9 
Ours: Doubleview  92.7  91.0  97.5  93.8 
3.1.3 Four Camera Views
Next we consider four camera views, and test our method on the Re-identification Across indoor-outdoor Dataset (RAiD) [5] with two indoor views, camera 1 and 2, and two outdoor views, camera 3 and 4. Again we take the views with smaller/larger numbers of data samples as galleries/probes. We follow [5], and utilize the same HSV low-level feature as in Section 3.1.2.
Our comparison results are shown in Fig. 7. As we see, our method again performs as well as or better than NCR. We list the AUC score comparison results in Table 3. On average, our method is again better than NCR on FT, by 1.6%, improving from 94.7% to 96.3%.
For both indoor-indoor and outdoor-outdoor cases, our method consistently works best, which may indicate that the visual word co-occurrence patterns are more discriminative if the lighting condition is similar.
View pair  12  13  14  23  24  34  Ave. 

FT  96.6  84.3  88.8  90.0  93.9  93.5  91.2 
NCR on ICT  98.5  90.6  92.1  91.0  94.4  94.1  93.4 
NCR on FT  98.1  90.4  93.1  94.5  96.5  95.9  94.7 
Ours: Multiview  98.2  93.0  97.1  94.1  96.6  90.5  94.9 
Ours: Doubleview  99.3  90.8  98.3  93.0  98.0  98.8  96.3 
3.2 Kinship Verification & Identification
As before, we utilize the 12-dim HSV low-level features. In the experiments, we denote father, mother, son, and daughter as F, M, S, and D, respectively. Following [23], we measure the verification performance with the verification rate, defined as the number of correctly classified face pairs divided by the total number of face pairs in the test set. For identification, CMC curves are also used. We only use double-view training in this task since the information captured by parent-offspring pairs is more important.
Kinship verification between two views (one parent and one offspring) is the conventional setting, where we test our method on two datasets, KinFaceW-I [23] and KinFaceW-II [23]. The former consists of 156 FS, 134 FD, 116 MS and 127 MD pairs, while the latter contains 250 pairs of each kin relation. The main difference between the two datasets is that each pair of face images in KinFaceW-II comes from the same photo, while the image pairs in KinFaceW-I come from different photos. We follow the same protocol as in [6, 23, 28] and use 5-fold cross-validation with balanced positive and negative pairs on the default training/testing split. Results are listed in Table 4.
On KinFaceW-II, our method significantly outperforms the competitors, but on KinFaceW-I ours is worse. Our reasoning is that our current visual word representation using simple K-Means does not account for significant visual ambiguity in appearance when imaging factors (e.g., lighting conditions, illumination) change substantially. This leads to large intra-cluster variations in visual words that our method does not currently handle well. To further investigate the different performances on the two datasets, we randomly sample a smaller training set from KinFaceW-II such that it has the same size as that of KinFaceW-I, while keeping the same test set, and record the results as “reduced training set”. The results become slightly worse than with the original training set, while still outperforming the other methods. These relatively good results, along with the worse results on KinFaceW-I, suggest that the size of training data is indeed important, but less important than the data sources.
•  FS  FD  MS  MD  Mean 

KinFaceW-I  
Dehghan [6]  76.4  72.5  71.9  77.3  74.5 
Lu [23]  72.5  66.5  66.2  72.0  69.9 
Qin [28]  76.8  76.8  74.6  78.0  76.6 
Ours  63.5  65.0  63.8  75.6  67.0 
KinFaceW-II  
Dehghan [6]  83.9  76.7  83.4  84.8  82.2 
Lu [23]  76.9  74.3  77.4  77.6  76.5 
Qin [28]  84.6  77.0  84.4  85.4  82.9 
Ours  85.4  81.8  86.6  90.0  86.0 
Ours (reduced training set)  84.4  78.2  84.6  87.8  83.8 
Next we use the TSKinFace dataset [28] for three-view kinship verification (father, mother, offspring), which contains 513 FMS and 502 FMD groups. Following [28], we carry out 5-fold cross-validation with balanced positive and negative samples, and list the results in Table 5. As we see, our method performs consistently better than [28].
FS  FD  MS  MD  FMS  FMD  

Dehghan [6]  79.9  74.2  78.5  76.3  81.9  79.6 
Fang [8]  69.1  66.8  68.7  67.9  71.6  69.8 
Lu [23]  74.8  70.0  72.2  71.3  77.0  71.4 
Qin [28]  83.0  80.5  82.8  81.1  86.4  84.4 
Ours  88.5  87.0  87.9  87.8  90.6  89.0 
Finally we employ the Family 101 dataset [8] to investigate kinship identification, namely, identifying the correct parent/child among a set of candidates given one child/parent image. This dataset contains 14816 images that form 206 nuclear families belonging to 101 unique family trees. Following [6], we adopt 101 nuclear families and use 50 families for training and 51 families for testing. For each of the four kin relations, we train a model and use the model to match offspring images to all possible parent images. The CMC curves (produced using the author’s code, http://enriquegortiz.com/publications/FamResemblance.zip) are shown in Fig. 8, and Table 6 lists the Area Under Curve (AUC) measure of the CMC curves.
•  FS  FD  MS  MD  Mean 

Dehghan [6]  88.8  91.3  94.3  96.4  92.7 
Ours  90.3  94.6  96.0  97.0  94.5 
3.3 Storage & Computational Time
Storage ( for short) and computational time during testing are two critical issues in real-world applications. In our method, we only need to store a feature matrix for each entity based on Eq. 3, which is used to calculate similarities between different entities. The computational time can be roughly divided into two parts: (1) computing feature matrices , and (2) predicting group membership . We do not consider the time for generating low-level features, since different implementations vary significantly.
We record the storage and computational time using 300 visual words for both probe and gallery sets on VIPeR (two views), WARD (three views), and RAiD (four views). The rest of the parameters are the same as described in Section 3.1. As we see, the storage per data sample and computational time are linearly proportional to the size of images and number of visual words. Our implementation is based on unoptimized MATLAB code, which is available at https://zimingzhang.wordpress.com/sourcecode/. Numbers are listed in Table 7, including the time for saving and loading features. Our experiments were all run on a multi-threaded CPU (Xeon E5-2696 v2) with a GPU (GTX TITAN). The method runs efficiently with very low demand for storage.
(Kb)  (ms)  (ms)  

VIPeR  110.7  52.9  0.6 
WARD  113.7  99.7  1.5 
RAiD  166.5  68.7  0.5 
4 Conclusion
In this paper, we propose a general parametric probability model for the group membership prediction (GMP) problem. We introduce the notions of view-specific and view-shared latent variables to capture visual information and commonality for each view. Using these two variables, we can factorize the group membership score into a tensor product, and thus propose a new visual word co-occurrence tensor feature to represent groups of data samples. Our parametric probability model can handle the multiple instance cases as well. Further, we propose discriminatively learning a bilinear classifier for GMP, with the decision function as the marginalization over all latent variables. Our experiments on multi-camera person ReID and kinship verification tasks demonstrate the good predictive ability and computational efficiency of our method. As future work, we would like to explore other applications for our method, such as activity retrieval [4], and develop new approaches, such as zero-shot recognition [34] and structured learning [33], for our problem.
Acknowledgement
We thank the anonymous reviewers for their very useful comments. This material is based upon work supported in part by the U.S. Department of Homeland Security, Science and Technology Directorate, Office of University Programs, under Grant Award 2013-ST-061-ED0001, by ONR Grant 50202168 and US AF contract FA8650-14-C-1728. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. DHS, ONR or AF.
References
 [1] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.
 [2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In CVPR, pages 1365–1372, 2009.
 [3] Z. Cai, L. Wang, X. Peng, and Y. Qiao. Multi-view super vector for action recognition. In CVPR, pages 596–603, 2014.
 [4] G. D. Castanon, Y. Chen, Z. Zhang, and V. Saligrama. Efficient activity retrieval through semantic graph queries. In ACM Multimedia, 2015.
 [5] A. Das, A. Chakraborty, and A. K. Roy-Chowdhury. Consistent re-identification in a camera network. In ECCV, pages 330–345, 2014.
 [6] A. Dehghan, E. G. Ortiz, R. Villegas, and M. Shah. Who do I look like? Determining parent-offspring resemblance via gated autoencoders. In CVPR, pages 1757–1764, 2014.
 [7] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
 [8] R. Fang, A. C. Gallagher, T. Chen, and A. C. Loui. Kinship classification by modeling facial feature heredity. In ICIP, pages 2983–2987, 2013.
 [9] R. Fang, K. D. Tang, N. Snavely, and T. Chen. Towards computational models of kinship verification. In ICIP, pages 1577–1580, 2010.
 [10] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, pages 2360–2367, 2010.
 [11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9):1627–1645, 2010.
 [12] D. Figueira, L. Bazzani, H. Q. Minh, M. Cristani, A. Bernardino, and V. Murino. Semi-supervised multi-feature learning for person re-identification. In AVSS, pages 111–116, 2013.
 [13] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In PETS, pages 41–47, 2007.
 [14] M. M. Kalayeh, H. Idrees, and M. Shah. NMF-KNN: Image annotation using weighted multi-view non-negative matrix factorization. In CVPR, 2014.
 [15] S. Khamis, C.-H. Kuo, V. K. Singh, V. Shet, and L. S. Davis. Joint learning for attribute-consistent person re-identification. In ECCV Workshop on Visual Surveillance and Re-Identification, pages 134–146, 2014.
 [16] M. Kostinger, M. Hirzer, P. Wohlhart, P. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, pages 2288–2295, 2012.
 [17] L.-J. Li, R. Socher, and F.-F. Li. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, pages 2036–2043, 2009.
 [18] W. Li and X. Wang. Locally aligned feature transforms across views. In CVPR, pages 3594–3601, 2013.
 [19] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith. Learning locally-adaptive decision functions for person verification. In CVPR, pages 3610–3617, 2013.
 [20] C. Liu, S. Gong, C. C. Loy, and X. Lin. Person re-identification: What features are important? In ECCV, pages 391–401, 2012.
 [21] J. Liu, Y. Jiang, Z. Li, Z.-H. Zhou, and H. Lu. Partially shared latent factor learning with multiview data. NNLS, 2014.
 [22] J. Lu, J. Hu, V. E. Liong, X. Zhou, A. Bottino, I. U. Islam, T. F. Vieira, X. Qin, X. Tan, Y. Keller, et al. The FG 2015 kinship verification in the wild evaluation. In FG, 2015.
 [23] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, and J. Zhou. Neighborhood repulsed metric learning for kinship verification. PAMI, 36(2):331–345, 2014.
 [24] N. Martinel and C. Micheloni. Re-identify people in wide area camera network. In CVPR Workshops, pages 31–36, 2012.
 [25] A. Mignon and F. Jurie. PCCA: A new approach for distance learning from sparse pairwise constraints. In CVPR, pages 2666–2672, 2012.
 [26] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian. Local Fisher discriminant analysis for pedestrian re-identification. In CVPR, pages 3318–3325, 2013.
 [27] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Bilinear classifiers for visual recognition. In NIPS, pages 1482–1490, 2009.
 [28] X. Qin, X. Tan, and S. Chen. Tri-subject kinship verification: Understanding the core of a family. arXiv:1501.02555, 2015.
 [29] Y. Song, L.-P. Morency, and R. Davis. Multi-view latent variable discriminative models for action recognition. In CVPR, pages 2120–2127, 2012.
 [30] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. arXiv:1304.5634, 2013.
 [31] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li. Salient color names for person re-identification. In ECCV, pages 536–551, 2014.
 [32] Z. Zhang, Y. Chen, and V. Saligrama. A novel visual word co-occurrence model for person re-identification. In ECCV Workshop on Visual Surveillance and Re-Identification, pages 122–133, 2014.
 [33] Z. Zhang and V. Saligrama. PRISM: Person re-identification via structured matching. arXiv:1406.4444, 2014.
 [34] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.
 [35] R. Zhao, W. Ouyang, and X. Wang. Person re-identification by salience matching. In ICCV, pages 2528–2535, 2013.
 [36] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, pages 144–151, 2014.
 [37] W.-S. Zheng, S. Gong, and T. Xiang. Reidentification by relative distance comparison. PAMI, 35(3):653–668, 2013.
 [38] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, pages 2879–2886, 2012.