Sample Selection and Preprocessing
For this experiment, I used grayscale digital photographs of the students, staff, and faculty of the Astronomy Department at the University of Texas at Austin. This 176-photo data set contains a reasonable mixture of males and females (65% male, 35% female) and represents a wide range of ages, from undergraduates to senior faculty. The pictures were also taken with the same camera and a uniform configuration. Additionally, many of the faces in this collection contain the confounding factors of facial hair and/or glasses. A detailed characterization of the sample is provided below.
The images are high-resolution (1050x1500 pixels) and capture each person from the chest up in a "school-portrait" style. I preprocessed the images to extract just the face region using my own interactive tool built with python+matplotlib. The tool worked by displaying the original image and then extracting a 500x500 pixel slice centered at a specified point. This point was roughly at each person's nose, although the exact position varied depending on the orientation of the face. The 500x500 pixel slice size was settled on through experimentation as the choice that gave the best results. This tool is demonstrated below for my own picture in the collection.
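To give a flavor of how such a tool can work, here is a minimal sketch in Python with matplotlib: display the image, record a single mouse click near the nose, and slice out a 500x500 region around that point. The function name and file paths are illustrative placeholders, not the original code, and it assumes the photos load as 2-D grayscale arrays.

```python
# Sketch of an interactive face-cropping tool (illustrative, not the original code).
# Assumes the portraits load as 2-D grayscale arrays; click the nose to center a crop.
import matplotlib.pyplot as plt
from matplotlib.image import imread, imsave

def crop_face(filename, size=500):
    img = imread(filename)            # 1050x1500 grayscale portrait
    fig, ax = plt.subplots()
    ax.imshow(img, cmap='gray')
    ax.set_title('Click on the nose')
    (x, y), = plt.ginput(1)           # one mouse click: (column, row)
    plt.close(fig)
    row, col = int(y), int(x)
    half = size // 2
    return img[row - half:row + half, col - half:col + half]

# face = crop_face('portraits/person_001.png')          # hypothetical path
# imsave('faces/person_001.png', face, cmap='gray')
```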
The resulting 500x500 pixel images contain 250,000 pixels each. That is far too large for my vintage-2004 computer hardware to handle, so I resized the images to a more manageable 50x50 pixels.
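The resizing step can be done with an off-the-shelf resampling routine; the sketch below uses scipy's zoom purely as an illustration of the 500x500 to 50x50 reduction (the original code may well have used a different routine).

```python
# Downsample a 500x500 face crop to 50x50 (illustrative sketch).
from scipy.ndimage import zoom

def downsample(face, out_size=50):
    factor = out_size / face.shape[0]   # 50 / 500 = 0.1
    return zoom(face, factor)           # spline interpolation by default

# small = downsample(face)   # small.shape == (50, 50), i.e. 2,500 pixels
```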
Finally, I randomly shuffle these images into three subsets: a training set, a cross validation set, and a test set, in a 60:20:20% proportion, respectively, with the constraint that the percentage of males and females in each set must match that of the entire parent sample (i.e., 65% male, 35% female). The training and cross validation sets are used to train and tune the machine learning algorithm, and the test set is used to make the final evaluation of gender classification performance.
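A sketch of a stratified shuffle that enforces this constraint is shown below, assuming the gender labels are stored in an array y; the helper is illustrative, not the original code.

```python
# Stratified 60:20:20 split that preserves the 65%/35% male/female ratio
# within each subset (illustrative sketch).
import numpy as np

def stratified_split(X, y, fractions=(0.6, 0.2, 0.2), seed=0):
    rng = np.random.RandomState(seed)
    train, cv, test = [], [], []
    for label in np.unique(y):                  # shuffle males and females separately
        idx = np.where(y == label)[0]
        rng.shuffle(idx)
        n_train = int(round(fractions[0] * len(idx)))
        n_cv = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        cv.extend(idx[n_train:n_train + n_cv])
        test.extend(idx[n_train + n_cv:])
    train, cv, test = map(np.array, (train, cv, test))
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])
```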
Dimensionality Reduction
Now the images are of a practical size, but there is still a problem: each one provides an array of 2,500 raw grayscale intensity values. Feeding these values directly to a classifier is probably not helpful, because skin color alone is independent of gender; I am not convinced the pixel values themselves can convey anything about masculine or feminine facial characteristics, at least without a more advanced analysis of the pixel intensity distributions.
Therefore, I adopt the approach of running a principal component analysis (PCA) to reduce the representation of each image from 2,500 pixels to a smaller number K of values. PCA finds the optimal set of K vectors such that projecting the 2,500-pixel images onto these vectors minimizes the projection error. I determine these vectors from the training set. The precise number K to use is somewhat arbitrary; I choose the first 86 vectors so as to retain "99% of the variance" in the training set images (this means the images can be reconstructed from the first 86 principal components while losing only very little detail). The cross validation and test sets are projected onto these same 86 vectors determined from the training set. Thus, the dimensionality of this problem has been compressed by roughly 97%, from 2,500 values to 86!
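A compact way to carry out this step is via the singular value decomposition of the mean-subtracted training images. The sketch below is illustrative and assumes the flattened 50x50 images are stacked into an array X_train; the variable names are placeholders.

```python
# PCA via SVD on the training images (illustrative sketch).
# X_train: (n_train, 2500) array of flattened 50x50 images.
import numpy as np

def fit_pca(X_train, variance_to_keep=0.99):
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_fraction = np.cumsum(S**2) / np.sum(S**2)
    K = int(np.searchsorted(var_fraction, variance_to_keep)) + 1   # 86 here, per the text
    return mu, Vt[:K]                    # mean image and top-K principal components

def project(X, mu, components):
    return (X - mu) @ components.T       # (n, 2500) -> (n, K)

# mu, comps = fit_pca(X_train)                      # fit on the training set only
# Z_train = project(X_train, mu, comps)             # then project all three sets
# Z_cv, Z_test = project(X_cv, mu, comps), project(X_test, mu, comps)
```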
The Artificial Neural Network (ANN)
ANNs are machine learning algorithms that imitate the functioning of the human brain. A network comprises computational nodes arranged into input, "hidden", and output layers. I chose the simplest model for this experiment: a network with an input layer, one hidden layer, and an output layer. The input layer consists of 86 nodes, one for each component of the image projections, plus one bias unit carrying a fixed value of 1, for a total of 87 nodes. The output layer has two nodes that denote the male and female gender classifications. The number of nodes in the hidden layer is again somewhat arbitrary, yet intuitively, the hidden layer should have fewer nodes than the input layer but more than the output layer. I specified the hidden layer to have 10 nodes plus one bias unit. The implementation of this network was based on the code I wrote as part of successfully completing the Coursera Machine Learning course. A visual representation of the network is shown below.
The value of the k-th node in the hidden and output layers (layers j = 2, 3) is calculated using the sigmoid function via

a_k^(j) = g( Σ_i Θ_ki^(j-1) a_i^(j-1) ),   where g(z) = 1 / (1 + e^(-z)),

and the Θ values represent weights that must be trained before the ANN has any predictive power. Aside from the weights, the ANN carries one additional free parameter (Λ) that controls regularization. This parameter can be adjusted as needed to prevent overfitting or underfitting of the training data.
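The feedforward calculation for this 86-10-2 architecture can be sketched as follows; the weight matrix shapes follow from the layer sizes above, and the trained Θ matrices themselves come from the backpropagation step described in the next section.

```python
# Feedforward pass through the 86-10-2 network described above (sketch).
# Theta1: (10, 87) weights from input (86 features + bias) to hidden layer.
# Theta2: (2, 11)  weights from hidden (10 nodes + bias) to output layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(Theta1, Theta2, Z):
    # Z: (n, 86) PCA-projected images
    n = Z.shape[0]
    a1 = np.hstack([np.ones((n, 1)), Z])     # add bias unit -> (n, 87)
    a2 = sigmoid(a1 @ Theta1.T)              # hidden layer  -> (n, 10)
    a2 = np.hstack([np.ones((n, 1)), a2])    # add bias unit -> (n, 11)
    a3 = sigmoid(a2 @ Theta2.T)              # output layer  -> (n, 2)
    return a3.argmax(axis=1)                 # index of the winning class
```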
Training Methodology and Results
Using the training set data, I optimized the weights in the ANN with the backpropagation algorithm by minimizing a merit function known as the "cost". Tuning the regularization parameter requires the cross validation set. For several values of Λ spanning 0 (no regularization) to 200 (high regularization), the ANN weights were trained on the training set, and the resulting weights were used to calculate the cost for both the training and cross validation sets. The classification accuracy (percentage of correctly classified faces) obtained from this exercise ranged from 71.6% to 100% for the training set and from 64.7% to 85.3% for the cross validation set. The best Λ corresponds to the minimum cost for the cross validation set; in this case, the best solution for the ANN comes from training the network with Λ=8. The figure below shows how the cost varies with Λ.
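The model-selection loop itself is straightforward. In the sketch below, train_ann() and cost() are hypothetical stand-ins for the backpropagation training and cost routines, and the grid of Λ values is only an example spanning the range quoted above.

```python
# Sketch of the regularization sweep: train at several values of lambda and
# keep the one that minimizes the cross validation cost. train_ann() and
# cost() are hypothetical placeholders for the actual routines.
lambdas = [0, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 200]     # example grid
results = []
for lam in lambdas:
    Theta1, Theta2 = train_ann(Z_train, y_train, lam)    # minimize regularized cost
    J_train = cost(Theta1, Theta2, Z_train, y_train, lam=0)
    J_cv = cost(Theta1, Theta2, Z_cv, y_cv, lam=0)       # compare without the penalty term
    results.append((lam, J_train, J_cv))

best_lam = min(results, key=lambda r: r[2])[0]           # lambda = 8 in this experiment
```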
With the ANN now fully trained, the final evaluation of performance is made with the test set. The test set consists of 26 males and 14 females. A total of 37/40 (92.5%) faces were correctly classified as male or female, which is comparable to the accuracy achieved by human classifiers. Of the three misclassifications, two were false positives (males classified as females), and the other was a false negative (a female classified as male). Treating "female" as the positive class, this corresponds to a precision of 13/15 ≈ 87% and a recall of 13/14 ≈ 93%, so the ANN achieves both high precision and high recall, which is the ideal outcome for a machine learning classifier.
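For completeness, these test-set metrics can be computed as follows, using the convention that "female" is the positive class; the label encoding (1 for female, 0 for male) is an arbitrary choice made for this sketch.

```python
# Test-set evaluation: accuracy, precision, and recall with "female" (label 1)
# as the positive class (illustrative sketch).
import numpy as np

def evaluate(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))   # females correctly classified
    fp = np.sum((y_pred == 1) & (y_true == 0))   # males classified as female
    fn = np.sum((y_pred == 0) & (y_true == 1))   # females classified as male
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# acc, prec, rec = evaluate(y_test, predict(Theta1, Theta2, Z_test))
```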