Nature. 2025 Dec;648(8092):97-108.
doi: 10.1038/s41586-025-09716-2. Epub 2025 Nov 5.

Fair human-centric image dataset for ethical AI benchmarking

Alice Xiang et al. Nature. 2025 Dec.

Abstract

Computer vision is central to many artificial intelligence (AI) applications, from autonomous vehicles to consumer devices. However, the data behind such technical innovations are often collected with insufficient consideration of ethical concerns [1-3]. This has led to a reliance on datasets that lack diversity, perpetuate biases and are collected without the consent of data rights holders. These datasets compromise the fairness and accuracy of AI models and disenfranchise stakeholders [4-8]. Although awareness of the problems of bias in computer vision technologies, particularly facial recognition, has become widespread [9], the field lacks publicly available, consensually collected datasets for evaluating bias for most tasks [3,10,11]. In response, we introduce the Fair Human-Centric Image Benchmark (FHIBE, pronounced 'Feebee'), a publicly available human image dataset implementing best practices for consent, privacy, compensation, safety, diversity and utility. FHIBE can be used responsibly as a fairness evaluation dataset for many human-centric computer vision tasks, including pose estimation, person segmentation, face detection and verification, and visual question answering. By leveraging comprehensive annotations capturing demographic and physical attributes, environmental factors, instrument and pixel-level annotations, FHIBE can identify a wide variety of biases. The annotations also enable more nuanced and granular bias diagnoses, enabling practitioners to better understand sources of bias and mitigate potential downstream harms. FHIBE therefore represents an important step forward towards trustworthy AI, raising the bar for fairness benchmarks and providing a road map for responsible data curation in AI.

Conflict of interest statement

Competing interests: Sony Group Corporation, with inventors J.T.A.A. and A.X., has a pending US patent application US20240078839A1, filed on 14 August 2023, that is currently under examination. It covers aspects of the human-centric image dataset specification and annotation techniques that were used in this paper. The same application has also been filed in Europe (application number 23761605.7, filed on 15 January 2025) and China (application number 202380024486.X, filed on 30 August 2024) and the applications are pending.

Figures

Fig. 1
Fig. 1. Annotations about the image subjects, instrument and environment are available for all images in FHIBE.
For visualization purposes, we display one type of metadata per image in this figure. Each annotation is linked to the annotators who made or checked the annotation. If the annotator disclosed their demographic attributes (age, pronouns, ancestry), that information is also provided. A full list of annotations is provided in Supplementary Information A. NA, not applicable.
Fig. 2
Fig. 2. Example FHIBE images annotated with detailed pixel-level annotations, keypoints, segmentation masks and bounding boxes.
Pixel-level annotations include keypoint annotations (small red circles) indicating the geometric structure (white lines) of human bodies and faces (for example, right eye inner, left foot index); segmentation masks dividing the human body and face into segments, assigning a label to each pixel (for example, left arm, jewellery); and face and person bounding boxes (red and blue rectangles, respectively).
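
The pixel-level annotations in Fig. 2 (keypoints with skeleton lines, per-pixel segmentation labels, and face/person bounding boxes) can be overlaid on an image with standard tooling. The Python sketch below is a minimal illustration assuming a hypothetical COCO-style annotation dict ("keypoints", "skeleton", "mask", "face_bbox", "person_bbox"); it is not the released FHIBE schema.

    # Minimal sketch: overlay keypoints, skeleton, mask and boxes on an image.
    # The annotation dict layout below is a hypothetical stand-in, not FHIBE's schema.
    import cv2
    import numpy as np

    def draw_annotations(image_path: str, ann: dict) -> np.ndarray:
        img = cv2.imread(image_path)

        # Person and face bounding boxes (x, y, w, h): blue and red rectangles.
        for bbox, colour in [(ann["person_bbox"], (255, 0, 0)),
                             (ann["face_bbox"], (0, 0, 255))]:
            x, y, w, h = map(int, bbox)
            cv2.rectangle(img, (x, y), (x + w, y + h), colour, 2)

        # Keypoints as small red circles; skeleton edges as white lines.
        kps = np.asarray(ann["keypoints"], dtype=int)         # shape (K, 2)
        for x, y in kps:
            cv2.circle(img, (int(x), int(y)), 3, (0, 0, 255), -1)
        for i, j in ann.get("skeleton", []):                   # index pairs into kps
            cv2.line(img, (int(kps[i][0]), int(kps[i][1])),
                     (int(kps[j][0]), int(kps[j][1])), (255, 255, 255), 1)

        # Segmentation mask: per-pixel integer labels blended over the image.
        mask = np.asarray(ann["mask"])                         # shape (H, W), int labels
        colour_mask = cv2.applyColorMap(
            (mask.astype(np.int32) * 17 % 256).astype(np.uint8), cv2.COLORMAP_JET)
        return cv2.addWeighted(img, 0.7, colour_mask, 0.3, 0)
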
Fig. 3
Fig. 3. Biases in CLIP predictions on FHIBE.
a, Predicted label probabilities (rows) conditioned on ground-truth pronouns (columns) (left); CLIP more often assigns a gender-neutral ‘unspecified’ label to ‘he/him/his’ than to ‘she/her/hers’. Right, gender-classification error rates vary with both pronoun and hairstyle and are lowest for stereotypical pronoun–hairstyle combinations (for example, ‘he/him/his’ with ‘short/no hair’). b, For indoor environments, masking the person increases the accuracy, whereas, for outdoor environments, masking decreases the accuracy. This suggests that CLIP may treat the presence of a person as a spurious cue for outdoor scenes, with the effect being particularly pronounced for individuals of African ancestry. c, Scene type predictions conditioned on ancestry. CLIP is more likely to predict rural environments for images containing individuals of African or Asian ancestry. The numbers on each bar denote the group size (bottom) and the corresponding probability estimate (top), indicating that perceived rural associations are stronger for these groups.
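
As a rough illustration of the kind of probing behind Fig. 3, the sketch below computes zero-shot CLIP label probabilities per image and averages them per annotated pronoun group. The checkpoint, prompt templates and candidate labels are illustrative assumptions, not the paper's exact protocol.

    # Zero-shot CLIP probing conditioned on an annotated attribute (sketch only).
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Illustrative candidate labels; the paper's label set may differ.
    candidate_labels = ["a photo of a man", "a photo of a woman", "a photo of a person"]

    def label_probabilities(image: Image.Image) -> torch.Tensor:
        """Softmax probabilities over the candidate labels for one image."""
        inputs = processor(text=candidate_labels, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.logits_per_image.softmax(dim=-1).squeeze(0)

    def probabilities_by_pronoun(dataset):
        """dataset: hypothetical iterable of (PIL image, pronoun string) pairs."""
        by_group = {}
        for image, pronoun in dataset:
            by_group.setdefault(pronoun, []).append(label_probabilities(image))
        return {g: torch.stack(ps).mean(dim=0) for g, ps in by_group.items()}
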
Fig. 4
Fig. 4. BLIP-2 analysis results.
Summary of the gender, occupation, ancestry and toxic response analyses. a, Responses to non-gendered likeability prompts show implicit gender attribution. b, Pronoun predictions are more accurate for ‘he/him/his’ than ‘she/her/hers’, which exhibits a fivefold higher error rate. c,d, Neutral prompts about occupations highlight stereotypical associations, revealing gender-based (c) and ancestry-related (d) stereotypes. e–g, Negatively framed prompts elicit toxic responses linked to pronouns, skin tone and ancestry, with toxic gender-related responses (e), skin-tone-related responses (f) and ancestry-related responses (g). The numbers on each bar indicate the group size (bottom) and probability estimate (top).
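
The BLIP-2 analyses summarized in Fig. 4 amount to asking the model neutral or negatively framed questions and grouping its free-text answers by annotated attributes. A minimal sketch using the Hugging Face BLIP-2 API follows; the checkpoint and prompt wording are illustrative assumptions.

    # Prompting BLIP-2 with attribute-neutral questions (sketch only).
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b", torch_dtype=dtype).to(device)

    def ask(image: Image.Image, question: str) -> str:
        """Generate a short free-text answer for one image/question pair."""
        inputs = processor(images=image, text=f"Question: {question} Answer:",
                           return_tensors="pt").to(device, dtype)
        out = model.generate(**inputs, max_new_tokens=20)
        return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

    # Illustrative probes in the spirit of Fig. 4 (not the paper's exact prompts):
    # ask(img, "What is this person's occupation?")
    # ask(img, "What pronouns does this person use?")
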
Fig. 5
Fig. 5. Dataset comparison based on bounding box, segmentation mask and keypoint properties.
a, The bounding box (BBox) area to image area ratio; larger values indicate larger bounding boxes, suggesting that subjects are closer to the camera (left). Middle left, the face bounding box area to image area ratio; larger values indicate that subjects are closer to the camera. Only FHIBE and COCO were compared as the other datasets lack relevant labels. Middle right, the bounding box width to height ratio; values of <1 suggest that subjects are in vertical positions. Right, the normalized distance between the bounding box centre and image centre; smaller values indicate that the subjects are more centred. b, Person bounding box centre distributions. The centres are normalized by the image size to be in [0, 1]. FHIBE subjects are the most centred ones, with COCO and FACET demonstrating the largest spatial coverage. c, Person segmentation mask concavity, defined as 1 − (mask area/convex hull area); higher values denote increased mask complexity (left). Right, person segmentation mask area to image area ratio; larger values indicate that subjects are closer to the camera (more detailed masks). Note that non-person categories are ignored. d, The average Euclidean distance between keypoint pairs; a greater distribution spread indicates a higher spatial coverage (left). Middle, heat map of FHIBE keypoint locations, showing a canonical shape with keypoints concentrated around standing humans centred in the image, with red density likely representing facial keypoints. Right, heat map of COCO keypoint locations, displaying a less canonical distribution, with keypoints more uniformly dispersed across the image, suggesting the presence of humans in diverse locations.
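
The statistics in Fig. 5 are simple geometric summaries of boxes, masks and keypoints. The sketch below computes them with NumPy/OpenCV; the annotation formats (xywh boxes, binary masks, (K, 2) keypoint arrays) and the concavity definition used here are assumptions for illustration.

    # Geometric dataset statistics in the spirit of Fig. 5 (sketch only).
    import cv2
    import numpy as np

    def bbox_stats(bbox, img_w, img_h):
        """Area ratio, width/height ratio and normalized centre distance for an xywh box."""
        x, y, w, h = bbox
        area_ratio = (w * h) / (img_w * img_h)
        aspect = w / h
        cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h    # centre in [0, 1]
        return area_ratio, aspect, float(np.hypot(cx - 0.5, cy - 0.5))

    def mask_concavity(mask: np.ndarray) -> float:
        """Concavity as 1 - mask area / convex hull area (assumed definition)."""
        mask_u8 = (mask > 0).astype(np.uint8)
        contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        points = np.vstack([c.reshape(-1, 2) for c in contours])
        hull_area = cv2.contourArea(cv2.convexHull(points))
        return 1.0 - float(mask_u8.sum()) / max(hull_area, 1.0)

    def mean_keypoint_distance(kps: np.ndarray) -> float:
        """Average Euclidean distance over all keypoint pairs; kps has shape (K, 2)."""
        d = np.sqrt(((kps[:, None, :] - kps[None, :, :]) ** 2).sum(-1))
        return float(d[np.triu_indices(len(kps), 1)].mean())
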
Extended Data Fig. 1
Extended Data Fig. 1. Distribution of subjects associated with key attributes in FHIBE.
This figure shows the distribution of subjects corresponding to key attributes in the FHIBE dataset. Some subjects may have multiple annotated labels for specific attributes, resulting in variations in the total sample count across attributes. In compliance with the IRB protocol, certain sensitive attributes are not publicly released, as detailed in Supplementary Information A. For transparency, the aggregated distribution of key sensitive attributes is presented. While a few extreme outliers are observed in the self-reported weight and height values, these do not significantly affect the overall distribution.
Extended Data Fig. 2
Extended Data Fig. 2. Proportional distribution of subjects for pronoun, age, and apparent skin colour across FHIBE and other datasets.
This figure compares the proportional distribution of subjects for (a) pronoun, (b) age, and (c) apparent skin colour attributes in FHIBE and other datasets used in this paper. Original attribute labels are preserved. Datasets lacking a specific attribute are excluded from the corresponding subfigure. Note that some subjects may have multiple annotated labels for specific attributes, resulting in variations in the total sample count across attributes.
Extended Data Fig. 3
Extended Data Fig. 3. Feature importance for face detection.
This figure shows feature importance scores extracted from random forest models for two face detection methods: (a) RetinaFace and (b) MTCNN. Features are ranked from most to least important, and the elbow method was applied to select the top-K attributes (K = 5 for RetinaFace, K = 4 for MTCNN) for use in decision tree models.
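
Extended Data Figs. 3 and 5 follow the same recipe: fit a random forest that predicts per-image detection success from the annotated attributes, rank the resulting feature importances, and keep the top-K attributes at the elbow of the importance curve. A minimal sketch, assuming a one-hot attribute matrix and a simple largest-drop elbow heuristic:

    # Feature-importance ranking with an elbow cut-off (sketch only).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def top_k_attributes(X, y, feature_names):
        """X: (n_samples, n_features) attribute matrix; y: 1 if detected, else 0."""
        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
        order = np.argsort(rf.feature_importances_)[::-1]
        imp = rf.feature_importances_[order]
        # Simple elbow heuristic: cut at the largest drop between consecutive importances.
        k = int(np.argmax(imp[:-1] - imp[1:])) + 1
        return [feature_names[i] for i in order[:k]], imp[:k]
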
Extended Data Fig. 4
Extended Data Fig. 4. Decision tree models for face detection.
This figure illustrates decision tree models for two face detection methods: (a) RetinaFace and (b) MTCNN. The models highlight key attributes predictive of face detection performance. Notably, attributes such as baldness have strong correlations with gender.
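
The decision trees in Extended Data Figs. 4 and 6 are then fit only on the selected top-K attributes so that the splits remain small and interpretable. A minimal sketch with scikit-learn; the depth limit is an illustrative choice:

    # Interpretable decision tree over the selected attributes (sketch only).
    from sklearn.tree import DecisionTreeClassifier, export_text

    def fit_interpretable_tree(X_top, y, top_feature_names, max_depth=3):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X_top, y)
        # Print human-readable rules, e.g. splits on baldness or hairstyle attributes.
        print(export_text(tree, feature_names=top_feature_names))
        return tree
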
Extended Data Fig. 5
Extended Data Fig. 5. Feature importance for person detection.
This figure shows feature importance scores extracted from random forest models for two person detection methods: (a) Faster R-CNN and (b) Deformable DETR. Features are ranked from most to least important. The elbow method was applied to select the top-K attributes (K = 5 for Faster R-CNN, K = 6 for Deformable DETR) for use in decision tree models. Lighting refers to the direction of light on the head/face.
Extended Data Fig. 6
Extended Data Fig. 6. Decision tree models for person detection: Faster R-CNN and Deformable DETR.
This figure illustrates decision tree models for person detection using (a) Faster R-CNN and (b) Deformable DETR. Notably, subject interactions, such as hugging or embracing, have a large impact on the performance of both models.
Extended Data Fig. 7
Extended Data Fig. 7. Face parsing performance by age and facial hair colour.
This figure illustrates face parsing performance across facial hair colour categories for subjects aged 60+ years using the DML-CSR model. It highlights variations in model performance conditioned on facial hair colour, particularly for individuals with white facial hair.
Extended Data Fig. 8
Extended Data Fig. 8. Error rates across hairstyle pairs for face verification models.
This figure shows the percentage of incorrect predictions for face verification using (a) ArcFace, (b) CurricularFace, and (c) FaceNet models. For He/Him/His pronouns, errors are concentrated in cases with non-stereotypical hairstyles, whereas for She/Her/Hers pronouns, errors remain high whenever hairstyle variation within the pair is large. The number on top of each bar in black denotes the ratio of incorrect samples within that subgroup, while the number in red denotes the percentage of individuals with that pronoun who exhibit the corresponding hairstyle combination. This pattern highlights that hairstyle diversity disproportionately impacts error rates for She/Her/Hers pronouns. Error rates are conditioned on hairstyle changes and pronoun groups, underscoring variability in model performance.
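
The subgroup error rates in Extended Data Fig. 8 can be reproduced, in outline, by thresholding embedding similarity for each image pair and then grouping the errors by pronoun and hairstyle combination. The sketch below assumes precomputed unit-norm embeddings, illustrative column names and a cosine-similarity decision rule; the actual models are those named in the caption.

    # Face-verification error rates conditioned on hairstyle pairs (sketch only).
    import numpy as np
    import pandas as pd

    def verification_error_rates(pairs: pd.DataFrame, threshold: float = 0.4) -> pd.Series:
        """pairs columns (assumed): emb_a, emb_b (unit vectors), same_person (bool),
        hairstyle_a, hairstyle_b, pronoun."""
        sims = np.array([np.dot(a, b) for a, b in zip(pairs["emb_a"], pairs["emb_b"])])
        predicted_same = sims >= threshold
        pairs = pairs.assign(
            error=(predicted_same != pairs["same_person"].to_numpy()),
            hair_pair=pairs["hairstyle_a"] + " / " + pairs["hairstyle_b"])
        # Mean error per pronoun and hairstyle combination, as plotted in the figure.
        return pairs.groupby(["pronoun", "hair_pair"])["error"].mean()
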

References

    1. Sambasivan, N. et al. “Everyone wants to do the model work, not the data work”: data cascades in high-stakes AI. In Proc. ACM CHI Conference on Human Factors in Computing Systems (ACM, 2021).
    2. Birhane, A. & Prabhu, V. U. Large image datasets: a pyrrhic win for computer vision? In Proc. IEEE Winter Conference on Applications of Computer Vision (WACV) 1536–1546 (IEEE, 2021).
    3. Andrews, J. T. et al. Ethical considerations for collecting human-centric image datasets. In Proc. Advances in Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B) 55320–55360 (Curran Associates, 2023).
    4. Hundt, A., Agnew, W., Zeng, V., Kacianka, S. & Gombolay, M. Robots enact malignant stereotypes. In Proc. ACM Conference on Fairness, Accountability, and Transparency (FAccT) 743–756 (ACM, 2022).
    5. Wilson, B., Hoffman, J. & Morgenstern, J. Predictive inequity in object detection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (IEEE, 2019).
