# coco2017

**Repository Path**: hf-datasets/coco2017

## Basic Information

- **Project Name**: coco2017
- **Description**: Mirror of https://huggingface.co/datasets/phiyodr/coco2017
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2023-10-30
- **Last Updated**: 2025-02-08

## README

---
language:
- en
pretty_name: COCO2017
size_categories:
- 100K<n<1M
task_categories:
- image-to-text
task_ids:
- image-captioning
tags:
- coco
- image-captioning
dataset_info:
  features:
  - name: license
    dtype: int64
  - name: file_name
    dtype: string
  - name: coco_url
    dtype: string
  - name: height
    dtype: int64
  - name: width
    dtype: int64
  - name: date_captured
    dtype: string
  - name: flickr_url
    dtype: string
  - name: image_id
    dtype: int64
  - name: ids
    sequence: int64
  - name: captions
    sequence: string
  splits:
  - name: train
    num_bytes: 64026361
    num_examples: 118287
  - name: validation
    num_bytes: 2684731
    num_examples: 5000
  download_size: 30170127
  dataset_size: 66711092
---
# coco2017

Image-text pairs from [MS COCO2017](https://cocodataset.org/#download).

## Data origin

* Data originates from [cocodataset.org](http://images.cocodataset.org/annotations/annotations_trainval2017.zip)
* For comparison: `coco-karpathy` uses a dense format (several sentences and sendids per row), whereas `coco-karpathy-long` uses a long format with one `sentence` (aka caption) and one `sendid` per row. `coco-karpathy-long` uses the first five sentences and is therefore five times as long as `coco-karpathy`. The COCO2017 datasets come in the same two formats:
    * `phiyodr/coco2017`: One row corresponds to one image with several sentences (captions).
    * `phiyodr/coco2017-long`: One row corresponds to one sentence (aka caption). There are 5 rows (sometimes more) with the same image details.
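
The relationship between the dense and long formats described above can be sketched in plain Python. This is only an illustration, not the script used to build `phiyodr/coco2017-long`: the function name `dense_to_long`, the output keys `id` and `caption`, and the example record are all hypothetical, and the sketch assumes that `ids` and `captions` are parallel lists of equal length.

```python
from typing import Any, Dict, Iterable, List


def dense_to_long(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand each dense row (one image, several captions) into one row per caption."""
    long_rows = []
    for row in rows:
        # Image-level fields are copied unchanged into every output row.
        image_fields = {k: v for k, v in row.items() if k not in ("ids", "captions")}
        # Assumption: `ids` and `captions` are parallel lists of equal length.
        for caption_id, caption in zip(row["ids"], row["captions"]):
            # Output keys `id` and `caption` are hypothetical names.
            long_rows.append({**image_fields, "id": caption_id, "caption": caption})
    return long_rows


# Hypothetical record, invented for illustration only.
example = {
    "image_id": 1,
    "file_name": "train2017/example.jpg",
    "ids": [10, 11],
    "captions": ["a cat", "a cat on a mat"],
}
long_example = dense_to_long([example])
```

Each image contributes as many long rows as it has captions, which is why the long variant has 5 rows (sometimes more) per image.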

## Format

```python
DatasetDict({
    train: Dataset({
        features: ['license', 'file_name', 'coco_url', 'height', 'width', 'date_captured', 'flickr_url', 'image_id', 'ids', 'captions'],
        num_rows: 118287
    })
    validation: Dataset({
        features: ['license', 'file_name', 'coco_url', 'height', 'width', 'date_captured', 'flickr_url', 'image_id', 'ids', 'captions'],
        num_rows: 5000
    })
})
```

## Usage

* Download image data and unzip
  
```bash
cd PATH_TO_IMAGE_FOLDER

wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
# The annotations zip is not needed: everything you need is in load_dataset("phiyodr/coco2017")
# wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

unzip train2017.zip
unzip val2017.zip
```

* Load dataset in Python
  
```python
import os
from datasets import load_dataset
PATH_TO_IMAGE_FOLDER = "COCO2017"  # folder where train2017.zip and val2017.zip were unzipped

def create_full_path(example):
    """Add `image_path`: the image's `file_name` joined onto `PATH_TO_IMAGE_FOLDER`."""
    example["image_path"] = os.path.join(PATH_TO_IMAGE_FOLDER, example["file_name"])
    return example

dataset = load_dataset("phiyodr/coco2017")
dataset = dataset.map(create_full_path)
```
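
Because the images are downloaded separately from the captions, it can help to confirm that every `file_name` resolves to a file on disk before training. The helper below is a minimal sketch: `find_missing_images` is a hypothetical name, not part of `datasets`, and it assumes that `file_name` is a path relative to the image folder, as the `os.path.join` call above implies.

```python
import os
from typing import Any, Dict, Iterable, List


def find_missing_images(rows: Iterable[Dict[str, Any]], image_root: str) -> List[str]:
    """Return the `file_name` of every row whose image is not found under `image_root`."""
    missing = []
    for row in rows:
        # Assumption: `file_name` is relative to `image_root`.
        path = os.path.join(image_root, row["file_name"])
        if not os.path.isfile(path):
            missing.append(row["file_name"])
    return missing
```

It accepts any iterable of dict-like rows, so something like `find_missing_images(dataset["validation"], PATH_TO_IMAGE_FOLDER)` should work once the dataset is loaded. An empty list means every referenced image was found.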