# Working with Datasets

A dataset contains the data that participants will annotate in an AI Task Builder batch. This page covers dataset creation, upload, and advanced configuration options. For the complete batch workflow, see [Working with Batches](/api-reference/ai-task-builder/batches).

## Creating a dataset

```bash
POST /api/v1/data-collection/datasets
```

```json
{
  "name": "Product reviews Q4 2024",
  "workspace_id": "6278acb09062db3b35bcbeb0"
}
```

| Field          | Type   | Required | Description                      |
| -------------- | ------ | -------- | -------------------------------- |
| `name`         | string | Yes      | A name for your dataset          |
| `workspace_id` | string | Yes      | The ID of the Prolific workspace |

### Response

```json
{
  "id": "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d",
  "name": "Product reviews Q4 2024",
  "status": "UNINITIALISED"
}
```

## Uploading data

Upload your dataset as a CSV file using presigned URLs.

### Step 1: Request a presigned URL

```bash
GET /api/v1/data-collection/datasets/{dataset_id}/upload-url/{filename}
```

For example:

```bash
GET /api/v1/data-collection/datasets/0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d/upload-url/reviews.csv
```

### Step 2: Upload to S3

Use the presigned URL from the response to upload your CSV file directly to S3.

```bash
curl -X PUT \
  -H "Content-Type: text/csv" \
  --data-binary @reviews.csv \
  "{presigned_url}"
```

## CSV format

Your CSV should contain one row per datapoint. Each column is displayed to participants alongside the instructions.

```csv
id,review_text,product_name,rating
1,"Great product, exactly what I needed!",Widget Pro,5
2,"Arrived damaged, very disappointed",Widget Pro,1
3,"Works as expected, nothing special",Basic Widget,3
```

## Metadata columns

Columns prefixed with `META_` are not displayed to participants. Use these for internal data you need in your results but don't want participants to see.
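As a rough illustration of this visibility rule, a short script can partition a header row by the `META_` prefix. This is a local sketch of the convention, not part of the API; the `split_columns` helper is hypothetical:

```python
import csv
import io

# Sample CSV with two hidden (META_-prefixed) columns.
CSV_TEXT = """id,review_text,META_source,META_timestamp
1,"Great product!",amazon,2024-01-15T10:30:00Z
"""

def split_columns(header):
    """Partition column names into participant-visible and hidden lists."""
    visible = [c for c in header if not c.startswith("META_")]
    hidden = [c for c in header if c.startswith("META_")]
    return visible, hidden

reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)
visible, hidden = split_columns(header)
print(visible)  # ['id', 'review_text']
print(hidden)   # ['META_source', 'META_timestamp']
```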
```csv
id,review_text,META_source,META_timestamp
1,"Great product!",amazon,2024-01-15T10:30:00Z
2,"Not worth it",trustpilot,2024-01-16T14:22:00Z
```

In this example, participants see only the `id` and `review_text` columns. The `META_source` and `META_timestamp` columns are included in your results but hidden during annotation.

## Custom task grouping

By default, tasks are grouped randomly when you set up a batch (using the `tasks_per_group` parameter). To define your own groupings, include a `META_TASK_GROUP_ID` column in your CSV.

Rows with the same `META_TASK_GROUP_ID` value will be grouped together into a single task group. Participants complete all tasks within a group in one submission.

```csv
id,review_text,product_name,META_TASK_GROUP_ID
1,"Great product!",Widget Pro,widget_pro_reviews
2,"Excellent quality",Widget Pro,widget_pro_reviews
3,"Not worth the price",Basic Widget,basic_widget_reviews
4,"Does the job",Basic Widget,basic_widget_reviews
```

In this example, tasks 1 and 2 are grouped together, as are tasks 3 and 4. A participant assigned to the `widget_pro_reviews` group will annotate both reviews in a single submission.

If your dataset includes `META_TASK_GROUP_ID`, these groupings take precedence over the `tasks_per_group` parameter during batch setup.

## Dataset status

Poll the dataset endpoint to check processing status.

```bash
GET /api/v1/data-collection/datasets/{dataset_id}
```

| Status          | Description                                |
| --------------- | ------------------------------------------ |
| `UNINITIALISED` | Dataset created but no data uploaded       |
| `PROCESSING`    | Dataset is being processed                 |
| `READY`         | Dataset is ready to be attached to a batch |
| `ERROR`         | Something went wrong during processing     |

Wait for the status to reach `READY` before creating a batch with this dataset.
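The polling step can be sketched as a small helper that keeps checking until the dataset is `READY`. This is a minimal sketch: `fetch_status` is a placeholder for your own authenticated GET against the endpoint above, and the poll interval and retry limit are assumptions:

```python
import time

def wait_until_ready(fetch_status, interval_s: float = 5.0, max_polls: int = 120) -> str:
    """Poll fetch_status() until the dataset reaches READY.

    fetch_status is any callable returning the dataset's current status
    string, e.g. a wrapper around
    GET /api/v1/data-collection/datasets/{dataset_id}.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status == "READY":
            return status
        if status == "ERROR":
            raise RuntimeError("dataset processing failed")
        time.sleep(interval_s)  # wait before polling again
    raise TimeoutError("dataset did not become READY in time")

# Example with a stubbed status sequence standing in for real API calls:
statuses = iter(["UNINITIALISED", "PROCESSING", "READY"])
print(wait_until_ready(lambda: next(statuses), interval_s=0))  # READY
```

Taking the fetch callable as a parameter keeps the loop independent of any particular HTTP client, and makes it trivial to test against a stubbed sequence as shown.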