Working with Datasets

A dataset contains the data that participants will annotate in an AI Task Builder Batch. This page covers dataset creation, upload, and advanced configuration options.

For the complete batch workflow, see Working with Batches.

Creating a dataset

POST /api/v1/data-collection/datasets

{
  "name": "Product reviews Q4 2024",
  "workspace_id": "6278acb09062db3b35bcbeb0"
}

Field          Type    Required  Description
name           string  Yes       A name for your dataset
workspace_id   string  Yes       The ID of the Prolific workspace

Response

{
  "id": "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d",
  "name": "Product reviews Q4 2024",
  "status": "UNINITIALISED"
}
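
If you prefer to script this step, the same request in Python might look like the sketch below. The api.prolific.com host and the token-style Authorization header are assumptions, not something this page documents; substitute whatever base URL and credentials your integration already uses.

import os
import requests

# Assumed base URL and auth header format; adjust to your own setup.
BASE_URL = "https://api.prolific.com"
HEADERS = {"Authorization": f"Token {os.environ['PROLIFIC_API_TOKEN']}"}

payload = {
    "name": "Product reviews Q4 2024",
    "workspace_id": "6278acb09062db3b35bcbeb0",
}
response = requests.post(
    f"{BASE_URL}/api/v1/data-collection/datasets",
    json=payload,
    headers=HEADERS,
)
response.raise_for_status()
dataset = response.json()
print(dataset["id"], dataset["status"])  # newly created datasets start as UNINITIALISED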

Uploading data

Upload your dataset as a CSV file using a presigned URL.

Step 1: Request a presigned URL

GET /api/v1/data-collection/datasets/{dataset_id}/upload-url/{filename}

For example:

GET /api/v1/data-collection/datasets/0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d/upload-url/reviews.csv
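
In Python, the same request might look like the sketch below. This page doesn't show the shape of the response body, so the "upload_url" field name is a placeholder assumption; inspect the response you receive to find the presigned URL.

import os
import requests

BASE_URL = "https://api.prolific.com"  # assumed host, as in the earlier sketch
HEADERS = {"Authorization": f"Token {os.environ['PROLIFIC_API_TOKEN']}"}

dataset_id = "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d"
filename = "reviews.csv"

response = requests.get(
    f"{BASE_URL}/api/v1/data-collection/datasets/{dataset_id}/upload-url/{filename}",
    headers=HEADERS,
)
response.raise_for_status()
presigned_url = response.json()["upload_url"]  # hypothetical field name; check the actual response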

Step 2: Upload to S3

Use the presigned URL from the response to upload your CSV file directly to S3.

curl -X PUT \
  -H "Content-Type: text/csv" \
  --data-binary @reviews.csv \
  "{presigned_url}"

CSV format

Your CSV should contain one row per datapoint. Each column is displayed to participants alongside the instructions.

id,review_text,product_name,rating
1,"Great product, exactly what I needed!",Widget Pro,5
2,"Arrived damaged, very disappointed",Widget Pro,1
3,"Works as expected, nothing special",Basic Widget,3

Metadata columns

Columns prefixed with META_ are not displayed to participants. Use these for internal data you need in your results but don’t want participants to see.

id,review_text,META_source,META_timestamp
1,"Great product!",amazon,2024-01-15T10:30:00Z
2,"Not worth it",trustpilot,2024-01-16T14:22:00Z

In this example, participants see only the id and review_text columns. The META_source and META_timestamp columns are included in your results but hidden during annotation.

Custom task grouping

By default, tasks are grouped randomly when you set up a batch (using the tasks_per_group parameter). To define your own groupings, include a META_TASK_GROUP_ID column in your CSV.

Rows with the same META_TASK_GROUP_ID value will be grouped together into a single task group. Participants complete all tasks within a group in one submission.

id,review_text,product_name,META_TASK_GROUP_ID
1,"Great product!",Widget Pro,widget_pro_reviews
2,"Excellent quality",Widget Pro,widget_pro_reviews
3,"Not worth the price",Basic Widget,basic_widget_reviews
4,"Does the job",Basic Widget,basic_widget_reviews

In this example, tasks 1 and 2 are grouped together, as are tasks 3 and 4. A participant assigned to the widget_pro_reviews group will annotate both reviews in a single submission.

If your dataset includes META_TASK_GROUP_ID, these groupings take precedence over the tasks_per_group parameter during batch setup.
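
One way to populate META_TASK_GROUP_ID is to derive it from an existing column when you write the CSV. The sketch below groups reviews by product_name; the naming rule is just an illustration.

import csv

reviews = [
    {"id": 1, "review_text": "Great product!", "product_name": "Widget Pro"},
    {"id": 2, "review_text": "Excellent quality", "product_name": "Widget Pro"},
    {"id": 3, "review_text": "Not worth the price", "product_name": "Basic Widget"},
    {"id": 4, "review_text": "Does the job", "product_name": "Basic Widget"},
]

with open("reviews_grouped.csv", "w", newline="") as f:
    fieldnames = ["id", "review_text", "product_name", "META_TASK_GROUP_ID"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in reviews:
        # e.g. "Widget Pro" -> "widget_pro_reviews"
        group_id = row["product_name"].lower().replace(" ", "_") + "_reviews"
        writer.writerow({**row, "META_TASK_GROUP_ID": group_id})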

Dataset status

Poll the dataset endpoint to check processing status.

GET /api/v1/data-collection/datasets/{dataset_id}

Status         Description
UNINITIALISED  Dataset created but no data uploaded
PROCESSING     Dataset is being processed
READY          Dataset is ready to be attached to a batch
ERROR          Something went wrong during processing

Wait for the status to reach READY before creating a batch with this dataset.
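
A simple polling loop, again assuming the host and auth header from the earlier sketches; the five-second interval is an arbitrary choice.

import os
import time
import requests

BASE_URL = "https://api.prolific.com"  # assumed host
HEADERS = {"Authorization": f"Token {os.environ['PROLIFIC_API_TOKEN']}"}
dataset_id = "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d"

while True:
    response = requests.get(
        f"{BASE_URL}/api/v1/data-collection/datasets/{dataset_id}",
        headers=HEADERS,
    )
    response.raise_for_status()
    status = response.json()["status"]
    if status == "READY":
        break
    if status == "ERROR":
        raise RuntimeError("Dataset processing failed")
    time.sleep(5)  # arbitrary polling interval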