Working with Datasets

A dataset contains the data that participants will annotate in an AI Task Builder Batch. This page covers dataset creation, upload, and advanced configuration options.

For the complete batch workflow, see Working with Batches.

Creating a dataset

POST /api/v1/data-collection/datasets

{
  "name": "Product reviews Q4 2024",
  "workspace_id": "6278acb09062db3b35bcbeb0"
}

Field          Type    Required  Description
name           string  Yes       A name for your dataset
workspace_id   string  Yes       The ID of the Prolific workspace

Response

{
  "id": "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d",
  "name": "Product reviews Q4 2024",
  "status": "UNINITIALISED"
}
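
If you prefer to script this step, the same request in Python might look like the sketch below. The api.prolific.com host and the token-style Authorization header are assumptions, not something this page documents; substitute whatever base URL and credentials your integration already uses.

import os
import requests

# Assumed base URL and auth header format; adjust to your own setup.
BASE_URL = "https://api.prolific.com"
HEADERS = {"Authorization": f"Token {os.environ['PROLIFIC_API_TOKEN']}"}

payload = {
    "name": "Product reviews Q4 2024",
    "workspace_id": "6278acb09062db3b35bcbeb0",
}
response = requests.post(
    f"{BASE_URL}/api/v1/data-collection/datasets",
    json=payload,
    headers=HEADERS,
)
response.raise_for_status()
dataset = response.json()
print(dataset["id"], dataset["status"])  # newly created datasets start as UNINITIALISED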

Uploading data

Upload your dataset as a CSV file using a presigned URL.

Step 1: Request a presigned URL

GET /api/v1/data-collection/datasets/{dataset_id}/upload-url/{filename}

For example:

GET /api/v1/data-collection/datasets/0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d/upload-url/reviews.csv
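
In Python, the same request might look like the sketch below. This page doesn't show the shape of the response body, so the "upload_url" field name is a placeholder assumption; inspect the response you receive to find the presigned URL.

import os
import requests

BASE_URL = "https://api.prolific.com"  # assumed host, as in the earlier sketch
HEADERS = {"Authorization": f"Token {os.environ['PROLIFIC_API_TOKEN']}"}

dataset_id = "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d"
filename = "reviews.csv"

response = requests.get(
    f"{BASE_URL}/api/v1/data-collection/datasets/{dataset_id}/upload-url/{filename}",
    headers=HEADERS,
)
response.raise_for_status()
presigned_url = response.json()["upload_url"]  # hypothetical field name; check the actual response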

Step 2: Upload to S3

Use the presigned URL from the response to upload your CSV file directly to S3.

curl -X PUT \
  -H "Content-Type: text/csv" \
  --data-binary @reviews.csv \
  "{presigned_url}"

CSV format

Your CSV should contain one row per datapoint. Each column is displayed to participants alongside the instructions.

id,review_text,product_name,rating
1,"Great product, exactly what I needed!",Widget Pro,5
2,"Arrived damaged, very disappointed",Widget Pro,1
3,"Works as expected, nothing special",Basic Widget,3

Metadata columns

Columns prefixed with META_ are not displayed to participants. Use these for internal data you need in your results but don’t want participants to see.

id,review_text,META_source,META_timestamp
1,"Great product!",amazon,2024-01-15T10:30:00Z
2,"Not worth it",trustpilot,2024-01-16T14:22:00Z

In this example, participants see only the id and review_text columns. The META_source and META_timestamp columns are included in your results but hidden during annotation.

Custom task grouping

By default, tasks are grouped randomly when you set up a batch (using the tasks_per_group parameter). To define your own groupings, include a META_TASK_GROUP_ID column in your CSV.

Rows with the same META_TASK_GROUP_ID value will be grouped together into a single task group. Participants complete all tasks within a group in one submission.

id,review_text,product_name,META_TASK_GROUP_ID
1,"Great product!",Widget Pro,widget_pro_reviews
2,"Excellent quality",Widget Pro,widget_pro_reviews
3,"Not worth the price",Basic Widget,basic_widget_reviews
4,"Does the job",Basic Widget,basic_widget_reviews

In this example, tasks 1 and 2 are grouped together, as are tasks 3 and 4. A participant assigned to the widget_pro_reviews group will annotate both reviews in a single submission.

If your dataset includes META_TASK_GROUP_ID, these groupings take precedence over the tasks_per_group parameter during batch setup.
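
One way to populate META_TASK_GROUP_ID is to derive it from an existing column when you write the CSV. The sketch below groups reviews by product_name; the naming rule is just an illustration.

import csv

reviews = [
    {"id": 1, "review_text": "Great product!", "product_name": "Widget Pro"},
    {"id": 2, "review_text": "Excellent quality", "product_name": "Widget Pro"},
    {"id": 3, "review_text": "Not worth the price", "product_name": "Basic Widget"},
    {"id": 4, "review_text": "Does the job", "product_name": "Basic Widget"},
]

with open("reviews_grouped.csv", "w", newline="") as f:
    fieldnames = ["id", "review_text", "product_name", "META_TASK_GROUP_ID"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in reviews:
        # e.g. "Widget Pro" -> "widget_pro_reviews"
        group_id = row["product_name"].lower().replace(" ", "_") + "_reviews"
        writer.writerow({**row, "META_TASK_GROUP_ID": group_id})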

Dataset status

Poll the dataset endpoint to check processing status.

GET /api/v1/data-collection/datasets/{dataset_id}

Status         Description
UNINITIALISED  Dataset created but no data uploaded
PROCESSING     Dataset is being processed
READY          Dataset is ready to be attached to a batch
ERROR          Something went wrong during processing

Wait for the status to reach READY before creating a batch with this dataset.
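
A simple polling loop, again assuming the host and auth header from the earlier sketches; the five-second interval is an arbitrary choice.

import os
import time
import requests

BASE_URL = "https://api.prolific.com"  # assumed host
HEADERS = {"Authorization": f"Token {os.environ['PROLIFIC_API_TOKEN']}"}
dataset_id = "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d"

while True:
    response = requests.get(
        f"{BASE_URL}/api/v1/data-collection/datasets/{dataset_id}",
        headers=HEADERS,
    )
    response.raise_for_status()
    status = response.json()["status"]
    if status == "READY":
        break
    if status == "ERROR":
        raise RuntimeError("Dataset processing failed")
    time.sleep(5)  # arbitrary polling interval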