# Working with Datasets

A dataset contains the data that participants will annotate in an AI Task Builder batch. This page covers dataset creation, upload, and advanced configuration options. For the complete batch workflow, see [Working with Batches](/api-reference/ai-task-builder/batches).

## Creating a dataset

```bash
POST /api/v1/data-collection/datasets
```

```json
{
  "name": "Product reviews Q4 2024",
  "workspace_id": "6278acb09062db3b35bcbeb0"
}
```

| Field          | Type   | Required | Description                      |
| -------------- | ------ | -------- | -------------------------------- |
| `name`         | string | Yes      | A name for your dataset          |
| `workspace_id` | string | Yes      | The ID of the Prolific workspace |

### Response

```json
{
  "id": "0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d",
  "name": "Product reviews Q4 2024",
  "status": "UNINITIALISED"
}
```

## Uploading data

Upload your dataset as a CSV file using presigned URLs.

### Step 1: Request a presigned URL

```bash
GET /api/v1/data-collection/datasets/{dataset_id}/upload-url/{filename}
```

For example:

```bash
GET /api/v1/data-collection/datasets/0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d/upload-url/reviews.csv
```

### Step 2: Upload to S3

Use the presigned URL from the response to upload your CSV file directly to S3.

```bash
curl -X PUT \
  -H "Content-Type: text/csv" \
  --data-binary @reviews.csv \
  "{presigned_url}"
```

## CSV format

Your CSV should contain one row per datapoint. Each column is displayed to participants alongside the instructions.

```csv
id,review_text,product_name,rating
1,"Great product, exactly what I needed!",Widget Pro,5
2,"Arrived damaged, very disappointed",Widget Pro,1
3,"Works as expected, nothing special",Basic Widget,3
```

## Metadata columns

Columns prefixed with `META_` are not displayed to participants. Use these for internal data you need in your results but don't want participants to see.
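As a rough illustration of this visibility rule, a short script can partition a header row by the `META_` prefix. This is a local sketch of the convention, not part of the API; the `split_columns` helper is hypothetical:

```python
import csv
import io

# Sample CSV with two hidden (META_-prefixed) columns.
CSV_TEXT = """id,review_text,META_source,META_timestamp
1,"Great product!",amazon,2024-01-15T10:30:00Z
"""

def split_columns(header):
    """Partition column names into participant-visible and hidden lists."""
    visible = [c for c in header if not c.startswith("META_")]
    hidden = [c for c in header if c.startswith("META_")]
    return visible, hidden

reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)
visible, hidden = split_columns(header)
print(visible)  # ['id', 'review_text']
print(hidden)   # ['META_source', 'META_timestamp']
```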
```csv
id,review_text,META_source,META_timestamp
1,"Great product!",amazon,2024-01-15T10:30:00Z
2,"Not worth it",trustpilot,2024-01-16T14:22:00Z
```

In this example, participants see only the `id` and `review_text` columns. The `META_source` and `META_timestamp` columns are included in your results but hidden during annotation.

## Custom task grouping

By default, tasks are grouped randomly when you set up a batch (using the `tasks_per_group` parameter). To define your own groupings, include a `META_TASK_GROUP_ID` column in your CSV.

Rows with the same `META_TASK_GROUP_ID` value will be grouped together into a single task group. Participants complete all tasks within a group in one submission.

```csv
id,review_text,product_name,META_TASK_GROUP_ID
1,"Great product!",Widget Pro,widget_pro_reviews
2,"Excellent quality",Widget Pro,widget_pro_reviews
3,"Not worth the price",Basic Widget,basic_widget_reviews
4,"Does the job",Basic Widget,basic_widget_reviews
```

In this example, tasks 1 and 2 are grouped together, as are tasks 3 and 4. A participant assigned to the `widget_pro_reviews` group will annotate both reviews in a single submission.

If your dataset includes `META_TASK_GROUP_ID`, these groupings take precedence over the `tasks_per_group` parameter during batch setup.

## Dataset status

Poll the dataset endpoint to check processing status.

```bash
GET /api/v1/data-collection/datasets/{dataset_id}
```

| Status          | Description                                |
| --------------- | ------------------------------------------ |
| `UNINITIALISED` | Dataset created but no data uploaded       |
| `PROCESSING`    | Dataset is being processed                 |
| `READY`         | Dataset is ready to be attached to a batch |
| `ERROR`         | Something went wrong during processing     |

Wait for the status to reach `READY` before creating a batch with this dataset.
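The polling step can be sketched as a small helper that keeps checking until the dataset is `READY`. This is a minimal sketch: `fetch_status` is a placeholder for your own authenticated GET against the endpoint above, and the poll interval and retry limit are assumptions:

```python
import time

def wait_until_ready(fetch_status, interval_s: float = 5.0, max_polls: int = 120) -> str:
    """Poll fetch_status() until the dataset reaches READY.

    fetch_status is any callable returning the dataset's current status
    string, e.g. a wrapper around
    GET /api/v1/data-collection/datasets/{dataset_id}.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status == "READY":
            return status
        if status == "ERROR":
            raise RuntimeError("dataset processing failed")
        time.sleep(interval_s)  # wait before polling again
    raise TimeoutError("dataset did not become READY in time")

# Example with a stubbed status sequence standing in for real API calls:
statuses = iter(["UNINITIALISED", "PROCESSING", "READY"])
print(wait_until_ready(lambda: next(statuses), interval_s=0))  # READY
```

Taking the fetch callable as a parameter keeps the loop independent of any particular HTTP client, and makes it trivial to test against a stubbed sequence as shown.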