How to Set Up Metadata Curation Workflows¶
This guide is for curation administrators: the people responsible for designing a curation workflow. That work includes choosing a JSON schema, deciding whether metadata is record-based or file-based, creating the CurationTask, and reviewing the validation results that contributors submit.
If you're a data contributor opening a task an administrator has already created, see How to Enter and Update Metadata for a Curation Task instead.
What you'll accomplish¶
By following this guide, you will:
- Find and select the right JSON schema for your data type
- Create a record-based or file-based metadata curation workflow
- Configure curation tasks that guide contributors through metadata entry
Prerequisites¶
- A Synapse account with project creation permissions
- A Python environment with synapseclient and the curator extension installed (pip install --upgrade "synapseclient[curator]")
- An existing Synapse project and folder where you want to manage metadata
- A JSON Schema registered in Synapse (many schemas are already available for Sage-affiliated projects, or you can register your own by following the JSON Schema tutorial)
- If you are using the Curator CSV data model, you can create JSON schemas by following this guide
- (Optional) An existing Synapse team if you want multiple users to collaborate on the same Grid session. Pass the team's ID as assignee_principal_id when creating the curation task.
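Before running the steps below, it can be useful to confirm that the installed synapseclient actually ships the curator module. The following is a minimal sketch using only the standard library; the helper name `is_module_available` is hypothetical, and the check only confirms that the module path can be located, not that every optional dependency of the extension is installed.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if `module_name` can be located on the import path.

    Note: for dotted names, Python imports the parent packages as a side effect.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing or broken parent package raises ImportError
        # (ModuleNotFoundError is a subclass of ImportError).
        return False


curator_module = "synapseclient.extensions.curator"
if is_module_available(curator_module):
    print(f"{curator_module} is available")
else:
    print(f'{curator_module} not found; try: pip install --upgrade "synapseclient[curator]"')
```

If the module is reported as missing, reinstalling with the pip command from the prerequisites list is the first thing to try.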
Step 1: Authenticate and import required functions¶
from synapseclient.extensions.curator import (
    create_record_based_metadata_task,
    create_file_based_metadata_task,
    query_schema_registry,
)
from synapseclient import Synapse
from synapseclient.models import Grid
from synapseclient.models.table_components import Query

syn = Synapse()
syn.login()
Step 2: Find the right schema for your data¶
Before creating a curation task, identify which JSON schema matches your data type. Many schemas are already registered in Synapse for Sage-affiliated projects. The schema registry contains validated schemas organized by data coordination center (DCC) and data type.
If you need to register your own schema, follow the JSON Schema tutorial to understand the registration process.
# Find the latest schema for your specific data type
schema_uri = query_schema_registry(
    synapse_client=syn,
    # Your data coordination center; check the `syn69735275` table if you do not know your code
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate",  # Your specific data type
)
print("Latest schema URI:", schema_uri)
When to use this approach: You know your DCC and data type, you want the most current schema version, and the schema has already been registered in the table at https://www.synapse.org/Synapse:syn69735275/tables/.
Alternative - browse available schemas:
# Get all versions to see what's available
all_schemas = query_schema_registry(
    synapse_client=syn,
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate",
    return_latest_only=False,
)
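When browsing several versions, comparing version strings alphabetically can mislead (for example, "1.10.0" sorts before "1.2.0" as text). The sketch below picks the newest entry numerically. It assumes the query returns a list of URI strings and that each URI ends in a `-<major>.<minor>.<patch>` suffix; both are assumptions, and the helper names and example URIs are hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed URI shape: "<organization>-<schema name>-<major>.<minor>.<patch>"
_VERSION_SUFFIX = re.compile(r"-(\d+)\.(\d+)\.(\d+)$")


def parse_version(schema_uri: str) -> Optional[Tuple[int, ...]]:
    """Extract the trailing semantic version as a tuple of ints, or None."""
    match = _VERSION_SUFFIX.search(schema_uri)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def newest_schema_uri(schema_uris: Iterable[str]) -> Optional[str]:
    """Return the URI with the highest numeric version, ignoring unversioned URIs."""
    versioned = []
    for uri in schema_uris:
        version = parse_version(uri)
        if version is not None:
            versioned.append((version, uri))
    if not versioned:
        return None
    # Tuples compare element by element, so (1, 10, 0) > (1, 2, 0).
    return max(versioned)[1]


# Hypothetical URIs, for illustration only
example_uris = [
    "example.org-ExampleTemplate-1.2.0",
    "example.org-ExampleTemplate-1.10.0",
    "example.org-ExampleTemplate-1.0.0",
]
print(newest_schema_uri(example_uris))
```

In most cases the default call from the first example is simpler, since it already asks the registry for the latest version.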
Step 3: Choose your metadata workflow type¶
Note
The way Grid sessions are created in this step will change in the near future, so expect updates to the Grid creation API and to this guide. For now, data contributors should create their own Grids because of how permissions currently work.
Option A: Record-based metadata¶
Use this when metadata is normalized in structured records to eliminate duplication and ensure consistency.
record_set, curation_task, grid = create_record_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",  # Folder where RecordSet Entity will be stored
    record_set_name="AnimalMetadata_Records",
    record_set_description="Centralized metadata for animal study data",
    curation_task_name="AnimalMetadata_Curation",  # Must be unique within the project
    upsert_keys=["StudyKey"],  # Fields that uniquely identify records
    instructions="Complete all required fields according to the schema. Use StudyKey to link records to your data files.",
    schema_uri=schema_uri,  # Schema found in Step 2
    bind_schema_to_record_set=True,
    assignee_principal_id=123456,  # Optional: Assign to a user or team
)
print(f"Created RecordSet: {record_set.id}")
print(f"Created CurationTask: {curation_task.task_id}")
What this creates:
- A RecordSet where metadata is stored as structured records (like a spreadsheet)
- A CurationTask that guides users through completing the metadata
- Automatic schema binding for validation
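To make the role of `upsert_keys` more concrete, here is a small sketch of general upsert semantics on plain Python dictionaries: rows whose key fields match an existing record update it, and all other rows are added. This illustrates the concept only and is not how Synapse implements it; the `upsert` helper and the `species` field are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Record = Dict[str, object]


def upsert(
    existing: Iterable[Record],
    incoming: Iterable[Record],
    upsert_keys: Sequence[str],
) -> List[Record]:
    """Merge `incoming` into `existing`, matching records on `upsert_keys`."""
    merged: Dict[Tuple[object, ...], Record] = {}
    for record in list(existing) + list(incoming):
        # The values of the key fields identify a record uniquely.
        key = tuple(record[field] for field in upsert_keys)
        # Later values overwrite earlier ones for the same key.
        merged[key] = {**merged.get(key, {}), **record}
    return list(merged.values())


current = [
    {"StudyKey": "S1", "species": "mouse"},
    {"StudyKey": "S2", "species": "rat"},
]
updates = [
    {"StudyKey": "S2", "species": "mouse"},  # same key: updates S2
    {"StudyKey": "S3", "species": "rat"},    # new key: inserted
]
result = upsert(current, updates, upsert_keys=["StudyKey"])
print(result)
```

Choosing a key that is stable and truly unique per record (here `StudyKey`) is what keeps repeated submissions from creating duplicates.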
Option B: File-based metadata (for unique per-file metadata)¶
Use this when metadata describes individual data files and is stored as annotations directly on each file.
entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",  # Folder containing your data files
    curation_task_name="FileMetadata_Curation",  # Must be unique within the project
    instructions="Annotate each file with metadata according to the schema requirements.",
    attach_wiki=False,  # If True, creates a wiki in the folder with the entity view (defaults to False)
    entity_view_name="Animal Study Files View",
    schema_uri=schema_uri,  # Schema found in Step 2
    assignee_principal_id=123456,  # Optional: Assign to a user or team
)
print(f"Created EntityView: {entity_view_id}")
print(f"Created CurationTask: {task_id}")
What this creates:
- An EntityView that displays all files in the folder
- A CurationTask for guided metadata entry
- Automatic schema binding to the folder for validation
- Optional wiki attached to the folder
- A Grid session for interactive metadata editing
Complete example script¶
Here's the full script that demonstrates both workflow types:
from synapseclient.extensions.curator import (
    create_record_based_metadata_task,
    create_file_based_metadata_task,
    query_schema_registry,
)
from synapseclient import Synapse

# Step 1: Authenticate
syn = Synapse()
syn.login()

# Step 2: Find schema
schema_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate",
)
print("Using schema:", schema_uri)

# Step 3A: Create record-based workflow
record_set, curation_task, grid = create_record_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",
    record_set_name="AnimalMetadata_Records",
    record_set_description="Centralized animal study metadata",
    curation_task_name="AnimalMetadata_Curation",
    upsert_keys=["StudyKey"],
    instructions="Complete metadata for all study animals using StudyKey to link records to data files.",
    schema_uri=schema_uri,
    bind_schema_to_record_set=True,
    assignee_principal_id=123456,  # Optional: Assign to a user or team
)
print("Record-based workflow created:")
print(f"  RecordSet: {record_set.id}")
print(f"  CurationTask: {curation_task.task_id}")

# Step 3B: Create file-based workflow
entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",
    curation_task_name="FileMetadata_Curation",
    instructions="Annotate each file with complete metadata according to schema.",
    attach_wiki=True,
    entity_view_name="Animal Study Files View",
    schema_uri=schema_uri,
    assignee_principal_id=123456,  # Optional: Assign to a user or team
)
print("File-based workflow created:")
print(f"  EntityView: {entity_view_id}")
print(f"  CurationTask: {task_id}")
Additional utilities¶
Validate schema binding on folders¶
Use this script to check the schema bound to a folder against the items contained within that folder:
from synapseclient import Synapse
from synapseclient.models import Folder

# The Synapse ID of the Folder to validate. This should be a Folder that already has a JSON Schema bound to it.
FOLDER_ID = ""

syn = Synapse()
syn.login()

folder = Folder(id=FOLDER_ID).get()
schema_validation = folder.validate_schema()
print(f"Schema validation result for folder {FOLDER_ID}: {schema_validation}")
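The utility scripts in this section ship with empty ID placeholders, so a quick local check can catch a forgotten or mistyped ID before any request is made. This is a minimal sketch assuming the common form of a Synapse ID ("syn" followed by digits, with no version suffix); the helper name `is_synapse_id` is hypothetical and is not part of synapseclient.

```python
import re

# Assumed form: the literal prefix "syn" followed by one or more digits.
_SYNAPSE_ID = re.compile(r"syn\d+")


def is_synapse_id(value: str) -> bool:
    """Return True if `value` looks like an unversioned Synapse ID."""
    return _SYNAPSE_ID.fullmatch(value.strip()) is not None


for candidate in ["syn987654321", "", "987654321"]:
    status = "looks valid" if is_synapse_id(candidate) else "is not a Synapse ID"
    print(f"{candidate!r} {status}")
```

A check like this only validates the format; whether the entity exists and is a Folder is still decided by Synapse when the script runs.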
List existing curation tasks¶
Use this script to see all curation tasks in a project:
from pprint import pprint
from synapseclient import Synapse
from synapseclient.models.curation import CurationTask

PROJECT_ID = ""  # The Synapse ID of the project to list tasks from

syn = Synapse()
syn.login()

# Print every curation task in the project
for curation_task in CurationTask.list(project_id=PROJECT_ID):
    pprint(curation_task)
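Because curation task names must be unique within a project, a listing like the one above can feed a simple name check before a new task is created. The sketch below works on plain strings only; how the names are read from the listed task objects is deliberately left out, since the relevant attribute is not confirmed here. The helper name `make_unique_name` and the suffix scheme are hypothetical.

```python
from typing import Iterable


def make_unique_name(candidate: str, existing_names: Iterable[str]) -> str:
    """Return `candidate`, or `candidate_<n>` if the name is already taken."""
    taken = set(existing_names)
    if candidate not in taken:
        return candidate
    suffix = 2
    # Increase the numeric suffix until an unused name is found.
    while f"{candidate}_{suffix}" in taken:
        suffix += 1
    return f"{candidate}_{suffix}"


existing = ["AnimalMetadata_Curation", "FileMetadata_Curation"]
print(make_unique_name("AnimalMetadata_Curation", existing))
```

Whether to rename automatically or to stop and ask the administrator is a design choice; failing loudly may be safer when task names are referenced elsewhere.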
References¶
API Documentation¶
- query_schema_registry - Search for schemas in the registry
- create_record_based_metadata_task - Create RecordSet-based curation workflows
- create_file_based_metadata_task - Create EntityView-based curation workflows
- RecordSet.get_detailed_validation_results - Get detailed validation results for RecordSet data
- Grid.create - Create a Grid session from a RecordSet or EntityView
- Grid.export_to_record_set - Export Grid data back to RecordSet and generate validation results
- Folder.bind_schema - Bind schemas to folders
- Folder.validate_schema - Validate folder schema compliance
- CurationTask.list - List curation tasks in a project
Related Documentation¶
- How to Enter and Update Metadata for a Curation Task - The contributor-facing companion to this guide
- JSON Schema Tutorial - Learn how to register schemas
- Schema Registry - Browse available schemas