Introduction

A detailed guide on using Argilla to create an open source RLHF dataset.

Author: Nora Petrova @ Prolific


Background

We set out to release an open source RLHF dataset on HuggingFace in order to contribute to community efforts towards aligning AI models with human preferences. We researched existing RLHF datasets in the public domain and couldn't find many, so we decided it was worthwhile to develop and release one. Specifically, we determined that a dataset aimed at capturing social reasoning was a suitable area to explore, as RLHF datasets are most useful when it comes to understanding human behaviour.


Objectives

Our main objective was to release a high-quality RLHF dataset, produced by sourcing human labels in an ethical way, and thus provide the community with a dataset that can be used to fine-tune LLMs on tasks that require social reasoning skills.

Our secondary objective was to identify the steps needed to develop a dataset of this kind on the Prolific platform and to produce a guide. This document serves as such a guide for anyone wanting to create a similar dataset on Prolific.


Scope and Limitations

The scope of this project was to design and run two tasks on Prolific and collect data using Argilla as the data collection tool. Additionally, it involved performing data validation, cleaning and preparation, and releasing the dataset to HuggingFace.

The limitations we had were as follows:

  • There was no existing integration between Prolific and Argilla
  • We had limited communication with the participants and there was no onboarding of participants to the tasks

Research Questions

The research questions we set out to answer were:

  • Can labels provided by participants on the Prolific platform be used for an RLHF dataset?
  • Does the resultant RLHF dataset help models fine-tuned on it perform better on social reasoning tasks, as measured on a benchmark dataset?
  • What does a frictionless integration between Prolific and Argilla look like?

Study Designs

Firstly, we developed a set of questions across various social reasoning sub-tasks to include in the dataset. The sub-tasks fall within the following categories:

  • understanding of emotions
  • intent recognition
  • social norms
  • social responsibility
  • reading of social cues
  • perspective taking
  • conflict resolution
  • ethics
  • moral judgement
  • communication skills
  • negotiation strategies
  • understanding of empathy
  • understanding of compassion
  • understanding of trust
  • understanding and use of humour
  • showing kindness
  • navigating diversity and cultural differences
  • use of figurative language
  • self-awareness

The development process involved manually curating and generating questions, as well as using LLMs to help us explore the space further and arrive at more nuanced questions that were aimed at understanding various aspects of human behaviour in social situations and environments.
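
As an illustration of the LLM-assisted part of this process, the sketch below prompts an LLM for candidate questions for a single sub-task. The client library, model name and prompt wording are assumptions for illustration only, not the exact setup we used.

Python

# illustrative sketch only: the model choice and prompt wording are assumptions
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_candidate_questions(sub_task, n=5):
    """Ask an LLM for candidate questions probing one social reasoning sub-task."""
    prompt = (
        "Suggest {} open-ended questions that probe a person's {} "
        "in everyday social situations. Return one question per line.".format(n, sub_task)
    )
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]

candidates = generate_candidate_questions("intent recognition")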


Study 1: Written Responses

The objective of the first task was to collect written responses to our questions from participants on the Prolific platform. We produced instructions and guidelines that outlined the expectations we had of participants. The guidelines requested that participants approach answering the questions with the following principles in mind:

  • Respectful, Honest & Authentic: Always respond with utmost respect, honesty and authenticity. Avoid derogatory, offensive or disrespectful language. Tread lightly upon sensitive topics and ensure that your words reflect your understanding, consideration and respect for the subject matter.
  • Empathy and Inclusivity: Ensure that your responses are empathetic. Try to understand and consider other people's feelings, thoughts, and experiences while responding. Be inclusive in your language, avoid making generalisations about groups of people, and make sure your response considers a diverse range of perspectives.
  • Neutrality: Present your views and explanations in a neutral language. Your goal is not to take a side but rather to provide an honest and respectful response. Avoid introducing your own biases into your response.
  • Explanation of Reasoning: When presenting your views, ensure you back them up with sound reasoning. Clarify why you hold these views.

The participants we sampled for this task fell into two groups:

  • Group 1: all demographic groups, restricted to participants with a 100% approval rate who had completed 250 or more studies on Prolific.
  • Group 2: participants who were shortlisted for AI studies.

We asked the participants 1000 questions spread across the sub-tasks listed above. A small subsample of the questions we asked included:

  • When interacting with someone who is upset, how do you approach the situation?
  • How do you decide when to share a personal story during a conversation?
  • How would you explain empathy to a young child?
  • How do you think culture influences our understanding and perceptions of gender expression?
  • How do you interpret a person's body language during a conversation?
  • How would you determine if someone is showing interest or merely being polite?

During the study, participants were shown a question and given a text box to write their response. The instructions were available for reference at all times. We collected 4 responses per question and recruited around 400 participants.

We also asked the participants to rate the quality of the questions on a scale of 1-5. The mean quality score was 4.08 and the median score was 4.


Study 2: Rating Responses

The objective of the second task was to collect ratings of the written responses collected in the first study. As in the first study, we provided instructions and guidelines that outlined the expectations we had of participants. The guidelines requested that participants rate the responses with the following principles in mind:

  • Respectful: Responses should demonstrate respectful communication, recognising and respecting the dignity and opinions of all individuals involved in the conversation.
  • Honest: Responses should convey the factual and unbiased truth, enabling the respondents to express their honest opinions and views.
  • Authentic: The responses need to be authentic, being true to the person's character and beliefs, showing sincerity and avoiding pretence.
  • Empathetic: Responses should exhibit understanding and share the feelings of others, considering their perspective and emotion to guide the decision-making process.
  • Diversity and Inclusivity: Responses should embrace diversity and inclusivity, recognising and valuing diverse perspectives, and fostering a sense of belonging for all individuals, irrespective of their gender, race, age, religion, disability, or sexual orientation.
  • Neutral and Bias-free language: Responses should avoid language that may be offensive, discriminatory, judgmental, or biased. It should not favour any group or individual over others, reinforcing neutrality and fairness.
  • Reasoned explanation: Responses should provide explanations behind their reasoning or perspectives, elucidating their thoughts or decisions with clear, logical, and adequate reasons.
  • Please note that while it's important for a response to meet all criteria, the lack of one aspect might be compensated with a strong presence in others. Overall, the highest rated answers should reflect the best combination of all requirements.

The participants we sampled for this task fell into two groups:

  • Group 1: all demographic groups, restricted to participants with a 100% approval rate who had completed 250 or more studies on Prolific.
  • Group 2: participants who were shortlisted for AI studies.

Additionally, we filtered out participants who had taken part in the first study. We collected 3 ratings per pair of responses so that we could resolve disagreements between participants.

The rating task was set up in a pairwise comparison manner: participants would see two responses at a time and were asked to rate them on a scale of 1-8. The scale was as follows:

  1. Strong Preference for response A. This is the highest level of preference, indicating a clear and robust inclination towards the option.
  2. Moderate Preference for response A. This might indicate a noticeable but not overpowering preference towards the option.
  3. Weak Preference for response A. This could suggest a minimal or marginal inclination towards the option.
  4. Slight Preference for response A. This is the lowest level of preference, indicating a very subtle or minimal inclination towards the option.
  5. Slight Preference for response B. This is the lowest level of preference, indicating a very subtle or minimal inclination towards the option.
  6. Weak Preference for response B. This could suggest a minimal or marginal inclination towards the option.
  7. Moderate Preference for response B. This might indicate a noticeable but not overpowering preference towards the option.
  8. Strong Preference for response B. This is the highest level of preference, indicating a clear and robust inclination towards the option.
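
For downstream processing, each rating on this scale can be reduced to which response was preferred and how strongly. A minimal sketch of that mapping (the helper name is ours, for illustration):

Python

def interpret_rating(rating):
    """Map a 1-8 pairwise rating to (preferred response, strength 1=slight .. 4=strong)."""
    if not 1 <= rating <= 8:
        raise ValueError("rating must be between 1 and 8")
    if rating <= 4:
        # 1 = strong ... 4 = slight preference for response A
        return "response_a", 5 - rating
    # 5 = slight ... 8 = strong preference for response B
    return "response_b", rating - 4

assert interpret_rating(1) == ("response_a", 4)
assert interpret_rating(5) == ("response_b", 1)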

Data Collection

For data collection we used Argilla, deployed on HF Spaces. We interfaced with the instance through the Argilla SDK. As the integration between Prolific and Argilla, we used an appsmith application that served as a user portal. Whenever a participant signed up for the study, they were redirected to the appsmith URL, which allocated Argilla credentials for that user. The appsmith application contained a database of user credentials and tracked which ones were available and which were allocated. Once a participant had obtained their credentials from the appsmith app, they were redirected to the Argilla instance URL on HF Spaces, where they could log in with their credentials.
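
The allocation logic implemented by the appsmith portal is roughly the following (a sketch in Python rather than in appsmith itself, with a hypothetical credentials table schema):

Python

import sqlite3

# hypothetical schema mirroring the portal's credentials table:
#   credentials(username TEXT, password TEXT, prolific_pid TEXT, allocated INTEGER)
db = sqlite3.connect("credentials.db")

def allocate_credentials(prolific_pid):
    """Hand out one unallocated Argilla login and mark it as taken."""
    row = db.execute(
        "SELECT username, password FROM credentials WHERE allocated = 0 LIMIT 1"
    ).fetchone()
    if row is None:
        raise RuntimeError("no free Argilla credentials left")
    username, password = row
    db.execute(
        "UPDATE credentials SET allocated = 1, prolific_pid = ? WHERE username = ?",
        (prolific_pid, username),
    )
    db.commit()
    return username, password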

For both studies, we created the user accounts on the Argilla platform first and imported them into the database maintained by the appsmith app. Each user had their own workspace containing the subset of dataset records allocated to them. For more information on the technical setup on Argilla, please refer to the Appendix.

Participants on Prolific are allowed to sign up for a study and return it without finishing. This posed a challenge, as a proportion of participants who returned without finishing had already been allocated Argilla credentials. This resulted in credentials in appsmith being marked as unavailable even though the workspace for that participant did not contain any annotations. To manage this, we periodically re-created accounts for participants who had dropped out, using the same dataset records, to ensure that we collected the desired number of responses per question and did not lose data on any questions.
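
Detecting which workspaces belong to participants who returned the study can be automated with the Argilla SDK: a workspace whose records have received no responses can have its records re-issued under a fresh account. A sketch, assuming the FeedbackDataset API and the workspace naming used in the Appendix:

Python

import argilla as rg

def find_abandoned_workspaces(dataset_name):
    """Return names of workspaces whose allocated records received no responses."""
    abandoned = []
    for workspace in rg.Workspace.list():
        # only the per-annotator workspaces created for this study
        if not workspace.name.startswith('socres'):
            continue
        dataset = rg.FeedbackDataset.from_argilla(name=dataset_name, workspace=workspace.name)
        if not any(record.responses for record in dataset.records):
            abandoned.append(workspace.name)
    return abandoned

# records from these workspaces were re-pushed under newly created accounts,
# following the same account creation and workspace management steps as in the Appendix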

After the participants had finished, we collected the data from the workspaces and created a resulting dataset for each study. The dataset from the first study was processed and made available in the right format for the second study. The dataset from the second study was processed by averaging the ratings for each pair of responses and preparing it in an RLHF format containing <question>, <chosen> and <rejected> fields.
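
A minimal sketch of that final step, assuming the ratings have first been flattened into a pandas DataFrame with the question, response1 and response2 columns used in the Appendix plus a numeric rating column; the file names, the 4.5 midpoint and the tie handling are our own illustrative choices:

Python

import pandas as pd

# one row per individual 1-8 rating collected in study 2
ratings = pd.read_csv("study2_ratings.csv")  # columns: question, response1, response2, rating

# average the ratings collected for each pair of responses
averaged = ratings.groupby(
    ["question", "response1", "response2"], as_index=False
)["rating"].mean()

# average ratings at or below the 4.5 midpoint favour response A, above it response B;
# exact ties would need an explicit rule or manual resolution
rlhf = pd.DataFrame({
    "question": averaged["question"],
    "chosen": averaged.apply(
        lambda r: r["response1"] if r["rating"] <= 4.5 else r["response2"], axis=1
    ),
    "rejected": averaged.apply(
        lambda r: r["response2"] if r["rating"] <= 4.5 else r["response1"], axis=1
    ),
})
rlhf.to_csv("rlhf_dataset.csv", index=False)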


Data Validation and Quality Assurance

The responses for each dataset were carefully validated to ensure that data was collected in the desired way. Each workspace was inspected to ensure it contained the records that were allocated to the user given access to it. The workspaces were checked again after the participants had finished labelling to ensure that we had collected the desired number of responses per question and, in the second study, the desired number of ratings per pair of responses.
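
The per-question count check can be automated with the SDK. A sketch for the first study, assuming the FeedbackDataset API, dataset_name and workspace naming used in the Appendix (the second study uses the same check per pair of responses, with a target of 3):

Python

from collections import Counter

import argilla as rg

responses_per_question = 4  # target set for the first study

response_counts = Counter()
for workspace in rg.Workspace.list():
    if not workspace.name.startswith('socres'):
        continue
    dataset = rg.FeedbackDataset.from_argilla(name=dataset_name, workspace=workspace.name)
    for record in dataset.records:
        submitted = [r for r in record.responses if r.status == 'submitted']
        response_counts[record.fields['question']] += len(submitted)

under_collected = {q: n for q, n in response_counts.items() if n < responses_per_question}
print('{} questions below the target of {}'.format(len(under_collected), responses_per_question))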

The responses were passed through a hate speech classifier model, which flagged some responses that contained biased and/or harmful language. We decided to leave these responses in the dataset so that models fine-tuned on it could learn to differentiate between biased/harmful and unbiased/harmless language.
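
A sketch of that screening pass using the transformers text-classification pipeline; the checkpoint named below and the label check are assumptions for illustration, and any publicly available hate speech classifier could be substituted:

Python

from transformers import pipeline

# hypothetical checkpoint choice; label names depend on the model used
classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
)

# `responses` is the list of free-text answers collected in the first study
flagged = []
for text in responses:
    result = classifier(text, truncation=True)[0]
    if result["label"] == "hate":
        flagged.append((text, result["score"]))

print("{} responses flagged for review".format(len(flagged)))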


Considerations for Future Improvements

There are two main areas for improvement that could make a significant difference:

  1. Fully managed integration between Argilla and Prolific
  2. Comprehensive onboarding and training for participants

A source of friction during the studies was the need for manual intervention in managing the integration between Argilla and Prolific, due to the two platforms managing users separately. A way to improve this workflow would be for Prolific to manage user creation on Argilla. This could work in the following way:

  1. A participant signs up for a study on Prolific
  2. An Argilla account is automatically created for them; they are provided with credentials and assigned work to their workspace from the available allocation. This involves keeping track of the work that is yet to be allocated, so that the dataset records added to the user's workspace are in accordance with the dataset's configuration.
  3. If the participant drops out, the records in their workspace are returned to the allocation pool, so that they can be re-allocated to a new participant.

This would ensure that, at any one time, the configuration of the dataset is respected and the allocation of available work is always up to date.
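
A sketch of the bookkeeping such a managed integration would need (all names are illustrative; this is not an existing Prolific or Argilla API):

Python

from collections import deque

class AllocationPool:
    """Tracks which batches of dataset records still need an annotator."""

    def __init__(self, record_batches):
        # each batch is the set of records one participant should label
        self.available = deque(record_batches)
        self.in_progress = {}  # participant id -> batch

    def assign(self, participant_id):
        """On sign-up: hand the participant the next free batch."""
        batch = self.available.popleft()
        self.in_progress[participant_id] = batch
        return batch

    def release(self, participant_id):
        """On drop-out: return the batch to the pool for re-allocation."""
        self.available.append(self.in_progress.pop(participant_id))

    def complete(self, participant_id):
        """On submission: the batch leaves the pool for good."""
        self.in_progress.pop(participant_id)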

An area that could significantly improve the quality of responses provided by participants is having dedicated onboarding per task, as well as general training for AI-related tasks. This would ensure that participants have all the context they require before they start their task and that expectations are clear from the beginning.


Appendix

This section includes code snippets that were used to set up the Argilla instances for the studies.

Study 1: Written Responses code snippets

Initializing the Argilla SDK and connecting it to the instance

Python
import argilla as rg

rg.init(
    api_url="https://<account>-<dataset_name>.hf.space",
    api_key="<api_key>",
    workspace="<workspace_name>"
)

Creating the dataset

Python

import pandas as pd

# dataset containing questions
df = pd.read_csv("<path_to_dataset>")

# a single field showing the question to the annotator
fields = [
    rg.TextField(name="question", title="Question")
]

# a free-text response plus a 1-5 question quality rating
questions = [
    rg.TextQuestion(
        name="response",
        title="Please provide your response:",
        required=True,
        use_markdown=False
    ),
    rg.RatingQuestion(
        name="quality",
        title="Rate the quality of the question",
        description="How well does the question capture the given social situation?\n\n1 (very poorly) - 5 (very well)",
        required=True,
        values=[1, 2, 3, 4, 5]
    )
]

# one record per question
records = [
    rg.FeedbackRecord(fields={'question': r.question})
    for _, r in df.iterrows()
]

dataset = rg.FeedbackDataset(
    fields=fields,
    questions=questions,
    guidelines=guidelines  # the instructions and guidelines text described above
)

dataset.add_records(records)
dataset.push_to_argilla(name=dataset_name, workspace="<workspace_name>")

User account creation

Python

import secrets

n_records = len(df)
responses_per_question = 4
questions_per_user = 10
n_users = int((n_records * responses_per_question) / questions_per_user)

password_length = 20
users_with_passwords = []

# generate a username and a random password for each annotator account
for i in range(n_users):
    password = secrets.token_urlsafe(password_length)
    username = 'socres_000{}'.format(i)
    users_with_passwords.append({'username': username, 'password': password})

# create the annotator accounts on the Argilla instance
for user in users_with_passwords:
    rg.User.create(
        first_name=user['username'],
        username=user['username'],
        password=user['password'],
        role='annotator'
    )

Workspace management

Python

from collections import defaultdict
import random

# annotator accounts created in the previous step
users = [user for user in rg.User.list() if user.role == 'annotator' and user.username.startswith('socres')]

random.shuffle(records)
assignments = defaultdict(list)

# each question needs `responses_per_question` annotations in total
allocations = {i: responses_per_question for i in range(n_records)}
left_to_allocate = n_records * responses_per_question

# repeatedly give each user the questions that still need the most annotations
while left_to_allocate > 0:
    for i in range(n_users):
        choices = sorted([(k, v) for k, v in allocations.items() if v > 0], key=lambda t: t[1], reverse=True)
        question_idxs = [t[0] for t in choices[:questions_per_user]]
        for q_idx in question_idxs:
            allocations[q_idx] -= 1
            left_to_allocate -= 1
            assignments[users[i].username].append(records[q_idx])

assert sum(allocations.values()) == 0

# create a workspace per user and push their allocated records to it
for username, user_records in assignments.items():
    workspace = rg.Workspace.create(username)
    user = rg.User.from_name(username)
    workspace.add_user(user.id)

    dataset = rg.FeedbackDataset(
        guidelines=guidelines,
        fields=fields,
        questions=questions
    )
    dataset.add_records(user_records)
    dataset.push_to_argilla(name=dataset_name, workspace=workspace.name)
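
Once annotation is complete, the responses can be collected back out of the per-user workspaces with the same SDK (the step described in the Data Collection section). A sketch, assuming the dataset name and workspace naming used above; the output column names are our own:

Python

import pandas as pd

rows = []
for workspace in rg.Workspace.list():
    if not workspace.name.startswith('socres'):
        continue
    remote = rg.FeedbackDataset.from_argilla(name=dataset_name, workspace=workspace.name)
    for record in remote.records:
        for response in record.responses:
            if response.status != 'submitted':
                continue
            rows.append({
                'question': record.fields['question'],
                'response': response.values['response'].value,
                'quality': response.values['quality'].value,
                'annotator': workspace.name,
            })

study1_df = pd.DataFrame(rows)
study1_df.to_csv('study1_responses.csv', index=False)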

Study 2: Rating Responses code snippets

Initializing the Argilla SDK and connecting it to the instance

Python

import argilla as rg

rg.init(
    api_url="https://<account>-<dataset_name>.hf.space",
    api_key="<api_key>",
    workspace="<workspace_name>"
)

Creating the dataset

Python

import pandas as pd

# dataset containing questions and responses from study 1,
# formatted for pairwise comparison
df = pd.read_csv("<path_to_dataset>")

fields = [
    rg.TextField(name="question", title="Question"),
    rg.TextField(name="response-a", title="Response A"),
    rg.TextField(name="response-b", title="Response B")
]

questions = [
    rg.RatingQuestion(
        name="preference",
        title="Which response better answers the question, according to the guidelines provided in the instructions above? Please select 1 if response A is much better than response B, and 8 if response B is much better than response A. If the responses are similar, choose a number between 1 and 8 that is closer to the response that you think is better, with 4 and 5 being a slight preference for response A and B, respectively",
        description="The scale is 1 (strong preference for response A) - 4 (slight preference for response A) and 5 (slight preference for response B) - 8 (strong preference for response B)",
        required=True,
        values=[1,2,3,4,5,6,7,8]
    )
]

# build records from hf dataset
records = [
    rg.FeedbackRecord(fields={
        "question": r["question"],
        "response-a": r["response1"],
        "response-b": r["response2"]
    })
    for _, r in df.iterrows()
]

# create dataset
dataset = rg.FeedbackDataset(
    fields=fields,
    questions=questions,
    guidelines=guidelines
)

dataset.add_records(records)
dataset.push_to_argilla(name=dataset_name, workspace="<workspace_name>")

User account creation

Python

import secrets

n_records = len(df)
responses_per_question = 3
questions_per_user = 60
n_users = int((n_records * responses_per_question) / questions_per_user)

password_length = 20
users_with_passwords = []

# generate a username and a random password for each annotator account
for i in range(n_users):
    password = secrets.token_urlsafe(password_length)
    username = 'socres_000{}'.format(i)
    users_with_passwords.append({'username': username, 'password': password})

# create the annotator accounts on the Argilla instance
for user in users_with_passwords:
    rg.User.create(
        first_name=user['username'],
        username=user['username'],
        password=user['password'],
        role='annotator'
    )

Workspace management

Python

from collections import defaultdict
import random

# annotator accounts created in the previous step
users = [user for user in rg.User.list() if user.role == 'annotator' and user.username.startswith('socres')]

random.shuffle(records)
assignments = defaultdict(list)

# each pair of responses needs `responses_per_question` ratings in total
allocations = {i: responses_per_question for i in range(n_records)}
left_to_allocate = n_records * responses_per_question

# repeatedly give each user the records that still need the most ratings
while left_to_allocate > 0:
    for i in range(n_users):
        choices = sorted([(k, v) for k, v in allocations.items() if v > 0], key=lambda t: t[1], reverse=True)
        question_idxs = [t[0] for t in choices[:questions_per_user]]
        for q_idx in question_idxs:
            allocations[q_idx] -= 1
            left_to_allocate -= 1
            assignments[users[i].username].append(records[q_idx])

assert sum(allocations.values()) == 0

# create a workspace per user and push their allocated records to it
for username, user_records in assignments.items():
    workspace = rg.Workspace.create(username)
    user = rg.User.from_name(username)
    workspace.add_user(user.id)

    dataset = rg.FeedbackDataset(
        guidelines=guidelines,
        fields=fields,
        questions=questions
    )
    dataset.add_records(user_records)
    dataset.push_to_argilla(name=dataset_name, workspace=workspace.name)
