RCD-2020 (Retrieval From Conversational Dialogues)

A FIRE 2020 Shared Task


Task motivation and description

Motivation

The application of information retrieval (IR) systems in dialogue-based interactive systems has increasingly drawn attention from the research community. IR systems can be used to retrieve relevant information, either for system-generated answers or to add more context about a particular topic during an interactive dialogue between two entities. In this task, we explore the utility of IR systems for retrieving more information about entities discussed in interactive dialogues.

Description

The objective of the proposed track, organized as a part of FIRE 2020, is to automatically contextualize two-party or multi-party dialogues. We explore the utility of IR systems to retrieve more information about entities discussed in interactive dialogues. To study the problem in a laboratory-based, reproducible setting, we propose a number of simplifications. First, since collecting multi-party chat dialogues can raise individual privacy concerns, we use movie scripts as the source of dialogues containing items requiring contextualization. Given a dialogue span from a movie script, the task requires participants to retrieve relevant information from Wikipedia. Concretely, participants must identify important or central entities in a span of movie dialogue and subsequently retrieve a list of passages that provide more context or information about the entities in that span. The annotated span of text (i.e. the one requiring contextualization) will not be disclosed to the participants; instead, they need to estimate the information need from the given conversation. As an example, given an excerpt from the script of the movie ‘12 Angry Men’, a simple approach would be to execute the whole script as a query and retrieve a ranked list of documents from a collection. Participants are encouraged to explore different methods to identify/formulate the query from the dialogue and then proceed with the retrieval step.

The participants are provided with a manually annotated sample of dialogue spans extracted from four movie scripts, along with the entire movie scripts. The collection from which passages are to be retrieved for contextualization is Wikipedia (a dump from 2019). Each document in the Wikipedia collection is composed of explicitly marked-up passages (in the form of paragraph tags). The retrievable units in our task are therefore the passages, rather than whole documents.

Guidelines

For the RCD track, we have chosen conversations from movie scripts that constitute situations requiring contextualization. These are long conversations involving one or more actors. One such example is the highlighted span of text (requiring contextualization) from the movie ‘12 Angry Men’, shown below.

  • NO2: guilty. I thought it was obvious. I mean nobody proved otherwise.
  • NO8: The burden of proof is on the prosecution. The defendant doesn’t have to open his mouth. That’s in the Constitution. The Fifth Amendment. You’ve heard of it.
  • NO2: I... what I meant... well, anyway, I think he was guilty.
In the above example conversation, the goal is to develop a system that identifies Fifth Amendment as the piece of text that may require contextualization, and retrieves a ranked list of Wikipedia passages corresponding to this concept.

Tasks

There are two separate tasks; each team can participate in either one or both of them. The first task is entity linking, whereas the second pertains to retrieving relevant information about the identified entities. The two tasks are as follows.
  • Task 1: Given an excerpt of a dialogue act as shown in the example above, output the span of text indicating a potential piece of information need (requiring contextualization); i.e., in the case of the above example, output the text Fifth Amendment.
  • Task 2: Given an excerpt of a dialogue act (see the example above), return a ranked list of passages containing information on the topic of the information need (requiring contextualization); i.e., with respect to the above example, return passages from Wikipedia that contain information on the Fifth Amendment.

Training and Test Phases
During the training phase, we will release two pieces of information, namely i) a conversation piece from a movie script, and ii) the span of text comprising the concept requiring contextualization. During the test phase, we will release only the conversation piece. Participants in Task-1 then need to find the relevant piece of text in a given conversation, e.g. find 'Fifth Amendment' in the text excerpt of the above example. Participants taking part only in Task-2 do not need to explicitly find 'Fifth Amendment' in the text; rather, they need to find documents (Wiki passages) from the given collection that provide information on this particular topic. Although the tasks are independent, we believe that scoring well on Task-1 will benefit the effectiveness of Task-2 as well. Since the given conversation context may be quite diverse in terms of topics, identifying a suitable topic could help construct a well-formed query that retrieves a more focused list of documents. An overly verbose query, on the other hand (a simple approach could be to use the entire conversation as the query, as sketched below), may not retrieve relevant documents corresponding to the information need ('Fifth Amendment' in the example).
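As an illustration of the verbose-query baseline mentioned above, the following minimal Python sketch ranks a toy list of passages with a plain BM25 scorer, using the entire dialogue excerpt as the query. The tokenizer, the BM25 parameters and the toy passages are our own assumptions and are not part of the task resources.

    # Naive baseline sketch: treat the whole conversation as a bag-of-words
    # query and rank candidate passages with a plain BM25 scorer.
    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def bm25_rank(query, passages, k1=1.2, b=0.75):
        """Return passage indices sorted by descending BM25 score."""
        docs = [tokenize(p) for p in passages]
        avgdl = sum(len(d) for d in docs) / len(docs)
        df = Counter(t for d in docs for t in set(d))      # document frequencies
        n = len(docs)
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for t in set(tokenize(query)):
                if t not in tf:
                    continue
                idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            scores.append(score)
        return sorted(range(n), key=lambda i: scores[i], reverse=True)

    # Toy usage: the whole dialogue excerpt acts as the (verbose) query.
    dialogue = ("The defendant doesn't have to open his mouth. "
                "That's in the Constitution. The Fifth Amendment.")
    toy_passages = [
        "The Fifth Amendment to the United States Constitution addresses criminal procedure.",
        "The giraffe is an African hoofed mammal, the tallest living terrestrial animal.",
    ]
    print(bm25_rank(dialogue, toy_passages))    # the constitutional passage should rank first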

Evaluation Metrics

  • Task 1: For Task-1 (information extraction), we will measure the overlap between the ground-truth text span (e.g. ‘fifth amendment’) and the predicted text span (e.g. ‘Constitution. The Fifth Amendment’) using the Jaccard coefficient. An exact match yields a Jaccard coefficient of 1, while false positives or false negatives penalize a predictive approach.
  • Task 2: For Task-2 (passage retrieval), we will use mean average precision (MAP) to compare and evaluate the different approaches. This metric favours systems that retrieve a larger number of relevant passages towards the top ranks. A minimal sketch of both measures appears after this list.
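To make the two measures concrete, here is a minimal Python sketch of a word-level Jaccard coefficient and of average precision (MAP being its mean over all topics). The tokenization and normalization are our own assumptions; the official evaluation scripts may treat casing and punctuation differently.

    # Hedged sketch of the two measures: word-level Jaccard for Task 1 and
    # average precision (averaged over topics to get MAP) for Task 2.

    def jaccard(predicted_span, gold_span):
        """Word-level Jaccard coefficient between two text spans."""
        pred = set(predicted_span.lower().split())
        gold = set(gold_span.lower().split())
        if not pred and not gold:
            return 1.0
        return len(pred & gold) / len(pred | gold)

    def average_precision(ranked_passage_ids, relevant_ids):
        """Average precision of a single ranked list."""
        hits, ap = 0, 0.0
        for rank, pid in enumerate(ranked_passage_ids, start=1):
            if pid in relevant_ids:
                hits += 1
                ap += hits / rank
        return ap / max(len(relevant_ids), 1)

    print(jaccard("Constitution. The Fifth Amendment", "fifth amendment"))   # 0.5 (partial overlap)
    print(average_precision(["10046153-34", "10275774-4"], {"10275774-4"}))  # 0.5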

Dataset

The movie scripts chosen to depict situations requiring contextualization consist of relatively long conversations involving one or more actors. For each movie script, we annotated spans of text that were manually assessed to be indicative of potential contextualization, as shown in the section above. We selected a number of play-style movie scripts for annotation, i.e. ones which involve long dialogue acts for plot development. In total, we annotated 4 movie scripts.

The text extracted from the movie scripts, along with the BRAT-formatted annotation files, can be found here.
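Since the annotations are distributed in the BRAT standoff format, the following minimal Python sketch reads the text-bound annotations from a .ann file. The label names and the file name are placeholders; they are not specified on this page.

    # Read text-bound annotations (lines starting with "T") from a BRAT
    # standoff .ann file: ID <TAB> Label start end <TAB> annotated text.

    def read_brat_ann(path):
        spans = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.startswith("T"):                # skip relations, notes, etc.
                    continue
                ann_id, type_and_offsets, text = line.rstrip("\n").split("\t", 2)
                # discontinuous spans ("start end;start end"): keep only the first fragment
                label, start, end = type_and_offsets.split(";")[0].split()
                spans.append({"id": ann_id, "label": label,
                              "start": int(start), "end": int(end), "text": text})
        return spans

    # for span in read_brat_ann("12_angry_men.ann"):        # hypothetical file name
    #     print(span["label"], span["text"])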

The dataset for this track comprises:

  • Document Collection: The objective of Task-2 in this track is to retrieve a ranked list of Wikipedia passages. We release a pre-processed version of a Wikipedia dump in which each paragraph of a Wikipedia page is enclosed in separate XML tags, and each tag is assigned a unique identifier. The basic retrieval unit for this ranking task is hence a Wikipedia passage. Note that it is up to you whether you treat the passages as units contained within a document or rank them independently; each passage has been (and will be) judged independently. There are two options for downloading the collection.
    • A TREC-formatted markup document consisting of the passage-demarcated Wikipedia collection, which can be downloaded from this Google drive link.
    • A simpler format (a tab-separated text file) containing the passage id in the first column and the passage content in the second, which can be downloaded from this Google drive link.
    A sample mavenized Lucene project to help the participants get started with indexing and retrieval can be found here.
  • Queries/Topics: A query for the ranking task contains a dialogue piece from a movie. The topic file is similar in structure to a standard TREC query file. Each query (topic) starts with a topic tag. The num tag assigns a unique identifier to the topic; you have to use these ids in the output results file (more on this later). Each topic has a desc tag which contains a list of dialogues enclosed within 'p' tags, where each 'p' tag indicates a change of speaker. Some topics have identical description fields because, while annotating, we identified multiple concepts in the same dialogue piece that may require contextualization. In such cases, the topic with the smaller number is associated with the concept that occurs earlier in the text than the concept corresponding to the larger number. In addition to the 'desc' field, the training topics contain an additional title field which gives the exact span of text representing the information need. We make this information available so that participants can get an idea of the types of text spans that typically require contextualization in dialogue streams.
  • Relevance Judgments: For the training set of topics, we release a TREC qrel formatted file with 4 whitespace-separated columns: the first column denotes the query id, the second is unused, the third is a string denoting the passage (basic retrieval unit) identifier, and the fourth indicates the relevance label (1/0 in our case). The training topics along with the relevance judgments can be downloaded from this link to training data. To obtain the relevance assessments, we constructed a pool of passages by combining retrieval results from a number of standard IR models, such as BM25, LM and DFR. Passages in the pool were assessed by the organizers with respect to each input dialogue. For the test topics, we will add the documents retrieved by the participating systems to the pool and assess the extended pool. A minimal sketch for loading the tab-separated collection and the qrels appears after this list.
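The following minimal Python sketch loads two of the released resources, assuming exactly the formats described above (the two-column tab-separated collection and the four-column qrels file). The file names are placeholders.

    # Hedged loading sketch for the simplified TSV collection and the qrels.
    from collections import defaultdict

    def load_passages(tsv_path):
        """Map passage id -> passage text from the two-column TSV collection."""
        passages = {}
        with open(tsv_path, encoding="utf-8") as f:
            for line in f:
                pid, _, text = line.rstrip("\n").partition("\t")
                passages[pid] = text
        return passages

    def load_qrels(qrel_path):
        """Map query id -> set of relevant passage ids (label 1)."""
        relevant = defaultdict(set)
        with open(qrel_path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 4:                  # skip malformed or blank lines
                    continue
                qid, _unused, pid, label = parts
                if label == "1":
                    relevant[qid].add(pid)
        return relevant

    # passages = load_passages("wiki_passages.tsv")      # hypothetical file name
    # qrels = load_qrels("rcd2020_training.qrel")        # hypothetical file name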

Run Submission Format

For each query in the test topic file, the participants need to submit automatically generated outputs for one or both of the tasks.
  • Task-1: Participants need to submit a two column file where each line contains the query id and the predicted text span from the given description. An example task-1 submission file looks like
        1 [\t] That's in the Constitution. The Fifth Amendment.
        4 [\t] cute little switchknife
        ...
        ...
        
  • Task-2: Participants need to submit a standard TREC .res formatted file comprising the following columns: the first column denotes the query id (matching the ids of the test topics), the second is unused, the third is the retrieved document (Wiki passage) identifier, the fourth is the rank of this document (passage), the fifth is the similarity score, and the sixth column denotes a run name used to distinguish between different runs (the run name should be meaningful and representative of the method used to generate the run). A minimal sketch for writing both submission files is given after these examples. An example Task-2 submission file looks like
        1	Q0	10046153-34	1 13.23 BM25_termselection
        1	Q0	10275774-4	2 12.58 BM25_termselection
        ...
        ...
        2	Q0	5202223-19	1 7.64 BM25_termselection
        2	Q0	527390-11	2 7.37 BM25_termselection
        ...
        ...
        
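To avoid formatting mistakes, here is a hedged Python sketch that writes submission files in the two layouts described above. The function names, file names and example data are our own placeholders; only the column order follows the track description.

    # Hedged sketch: write Task-1 and Task-2 submission files in the layouts
    # described above. Names and example data are placeholders.

    def write_task1_run(predictions, path):
        """predictions: dict mapping query id -> predicted text span."""
        with open(path, "w", encoding="utf-8") as out:
            for qid, span in sorted(predictions.items()):
                out.write(f"{qid}\t{span}\n")

    def write_task2_run(ranked_lists, run_name, path):
        """ranked_lists: dict mapping query id -> list of (passage_id, score)
        pairs, already sorted by descending score."""
        with open(path, "w", encoding="utf-8") as out:
            for qid, ranking in sorted(ranked_lists.items()):
                for rank, (pid, score) in enumerate(ranking, start=1):
                    out.write(f"{qid}\tQ0\t{pid}\t{rank}\t{score:.4f}\t{run_name}\n")

    # write_task1_run({1: "That's in the Constitution. The Fifth Amendment."}, "task1.txt")
    # write_task2_run({1: [("10046153-34", 13.23), ("10275774-4", 12.58)]},
    #                 "BM25_termselection", "task2.res")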
Note that, since the retrievable units for our track are Wikipedia passages (not documents), we have assigned unique identifiers to the passages in the provided collection. The naming convention for a passage is doc number-passage offset, i.e. two integers separated by a hyphen character (encoded within the pno tags in the provided collection). Task-2 participants are required to print these identifiers in the third column of the run submission file; using other, arbitrary identifiers would prevent us from matching the retrieved passages against those in the qrel file. A sample XML excerpt for a Wikipedia document, titled Anarchism, is shown below.




For submission, you need to send your runs to rcd2020firetask@gmail.com.

Organizers

Important Dates

  • Training Data Release (Download link)- 16th July, 2020  ✓
  • Test Data Release (Download link) - 16th July, 2020  ✓
  • Run Submission Deadline - 5th September, 2020  ✓
  • Results Declaration - 15th September, 2020  ✓
  • Working Note Submission - 8th October, 2020
  • Review Notifications - 25th October, 2020
  • Final Version of Working Note - 5th November, 2020

Results

Results for Task 1:

We received runs from a single participating team, the ADAPT Centre (DCU). The team submitted 5 runs for each of Tasks 1 and 2. The following are the results for Task 1, measured in terms of weighted BLEU scores. More concretely, a variant of the BLEU score is computed to measure the overlap between the predicted and the ground-truth information need nugget arising from every conversational context. Instead of the word n-grams used in machine translation BLEU scores, we use weighted character n-grams (n = 3, 4, 5), with more weight put on higher values of n.



Team Name   Run Name        Weighted BLEU Score   Word-level Jaccard Overlap
ADAPT       F6_4_model2     0.0636                0.0487
ADAPT       F7_4_model3     0.0989                0.0727
ADAPT       F5_0_model1     0.0984                0.0487
ADAPT       Y7_2_model5_2   0.0020                0.0000
ADAPT       F8_4_model4     0.1090                0.0636

Results for Task 2:

Team Name   Run Name        MAP      P@5      MRR
ADAPT       F5_0_Model1     0.0021   0.0400   0.0922
ADAPT       Y7_2_Model5_2   0.0001   0.0160   0.0023
ADAPT       F7_4_model3     0.0016   0.0417   0.1086
ADAPT       F6_4_model2     0.0013   0.0160   0.0518
ADAPT       F8_4_model4     0.0003   0.0400   0.0704

The ground truth for Task-1 on the test-set topics (comprising the information needs requiring Wikification) can be found in the following TREC topic formatted file. The desc field contains a piece of conversation (from a movie), whereas the title field contains the annotated ground-truth text forming the unit of information need.

While evaluating the results, we computed the overlap for each individual conversational piece (and, importantly, not for each individual information need unit). For example, the following conversational piece contains two information units, namely acacia tree and Ming Mecca chip. The reference string for this piece is taken to be the concatenation of the two, and the predictions for the two corresponding topics are likewise concatenated before computing the overlap with the ground truth.

Mr. Cohen? Mr. Cohen? Please stop for a second Mr. Cohen? Damn it already! Stop following me. I'm not interested in your money. I'm searching for a way to understand our world. I'm searching for perfection. I don't deal with mediocre materialistic people like you! ... Take the acacia tree...in East Africa. It is the most prevalent plant in all of Kenya because it has managed to secure its niche by defeating its major predator, the giraffe. To accomplish this, the tree has made a contract with a highly specialized red ant. The tree has evolved giant spores which act as housing for the ants In return for shelter, the ants supply defense. When a giraffe starts to eat the tree's leaves, the shaking branch acts like an alarm. The ants charge out and secrete an acid onto the giraffe's tongue. The giraffe learns its lesson and never returns. Without each other, the tree would be picked .... Just silicon. A Ming Mecca chip.

To reproduce the results for Task 1, simply use the script. Go through the GitHub README to set up the environment. The script only requires the folder of Task-1 runs to generate the BLEU score.

For task 1, the script uses two pieces of information - a) the annotated text for each topic (see here) and b) the equivalence relation of the topics (see here) - to compute the overlap.
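For intuition about the metric (this is not the official script), the sketch below computes a weighted character n-gram precision in the spirit of the BLEU variant described above, with n = 3, 4, 5 and larger n weighted more heavily. The exact weights and smoothing used by the organizers' script are assumptions here.

    # Illustrative weighted character n-gram overlap; weights and smoothing
    # are assumptions, not the official evaluation settings.
    import math
    from collections import Counter

    def char_ngrams(text, n):
        text = " ".join(text.lower().split())
        return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 0)))

    def weighted_char_bleu(prediction, reference, orders=(3, 4, 5), weights=(0.2, 0.3, 0.5)):
        log_score = 0.0
        for n, w in zip(orders, weights):
            pred, ref = char_ngrams(prediction, n), char_ngrams(reference, n)
            overlap = sum((pred & ref).values())                 # clipped n-gram matches
            precision = (overlap + 1e-9) / (sum(pred.values()) + 1e-9)
            log_score += w * math.log(precision)
        return math.exp(log_score)

    print(weighted_char_bleu("Constitution. The Fifth Amendment", "fifth amendment"))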

For evaluating Task-2 runs, please run this script, which requires the qrels and the retrieval runs (formatted as TREC runs) as its two inputs. The relevance judgments, i.e. the qrels, are available here.

Contact Us

Please reach out to the organizers with any questions. You can also email rcd2020firetask@gmail.com.