COLD: Causal reasOning in cLosed Daily activities

Indian Institute of Technology Kanpur (IIT Kanpur)
COLD Thumbnail

Picture: The proposed COLD framework bridges the gap between open-ended causal reasoning and symbolic representation-based question answering. The human-written Event Sequence Descriptions (ESDs) are obtained from crowdsource workers and consist of telegram-style sequences of events performed in an activity. The Observational Graph \(\mathcal{G}_o\) and the Causal Graph \(\mathcal{G}_c\) for an activity are used to create causal query triplets (details in Algorithm 1), shown towards the right. Using counterfactual reasoning, “going to the kitchen” is possible without going to the market (if the ingredients are already available), making “come home with the ingredients” the more plausible effect among the given choices. Similarly, in the second example, the event “going to the market” has no direct relation with the event “heating the oven”.

Abstract: Large Language Models (LLMs) have shown state-of-the-art performance in a variety of tasks, including arithmetic and reasoning. To gauge the intellectual capabilities of LLMs, causal reasoning has become a reliable proxy for validating a human-like general understanding of the mechanics and intricacies of the world. Previous works in natural language processing (NLP) have either focused on open-ended causal reasoning via causal commonsense reasoning (CCR) or framed symbolic representation-based question answering for theoretically grounded analysis via a causal inference engine. The former has the advantage of real-world grounding but lacks theoretically backed-up analysis/validation, whereas the latter is far removed from real-world grounding. In this work, we bridge this gap by proposing the COLD (Causal reasOning in cLosed Daily activities) framework, which is built upon human understanding of daily real-world activities to reason about the causal nature of events. We show that the proposed framework facilitates the creation of an enormous number of causal queries (∼9 million) and comes close to a mini-Turing test, simulating causal reasoning to evaluate the understanding of a daily real-world task. We evaluate multiple LLMs on the created causal queries and find that causal reasoning is challenging even for activities that are trivial to humans. We further explore the causal reasoning abilities of LLMs using the backdoor criterion to determine the causal strength between events.

Causal Reasoning in the Prior Art

The table below gives a broad overview of existing causal datasets/benchmarks in the NLP community. Most existing work relies on real-world events to reason about causality, where human annotators are asked to judge the causal nature of events. However, most of these datasets/benchmarks try to establish causal connections via a simple question prompt, which may not be enough to construct the underlying causal graph. Moreover, most real-world-grounded methods remain open-ended because the events take place in the wild, making it difficult to construct a causal graph in which multiple variables play a role. More recently, with increased research attention on the causal reasoning abilities of LLMs, researchers have framed causal queries based on a causal inference engine, which requires the underlying causal graphs. However, when such causal queries are posed to LLMs, natural language is used to verbalize causal concepts as symbolic variables that may not have real-world grounding. Moreover, the resulting causal queries are difficult for a human with little or no knowledge of causal inference concepts. The comparison table below details all these features; COLD satisfies all of them.

COLD comparison_table

Comparison of causal experimental settings used in prior LLM evaluation benchmarks. Real-world grounding plays a crucial role in evaluating LLMs and is absent from the symbolic benchmarks.

Building a Causal Reasoning Framework

COLD causal_thumbnail_small

Figure: \( \mathbf{U} \) denotes the unobserved variables confounding all events present in a real-world activity. In an activity, some events cause other events to happen. For example, in “traveling by an airplane”, the event “checking in luggage” causes events like “taking back the luggage.”

The events in CCR (causal commonsense reasoning) generally refer to actions taking place in a real-world activity. For example, consider the activity of “traveling by an airplane” shown in the figure above, where the occurrence of all the events is confounded by a universal variable \(U\) (“intention to perform a task”). Moreover, a few events cause one another. For example, the event “checking in luggage” (\(E_1\)) causes the occurrence of events like “waiting at the luggage belt” (\(E_2\)) after the flight, i.e., in an alternate universe where one does not check in luggage and travels only with cabin bags, one would never wait for the luggage after the flight has landed. In contrast, some events have no causal relationship; for example, “finding the boarding gate” (\(E_3\)) has no causal relationship with “checking in luggage” (\(E_1\)). The difference between the two cases can be captured via interventions:

\(\Delta_{(E_1 \rightarrow E_2)} = \mathbb{P}(E_2|do(E_1)) - \mathbb{P}(E_2|do(\neg E_1))\)

\(\Delta_{(E_1 \rightarrow E_3)} = \mathbb{P}(E_3|do(E_1)) - \mathbb{P}(E_3|do(\neg E_1))\)

where \(do(.)\) denotes the do operator [Pearl, 2012] showing the intervention on \(E_1\), and \(\Delta\) is the causal estimand capturing the causal strength between two events, i.e., \(\Delta (E_1 \rightarrow E_2)\) is expected to be higher than \(\Delta (E_1 \rightarrow E_3)\). Note that CCR excludes causal events that are beyond the reach of commonsense knowledge, for example: does “planting trees” have a direct impact on the “rainy season”? Does “providing free education” improve the “economic condition of the country/state”? Does “carpooling” directly impact “air pollution”? etc.
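To make these estimands concrete, below is a minimal Monte-Carlo sketch in Python (not taken from the paper; the toy structural causal model and all probabilities are illustrative assumptions) that simulates the do-operator on the airplane example and estimates both \(\Delta\) values:

import random

# Toy structural causal model (illustrative assumptions only):
# U  = intention to travel (unobserved confounder of all events)
# E1 = "checking in luggage", E2 = "waiting at the luggage belt",
# E3 = "finding the boarding gate"
def sample_activity(do_e1=None, rng=random):
    u = rng.random() < 0.9                      # intention to perform the activity
    e1 = (rng.random() < 0.7) if u else False   # check in luggage
    if do_e1 is not None:                       # intervention: cut the U -> E1 edge
        e1 = do_e1
    e2 = e1 and (rng.random() < 0.95)           # wait at belt only if luggage was checked in
    e3 = u and (rng.random() < 0.98)            # find boarding gate depends on U, not on E1
    return e1, e2, e3

def interventional_prob(event_index, do_e1, n=100_000):
    """Monte-Carlo estimate of P(event | do(E1 = do_e1))."""
    hits = sum(sample_activity(do_e1=do_e1)[event_index] for _ in range(n))
    return hits / n

# Causal estimands Delta(E1 -> E2) and Delta(E1 -> E3)
delta_12 = interventional_prob(1, True) - interventional_prob(1, False)
delta_13 = interventional_prob(2, True) - interventional_prob(2, False)
print(f"Delta(E1 -> E2) ~= {delta_12:.3f}")   # large: E1 causes E2
print(f"Delta(E1 -> E3) ~= {delta_13:.3f}")   # near zero: no causal effect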

A noteworthy point concerning causality is that although the logical temporal (or prototypical) order of these events provides a weak signal about causal relationships, temporal precedence does not always imply causation. For example, one could erroneously argue that “boarding a plane” is also a cause of “waiting at the luggage belt,” since without “boarding a plane” one cannot wait at the luggage belt.

For building a causal reasoning framework (based on CCR) around real-life daily activities, one would require a few primary features to be readily available: events that are grounded in the real world and commonly understood by humans, a closed set of events per activity so that the underlying causal graph can be constructed, and a large pool of human-written descriptions of the activity to serve as observational data.

We found that "Scripts" (Schank, 1975; Schank and Abelson, 1975) provide a concrete medium that satisfies all these requirements. Scripts are defined as sequences of events describing a prototypical activity, such as "going to a restaurant", and hence capture commonsense knowledge about the world.

COLD causal_thumbnail

Figure: Left: the closed nature of daily real-world activities (capturing commonsense commonly understood by humans), which start and end given the context of the task, i.e., the pre-activity and post-activity worlds marginalize out the dependence of events occurring during the activity on the rest of the world. Right: Causal Graph \(\mathcal{G}_c\) for “going grocery shopping.” Notice that the colliders (red nodes) make the sets of nodes highlighted in different colors unconditionally independent in the causal graph. In contrast, when conditioning on a collider (“put bags in cart”), the two clusters (yellow and blue) become dependent (if the collider is observed, both the yellow and blue clusters may have been observed as well).
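As a small illustration of the collider behaviour described in the caption, the following Python sketch checks d-separation via the standard ancestral moral-graph construction (implemented directly to keep dependencies minimal); the tiny graph and event names are assumptions for illustration, not the actual \(\mathcal{G}_c\) used in COLD:

import networkx as nx

def d_separated(dag, xs, ys, zs):
    """Check whether node sets xs and ys are d-separated given zs in a DAG,
    using the ancestral moral graph construction."""
    relevant = set(xs) | set(ys) | set(zs)
    # 1. Keep only the relevant nodes and their ancestors.
    ancestral = set(relevant)
    for n in relevant:
        ancestral |= nx.ancestors(dag, n)
    sub = dag.subgraph(ancestral)
    # 2. Moralize: connect parents of a common child, then drop edge directions.
    moral = nx.Graph(sub.to_undirected())
    for child in sub.nodes:
        parents = list(sub.predecessors(child))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                moral.add_edge(parents[i], parents[j])
    # 3. Remove the conditioning set; d-separated iff no remaining path.
    moral.remove_nodes_from(zs)
    return not any(nx.has_path(moral, x, y) for x in xs for y in ys
                   if x in moral and y in moral)

# Illustrative collider: "pick groceries" -> "put bags in cart" <- "bring the cart"
g = nx.DiGraph([("pick groceries", "put bags in cart"),
                ("bring the cart", "put bags in cart")])

print(d_separated(g, {"pick groceries"}, {"bring the cart"}, set()))                  # True: unconditionally independent
print(d_separated(g, {"pick groceries"}, {"bring the cart"}, {"put bags in cart"}))   # False: conditioning on the collider couples them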

Given the nature of script knowledge (satisfying the criteria listed above), we use a script corpus called DeScript [Wanzare et al., 2016] for creating the observational graphs. DeScript is a crowd-sourced corpus with telegram-style sequential descriptions of activities in English (e.g., baking a cake, taking a bath, etc.). For a given activity, crowd-workers write a point-wise, sequential short description of the various events involved in executing the activity (each complete description is called an Event Sequence Description, or ESD). DeScript covers a set of 40 daily activities (100 ESDs each) varying in complexity and background knowledge. Additionally, for a given activity, semantically similar events from different ESDs are manually aligned by human annotators, and these alignments induce a DAG over events. In our work, we use these DAGs as the observational distribution of an activity (\(\mathcal{G}_o^{(a)}\), where \(a \in \mathcal{A}\) and \(\mathcal{A}\) is the set of all activities). These DAGs provide a medium for generating an enormous number of trajectories (ranging from \(1.6\times10^{16}\) to \(1.3\times10^{27}\); also see the tables below) that come directly from human annotations (the alignments as well as the ESDs), giving us a proxy for the human understanding of daily activities.
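Trajectory counts of this scale can be computed in closed form by dynamic programming over a topological order of \(\mathcal{G}_o\), rather than by enumerating paths. Below is a minimal sketch; the toy graph and event names are illustrative assumptions, not an actual DeScript graph:

import networkx as nx

def count_trajectories(dag, start, end):
    """Count distinct start -> end paths in a DAG by dynamic programming
    over a topological order (no explicit path enumeration)."""
    paths = {node: 0 for node in dag.nodes}
    paths[start] = 1
    for node in nx.topological_sort(dag):
        for succ in dag.successors(node):
            paths[succ] += paths[node]
    return paths[end]

# Toy observational graph for "baking a cake" (assumed events and edges).
g = nx.DiGraph([
    ("start", "gather ingredients"), ("start", "preheat oven"),
    ("gather ingredients", "mix batter"), ("preheat oven", "mix batter"),
    ("mix batter", "pour into pan"), ("pour into pan", "bake"),
    ("bake", "end"),
])
print(count_trajectories(g, "start", "end"))  # 2 trajectories in this toy graph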

COLD causal_query_triplets_insights

The table provides details of the observational graphs (\(\mathcal{G}_o\)) for 5 activities. The Causal Query Triplets column gives the total number of triplets generated via Algorithm 1 (see paper). The instance version column gives the number of samples in the instance version of the created dataset (which includes different text instances describing the same event). The table below shows a small sample taken from these 5 activities. Overall, the huge number of samples highlights the exhaustive nature of the evaluation that can be done for LLMs.

COLD causal_query_triplets_examples

The table shows examples of causal query triplets created using the causal graphs (\(\mathcal{G}_c\)) and observational graphs (\(\mathcal{G}_o\)) via Algorithm 1. The top row is taken from the COPA dataset [Gordon et al., 2012] for comparison. Note that the examples in the table are samples taken from the instance version.
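For intuition, here is a simplified sketch of how a causal query triplet can be assembled from a causal graph. It approximates, but is not, the paper's Algorithm 1: each premise event is paired with one of its descendants (the plausible effect) and with a distractor event that has no directed causal path to or from the premise (a simplification of the d-separation check). The toy graph and event names are assumptions:

import random
import networkx as nx

def make_causal_triplets(causal_graph, rng=random):
    """Illustrative triplet construction (an approximation of Algorithm 1):
    for each premise event, pair a true effect (a descendant in the causal
    graph) with a distractor that has no causal path to or from the premise."""
    triplets = []
    nodes = list(causal_graph.nodes)
    for premise in nodes:
        effects = nx.descendants(causal_graph, premise)
        non_causal = [n for n in nodes
                      if n != premise
                      and n not in effects
                      and premise not in nx.descendants(causal_graph, n)]
        for effect in effects:
            if non_causal:
                distractor = rng.choice(non_causal)
                triplets.append({"premise": premise,
                                 "choice_plausible": effect,
                                 "choice_distractor": distractor})
    return triplets

# Toy causal graph for "going grocery shopping" (assumed events and edges).
gc = nx.DiGraph([("make a shopping list", "pick groceries"),
                 ("pick groceries", "pay at the counter"),
                 ("bring the cart", "put bags in cart")])
for t in make_causal_triplets(gc)[:3]:
    print(t)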

Limitations and Future Directions: One of the primary limitations of our work is the limited set of activities. Though the framework supports generating exhaustive/enormous causal queries, finding general commonsense activities/tasks that are well understood by humans remains challenging. Moreover, the effort of creating a causal graph for an activity increases as we move toward longer-horizon tasks. However, as a general test of causal intelligence, our framework provides a suitable platform to validate reasoning capabilities more rigorously. In the future, it would be interesting to sample trajectories from the observational distribution to create a training dataset and check whether causal reasoning ability can be acquired via language modeling objectives (including other variants like those presented in Lampinen et al. [2023]). We leave this detailed analysis for future endeavors. The proposed algorithm for causal triplet generation produces the simplest variant of causal queries in the form of causal triplets (also referred to as the Pairwise Causal Discovery (PCD) task by Chen et al. [2024]). More complicated causal queries can be generated, such as cases with common confounders, long/short causal chain dependencies, etc. Moreover, causal queries with formal definitions (i.e., using the formal language of causal inference), inspired by Jin et al. [2023, 2024], can be framed for a more rigorous analysis. Being at an initial stage, we stick to simple causal queries that provide two choices, where the task is to choose the more plausible cause. The creation of underlying causal graphs provides endless possibilities for creating varied versions of causal queries. In this work, we only consider an unconditional version of d-separation. In the future, the same causal graphs could be used to define more datasets covering other rungs of the ‘causal ladder’ [Pearl and Mackenzie, 2018].

We hope the created framework will be useful for future research in causal reasoning and will provide a platform for more detailed discussions on relating the causal nature of real-world activities to symbolic representation-based approaches.

Authors' Note: This work raises many questions that still need to be answered. While aiming to define a system that combines real-world hypothetical events (that are part of commonsense knowledge) with causal graphical models, it becomes immensely difficult to reconcile the various points of view that prior works have taken for linking causes with effects and to arrive at a single solution. In this work, we have only scratched the surface and provided a partial viewpoint of such a framework; its fundamental grounding is still left to be explored in the future. We were fortunate to show that such a framework can help cast light on the understanding of language models via different forms of validation (prompt-based, temporal-based, backdoor-criterion-based, etc.), and it is interesting to find that creating a closed system opens up new ways of thinking about causality and linking it to real-world events.

BibTeX


@inproceedings{joshi2024cold,
  title={{COLD}: Causal reasOning in cLosed Daily activities},
  author={Abhinav Joshi and Areeb Ahmad and Ashutosh Modi},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=7Mo1NOosNT}
}