Each year, more than $200 billion is invested in life-saving policies and programs globally. Ensuring these resources deliver maximum impact depends on decision-makers having timely access to high-quality, relevant and actionable evidence. Yet too often, this evidence remains scattered, hard to interpret, or unavailable when it is most needed. At 3ie, we are investing in the power of generative artificial intelligence (Gen AI) to bridge this critical gap.
Our forthcoming Gen AI-based assistant, ‘DevChat’, will use the world’s largest repository of human-vetted impact evaluations and systematic reviews—3ie’s Development Evidence Portal (DEP)—to deliver real-time, multilingual, and tailored evidence summaries to decision-makers. By leveraging Gen AI tools, along with our unique evidence repository of over 18,000 impact evaluation studies and 1,500 systematic reviews, we hope to ensure that every dollar spent drives meaningful, measurable change in people’s lives.
In this blog, we share lessons from building and testing DevChat and reflect on a key question: how can we build Gen AI tools that decision-makers truly trust and find useful, so that they can turn evidence into policies and actions that improve lives?
From summaries to evidence-informed decisions: Why design matters
Gen AI, along with the many new low-code tools that enable the rapid development of chatbots, has driven widespread adoption of these technologies, and for good reason: a good chatbot can skim and summarize thousands of papers in minutes, sparing users weeks of reading. But when the inquiry is, “What policies are most effective in increasing women’s labor force participation?”, the challenge for the chatbot isn’t speed; it’s judgment. Which studies does the bot elevate, and why? What definition of “effective” is it using: hours worked, earnings, sustained participation? How does it handle disagreement—say, an intervention that boosts women’s labor force participation in one setting, but not another? An ideal, responsible evidence assistant shouldn’t declare a single winner. Instead, it should present a transparent set of studies along with the criteria used to select them (such as relevance, rigor, recency, or representativeness), show where findings diverge, and explain the conditions under which an intervention seems to be effective. Without this, we risk creating tools that generate factual inaccuracies (reading numbers out of context, misquoting findings) or interpretive inaccuracies (oversimplifying nuanced results), providing answers that may be technically correct but only loosely relevant to the user’s prompt, and leaving decision-makers with noise instead of clarity.
These limitations, however, don’t imply that Gen AI isn’t useful. They highlight why careful design, subject matter expertise, human oversight, and expectation setting are essential pieces of the “Gen AI for social good” puzzle.
Lessons learned
DevChat aims to provide quick, bite-sized insights from DEP’s evidence base to inform pressing policy issues. At the same time, interpreting evidence will still require users to understand their context and critically assess how findings apply. Based on our development and testing of DevChat, as well as observations of other evidence-support chatbots, several critical lessons stand out.
- Accuracy is improving; clear back-end instructions are key for consistent, relevant responses. An early challenge with Gen AI technology was the risk of “hallucinations” (i.e., when a chatbot fabricates information that is not present in the source). However, by using a RAG (Retrieval-Augmented Generation) pipeline, developers have been able to dramatically reduce this risk and limit a chatbot’s responses to the context it is given. In 3ie’s case, DevChat draws its responses from the human-curated DEP, ensuring that answers are grounded in verified evidence. Even so, a chatbot can display other types of hallucinations without fabricating information. For example, a chatbot may overestimate the effectiveness of an intervention, drawing conclusions from a single study without considering context or limitations. Another type of hallucination is irrelevance, where information does not directly address the user’s query. To help mitigate these risks, DevChat draws from systematic reviews and is equipped with clear back-end guidance, allowing it to transparently indicate when information is missing or only partially relevant. For example: “I was unable to find impact evaluations or systematic reviews conducted in South Asia, but the studies below provide relevant insights on this topic in other settings.” This approach helps reduce different forms of hallucinations, supporting transparent evidence summaries to inform decision-making (a simple sketch of this retrieval-plus-instructions setup appears after this list).
- Tradeoffs between comprehensiveness and usability are unavoidable. One way to make answers more comprehensive is to require the chatbot to list every relevant study. But we found that this slows responses, clutters communication, and risks overwhelming users with information. Concise answers, meanwhile, may be faster and more user-friendly, but they may omit studies. This reflects a deeper design question: what kind of response is most useful to a policymaker in practice – a quick entry point for further interaction, or an extensive research report? Gen AI models are also probabilistic, so some variability in responses is built into the design. That variability can be a strength if users want a chatbot that is conversational, but a weakness if they expect identical, structured outputs each time.
- Academic writing is not standardized. Most rigorous, high-quality evidence is found in academic publications. However, academic literature spans diverse subjects (from public health and social work to economics and political science), each with its own vocabulary, writing style, and structure. Because of this lack of standardization, chatbots struggle to consistently extract the most salient insights from research studies. To some extent, this challenge can be remedied with pre-processing of documents, back-end instructions, and targeted prompting. But clear caveats should also be provided so that users understand these structural limitations and treat the chatbot’s output as a guide or starting point, to be critically examined and repurposed using their own best judgment.
- Testing is an integral part of development, and nuanced benchmarking is needed. Because Gen AI technology is nascent and rapidly evolving, there are no established gold standards for evaluating chatbot credibility. At 3ie, we developed a testing framework to stress-test DevChat and other chatbots on a few essential parameters: we check whether responses actually address the question accurately, whether facts line up with what is cited, whether interpretations are reasonable (with no causal leaps or overconfident claims), and how the chatbot handles uncertainty or gaps in evidence, including when studies disagree or when the backend database doesn’t have the needed detail (specifically for DevChat). Our aim was to ensure transparent, consistent judgment: that means being explicit about what each testing criterion means and ensuring that all testers share a common understanding before evaluating responses. This is critical because the interpretation of the criteria and chatbot responses is prone to subjectivity. Without this alignment, results risk reflecting individual interpretation rather than the chatbot’s actual performance (one illustrative way to make such criteria explicit is sketched further below).
- Interactivity is great but resource-intensive. In general, Gen AI has an excellent grasp of everyday language, making interaction with a chatbot approachable and easy. However, making a chatbot truly interactive is technically and financially challenging. This is one reason many chatbots lack satisfactory memory capabilities, i.e., the ability to “remember” and use context from previous interactions in responses. Storing and retrieving past interactions requires additional tokens and infrastructure, which is expensive. Given that this is a useful function for a conversational chatbot, we continue to scan the horizon not only for additional funding, but also for easy-to-use open-source alternatives.
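To make the retrieval-plus-instructions idea from the first lesson above more concrete, here is a minimal, generic sketch of how a RAG pipeline can pair retrieved evidence with back-end instructions that tell the model to acknowledge gaps. It is not DevChat’s actual code; all names (Study, search_dep, call_llm) are hypothetical placeholders standing in for a real retrieval index and language-model call.

```python
# Minimal, generic sketch of a retrieval-augmented generation (RAG) flow.
# All names (Study, search_dep, call_llm) are hypothetical placeholders,
# not DevChat's actual implementation.

from dataclasses import dataclass


@dataclass
class Study:
    title: str
    region: str
    summary: str


# Back-end (system) instructions: keep answers inside the retrieved evidence
# and state explicitly when the evidence is missing or only partially relevant.
SYSTEM_INSTRUCTIONS = """\
Answer only from the studies provided below and cite each study you draw on.
If no study matches the user's region or topic, say so explicitly and offer
the closest available evidence instead, for example:
"I was unable to find impact evaluations or systematic reviews conducted in
South Asia, but the studies below provide relevant insights in other settings."
Do not add findings that are not in the studies provided.
"""


def search_dep(query: str) -> list[Study]:
    """Placeholder for retrieval against a curated evidence repository."""
    return [
        Study(
            title="Cash transfers and women's labor force participation",
            region="East Africa",
            summary="Mixed effects on hours worked; positive effects on earnings.",
        )
    ]


def build_prompt(query: str, studies: list[Study]) -> str:
    """Combine back-end instructions, retrieved evidence, and the user's question."""
    evidence = "\n".join(f"- {s.title} ({s.region}): {s.summary}" for s in studies)
    return f"{SYSTEM_INSTRUCTIONS}\nEvidence:\n{evidence}\n\nQuestion: {query}"


def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever language model the chatbot uses."""
    return "(model response, grounded only in the evidence above)"


if __name__ == "__main__":
    question = "What works to increase women's labor force participation in South Asia?"
    retrieved = search_dep(question)
    print(call_llm(build_prompt(question, retrieved)))
```

In a production system the placeholder functions would be replaced by a real search index over the evidence repository and a call to a hosted language model; the essential point is that the instructions travel with every query, so gaps in the evidence are surfaced rather than papered over.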
Click here for more details on the recommendations and learnings from testing these chatbots across key criteria.
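For readers building their own testing framework, the sketch below illustrates one way to make evaluation criteria explicit and comparable across testers. The criteria paraphrase those described above; the names and the 1–5 scoring scale are illustrative assumptions, not 3ie’s actual framework.

```python
# Illustrative sketch of an explicit scoring rubric for chatbot responses.
# The criteria paraphrase those described in this post; the names and the
# 1-5 scale are assumptions for illustration, not 3ie's actual framework.

from dataclasses import dataclass, field

CRITERIA = {
    "relevance": "Does the response actually address the question asked?",
    "factual_accuracy": "Do stated facts and numbers match the cited studies?",
    "interpretation": "Are conclusions reasonable, with no causal leaps or overconfidence?",
    "uncertainty_handling": "Are gaps, disagreements, and missing evidence acknowledged?",
}


@dataclass
class ResponseRating:
    """One tester's 1-5 scores for a single chatbot response."""
    tester: str
    scores: dict[str, int] = field(default_factory=dict)


def mean_scores(ratings: list[ResponseRating]) -> dict[str, float]:
    """Average each criterion across testers to see where judgments diverge."""
    totals = {name: 0.0 for name in CRITERIA}
    for rating in ratings:
        for name in CRITERIA:
            totals[name] += rating.scores.get(name, 0)
    return {name: total / len(ratings) for name, total in totals.items()}


if __name__ == "__main__":
    ratings = [
        ResponseRating("tester_a", {"relevance": 5, "factual_accuracy": 4,
                                    "interpretation": 4, "uncertainty_handling": 3}),
        ResponseRating("tester_b", {"relevance": 4, "factual_accuracy": 4,
                                    "interpretation": 3, "uncertainty_handling": 3}),
    ]
    for criterion, score in mean_scores(ratings).items():
        print(f"{criterion}: {score:.1f}  ({CRITERIA[criterion]})")
```

In practice, a rubric like this would sit alongside written definitions and worked examples for each criterion, so that testers calibrate their scoring against a shared understanding before evaluating any responses.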
Training, testing and collaboration
DevChat is a work in progress, as is the technology it relies on. For development practitioners pursuing a similar goal – of building a chatbot to streamline evidence access and uptake – we offer the following reflections and resources:
- There is a strong need for clear guidance and training on the strengths and limitations of Gen AI, and 3ie’s Foundations of AI training offers an accessible launchpad if your organization is starting out in this arena.
- Establishing a testing framework can help guide more systematic evaluation; regularly testing DevChat throughout its development and comparing it with other chatbots has been critical for deepening our understanding of the technology (if you’d like to reference our testing criteria or require guidance in building your own, please reach out to 3ie’s Data Innovations Group).
The broader field of Gen AI for development practice also still has significant room for growth. With DevChat, we’re taking a step toward making credible research more accessible when it matters most. If you want to join us in shaping how Gen AI is used for global development, we invite you to ask questions, provide suggestions, and collaborate with us! To join the list of early testers of DevChat, please send us an email at dig[at]3ieimpact[dot]org.