Shafaqna English - As generative AI systems become embedded in agricultural advisory services, experts argue that evaluating them solely on technical accuracy is insufficient. To work effectively in farmers’ fields, these tools must also be assessed for usability, trust, inclusivity and governance.
Agricultural advisory platforms powered by large language models (LLMs) are increasingly used to deliver guidance on pest management, crop practices and market prices. But developers face major challenges: systems must function across diverse languages, literacy levels and local farming conditions.
Traditional benchmarking of LLMs focuses on technical metrics such as accuracy, reasoning and factual correctness. While important, these indicators capture only part of what determines success in real-world agricultural settings. Farmers must be able to understand, access and trust the information provided—factors often overlooked in model-centric evaluations.
During a November 6, 2025, webinar hosted by the AGX AI community and IFPRI, researchers and practitioners emphasized the need for a broader evaluation framework spanning three levels: model, system and process.
At the model level, benchmarking assesses technical performance—such as factual accuracy in answering agriculture-specific questions. Initiatives like AIEP Golden Q&As, AgriBench, AgREASON and AgXQA aim to ground testing in real farmer queries, incorporating local pests, terminology and geospatial constraints. Experts stressed the importance of combining quantitative metrics with qualitative assessments to evaluate how models handle ambiguity and uncertainty.
At the system level, evaluation shifts to user experience and practical relevance. Precision Development’s (PxD) “user journey” approach illustrates this perspective. Its assessments revealed that even when AI advice was technically sound, confusing user interfaces—such as unclear navigation buttons—undermined usability and trust. These issues, invisible to accuracy scores, can determine whether farmers adopt or abandon the service.
At the process level, evaluation examines governance, risk mitigation and human oversight. This includes assessing product stability, error handling and the inclusion of expert review throughout deployment. Crucially, this level also addresses equity and inclusion.
Speakers highlighted that advisory systems must account for differences in gender, literacy, language and access to technology. Model-level metrics may mask disparities in who benefits from AI tools. Inclusive benchmarking should therefore measure outcome-oriented indicators—such as reach, adoption, trust and agency—disaggregated across social groups. Surveys, focus groups and user testing can help assess whether services empower marginalized farmers, particularly women.
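To illustrate how an aggregate figure can mask such disparities, the following is a minimal sketch of a disaggregated adoption indicator. The survey records, the group labels and the `adoption_rates` helper are all hypothetical and invented for illustration; they do not come from the webinar or from any real dataset.

```python
from collections import defaultdict

# Hypothetical survey records (illustrative only, not real data):
# each entry is (social group, whether the farmer adopted the advice).
records = [
    ("women", True), ("women", False), ("women", False), ("women", False),
    ("men", True), ("men", True), ("men", True), ("men", False),
]

def adoption_rates(rows):
    """Return the overall adoption rate and the adoption rate per group."""
    totals = defaultdict(int)   # respondents per group
    adopted = defaultdict(int)  # adopters per group
    for group, did_adopt in rows:
        totals[group] += 1
        adopted[group] += int(did_adopt)
    overall = sum(adopted.values()) / sum(totals.values())
    by_group = {group: adopted[group] / totals[group] for group in totals}
    return overall, by_group

overall, by_group = adoption_rates(records)
print(f"overall: {overall:.2f}")
for group, rate in sorted(by_group.items()):
    print(f"{group}: {rate:.2f}")
```

In this made-up sample, half of all respondents adopted the advice, yet the per-group rates differ sharply, which is the kind of gap a single headline number would hide. The same breakdown could in principle be applied to reach, trust or agency scores gathered through surveys or user testing.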
Participants concluded that effective benchmarking must move beyond fragmented, model-only tests toward collaborative, multilevel frameworks. By aligning technical performance with sociotechnical realities, developers can better ensure that AI-powered agricultural advisory systems are safe, scalable and truly useful in the field.
Source: IFPRI

