Knowledge Graph Modeling-Driven Large Language Model Operating System (LLM OS) for Task Automation in Process Engineering Problem-Solving (2024)

Sakhinana Sagar Srinivas1, Vijay Sri Vaikunth2, Venkataramana Runkana1
corresponding author.

Abstract

We present the Process Engineering Operations Assistant (PEOA), an AI-driven framework designed to solve complex problems in the chemical and process industries. The framework employs a modular architecture orchestrated by a meta-agent, which serves as the central coordinator, managing an action generator and instruction-tuned small-scale language models (expert models). The action generator decomposes complex problems into sub-tasks and identifies suitable expert models to execute each, delivering precise solutions for multi-step problem-solving. Key techniques include advanced knowledge modeling using property graphs for improved information retrieval, facilitating more accurate and contextually relevant solutions. Additionally, the framework utilizes a teacher-student transfer-learning approach with GPT-4 (Omni) to fine-tune the action generator and expert models for domain adaptation, alongside an iterative problem-solving mechanism with sophisticated error handling. Custom datasets were developed to evaluate the framework against leading proprietary language models on various engineering tasks. The results demonstrate the framework’s effectiveness in automating calculations, accelerating prototyping, and providing AI-augmented decision support for industrial processes, marking a significant advancement in process engineering capabilities.

Introduction

In recent years, significant advancements have been made in retrieval-augmented generation (RAG) techniques, which combine the capabilities of large language models (LLMs) with external knowledge sources to enhance information retrieval and question-answering tasks. However, while traditional RAG techniques excel at localized information retrieval, they struggle with global questions requiring a holistic understanding of knowledge bases. Recently, there has been a surge of interest in the Graph RAG approach (SciPhiAI 2024b; Edge et al. 2024; Hu et al. 2024), which integrates the strengths of property graph-based knowledge modeling from unstructured data and graph-based indexing with the retrieval and generation capabilities of LLMs. By leveraging these strengths, the Graph RAG approach aims to overcome the limitations of traditional RAG techniques. The combined market capitalization of the Oil and Gas, Semiconductor, Fast-Moving Consumer Goods (FMCG), Pharmaceuticals, Automobile, Aviation, and Energy sectors amounts to approximately $20 trillion USD, and these industrial and manufacturing sectors face complex chemical and process engineering challenges. By enhancing problem-solving capabilities in these major industries with Graph RAG approaches for both process knowledge graph modeling and retrieval for question-answering (Q&A) tasks, we have the potential to contribute to economic growth, technological advancement, and improved global competitiveness. The rapidly evolving landscape of chemical and process engineering presents numerous complex challenges that necessitate innovative solutions for design, optimization, and troubleshooting. To address these challenges, we present the Process Engineering Operations Assistant (PEOA) framework—a modular, AI-driven Large Language Model Operating System (LLM OS) designed to tackle intricate problems in the chemical and process industry by automating key steps in the problem-solving process. The framework architecture revolves around a central orchestrator, or meta-agent, which coordinates the framework's various components. The meta-agent works in tandem with an action generator, which breaks down complex problems into manageable sub-tasks and identifies the most appropriate tools (or expert models) for each step in solving them. To execute these sub-tasks with high precision, the action generator employs a collection of expert models, each specialized in a different capability. In essence, the framework utilizes a two-stage pipeline that iteratively decomposes complex problems into manageable sub-tasks, selects and chains together suitable tools, and executes solutions. The (subject-matter) expert models include small-scale language models (SLMs) for code generation, mathematical reasoning, and structured information retrieval from property graphs, enabling the framework to leverage external knowledge and solve diverse problems by decomposing, executing, and refining multi-step problem-solving trajectories. The framework incorporates an advanced error-handling mechanism. When a runtime error occurs, it uses a reflection procedure to identify the faulty step and associated tool (expert model). An expert model then generates a revised solution, considering both the immediate error and the broader problem context. This procedure iterates until a successful solution is achieved or a predefined limit is reached.
The debugging mechanism functions in two phases: error identification and solution revision. It allows the framework to dynamically adapt its problem-solving strategy, refine solutions iteratively, and tackle increasingly complex tasks that may require multiple rounds of adjustment, thereby improving its robustness and effectiveness in real-world engineering scenarios. The proposed framework addresses the limitations of current language models and problem-solving approaches in the industry, which are hindered by a lack of domain-specific knowledge and expertise, an inability to integrate diverse tools and data sources, and a limited capacity for complex, multi-step reasoning. These limitations result in inefficient and time-consuming problem-solving workflows that impede innovation and progress in the chemical and process industry. The proposed framework serves as a decision support tool, enabling process engineers to focus on high-level decision-making and innovation, accelerate design cycles through rapid prototyping and testing, and optimize chemical processes to enhance yield, efficiency, and safety. Figure 1 illustrates the framework.

Figure 1: Overview of the PEOA framework.

A key challenge is the lack of tool-integrated solutions for the chemical and process domain. To address this, we use a teacher-student transfer-learning approach with GPT-4 (Omni) as the teacher model to create tool-integrated solution trajectories. These serve as synthetic datasets for customizing the PEOA framework, generating detailed, step-by-step solutions that facilitate the transfer of advanced problem-solving capabilities to the student model. At its core, the framework utilizes a modular architecture that combines instruction-tuned small-scale language models (expert models) with graph retrieval-augmented code generation capabilities, leveraging knowledge graph databases for multi-hop reasoning and improved factual accuracy. For graph retrieval, we use an advanced knowledge modeling technique that parses complex documents (scholarly articles), constructs semantic knowledge graphs (i.e., transforms these documents into structured, searchable graphs), and indexes them for efficient information retrieval. Instruction-tuning small-scale language models (SLMs) such as the expert models is crucial because they often lack the extensive pre-trained knowledge and specialized problem-solving skills needed for complex domain-specific tasks in chemical and process engineering. Unlike proprietary large-scale models such as GPT-4 (Omni), which have more comprehensive pre-trained knowledge, expert models require adaptation to effectively utilize external information, such as vector similarity search over knowledge graphs, to produce more accurate and efficient solutions. By using instruction tuning with Graph Retrieval-Augmented Code Generation (GRACG), the framework can generate structured, multi-step solution trajectories that systematically solve complex tasks. To evaluate the proposed framework, we developed custom datasets focused on mathematical modeling, computational methods, and chemical and process engineering. We conducted extensive experiments comparing the framework's performance to leading proprietary LLMs on a range of complex engineering tasks. Our work is a first step toward significantly enhancing the capabilities of process engineers by automating routine calculations, accelerating prototyping and optimization, and providing AI-augmented decision support for complex industrial processes. The framework also manages the lifecycle of the SLMs (expert models), including fine-tuning, monitoring, and updating these models; these capabilities maintain the framework's accuracy and relevance over time, ensuring optimal performance and decision support for complex industrial processes. In summary, the PEOA framework represents a significant advancement in automating complex problem-solving in chemical and process engineering and offers a powerful solution for optimizing processes, accelerating innovation, and supporting high-level decision-making in this challenging field.
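For illustration, the sketch below shows how a teacher model could be prompted to emit tool-integrated solution trajectories of this kind. The prompt wording, the trajectory keys, and the "gpt-4o" model identifier are assumptions for this example, not the exact setup used in our experiments.

```python
# Illustrative sketch (not the exact experimental setup): prompting a teacher
# LLM to emit a tool-integrated solution trajectory used as fine-tuning data.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_PROMPT = """You are a process engineering expert. Solve the problem
step by step. For each step, output a JSON object with keys:
"step" (sub-task description), "tool" (one of: code, math, search, graph_qa),
and "output" (the tool result restated in natural language)."""

def generate_trajectory(question: str) -> str:
    """Ask the teacher LLM for a step-by-step, tool-annotated solution."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed identifier for GPT-4 (Omni)
        messages=[
            {"role": "system", "content": FEW_SHOT_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content  # trajectory as JSON lines
```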

Related Work

Large Language Models (LLMs) have demonstrated notable capabilities in various reasoning tasks, including those involving graph-structured data. However, despite their success, LLMs often face challenges with factual accuracy due to limitations in their training data and a lack of real-time knowledge integration (Hu 2023). To address these issues, Retrieval-Augmented Generation (RAG) has been developed, enhancing LLMs by integrating external data retrieval into the generative process, which improves the relevance and accuracy of responses (Lewis et al. 2020). Traditional RAG approaches, however, focus mainly on text-based entity retrieval and often overlook the structural intricacies of graph data, which are critical for tasks requiring multi-hop reasoning and context preservation across documents to answer global queries (Yasunaga et al. 2021). For example, conventional RAG methods split text into chunks, map these into a vector space, and measure similarity with the query vector but fail to capture the topological information inherent in graphs (Velickovic et al. 2018). The integration of graphs with LLMs and RAG is an emerging research area. Previous work has explored using LLMs for knowledge graph creation (Trajanoska, Stojanov, and Trajanov 2023), completion (Yao et al. 2023), and causal graph extraction (Ban et al. 2023; Zhang et al. 2024b). Advanced RAG methods leverage graph structures as knowledge indexes (Baek, Aji, and Saffari 2023), use subsets of graphs for answering queries (He et al. 2024; Zhang 2023), and ground narrative outputs in subgraphs (Kang et al. 2023). Recently, a Graph RAG approach (Edge et al. 2024) has been introduced that utilizes LLMs to construct knowledge graphs and employs graph modularity and community detection to generate comprehensive, diverse query-focused summaries for Q&A tasks. Additionally, Triplex (SciPhiAI 2024b, a), an advanced language model for efficient knowledge graph construction, extracts subject-predicate-object triplets from unstructured data, offering significant cost reductions and improved performance compared to traditional methods and general-purpose models like GPT-4. Several open-source libraries now support graph databases for RAG applications. For instance, LangChain (LangChain 2024) and LlamaIndex (LlamaIndex 2024) facilitate graph-based RAG applications with integration into Neo4j (Neo4j 2024) and NebulaGraph (NebulaGraph 2024). These advancements enhance the performance and scalability of Graph RAG systems by structuring information in a modular and hierarchical manner. At its core, the PEOA framework utilizes advanced knowledge graph construction and Graph Retrieval-Augmented Code Generation (GRACG) to tackle complex chemical and process engineering challenges. By transforming unstructured data into structured, context-rich graphs, the framework enables efficient, context-aware querying while preserving relationships and integrating diverse data types. This approach improves problem-solving capabilities by leveraging domain-specific tools (e.g., expert models) and creating structured solution trajectories, thereby enhancing accuracy and efficiency. The combination of knowledge modeling through property graphs and graph-retrieval augmentation allows the framework to deliver precise, systematic solutions, streamline workflows, and automate complex engineering tasks, accelerating design cycles and supporting high-level decision-making in the process industry.

Proposed Method

We aim to address the complex challenges faced by chemical and process engineers in designing, optimizing, and troubleshooting industrial processes. To this end, we are developing the Process Engineering Operations Assistant (PEOA), a task automation framework conceptualized as a Large Language Model Operating System (LLM OS). This modular framework combines AI-driven capabilities with computational tools to streamline problem-solving in chemical and process engineering. At its core, the PEOA framework leverages the LLM OS to manage and orchestrate foundational language models, automating key steps in the problem-solving process. The objective is to allow process engineers to focus on high-level decisions and innovation while accelerating design cycles through faster prototyping and testing and optimizing chemical processes to identify conditions that maximize yield, efficiency, and safety. Small-scale language models for code (SLMs) such as Google Code Gemma (Google 2024) and Meta Code Llama (Meta AI 2023) often lack extensive pre-trained knowledge related to domain-specific tasks, such as specialized mathematical reasoning and problem-solving skills for the chemical and process industry, compared to proprietary large-scale models (LLMs) like GPT-4 (Omni) (Achiam et al. 2023). Additionally, SLMs are not designed to effectively incorporate and utilize external knowledge from various domain-specific tools (e.g., vector similarity search on knowledge graph databases of code repositories/documentation, or retrieval-augmented generation with Stack Overflow APIs) for more accurate and efficient problem-solving beyond their pre-trained knowledge (Zhang et al. 2024a). These limitations hinder the performance of SLMs in specialized domains. Instruction-tuning SLMs to access external information offers a promising solution, improving their use of relevant background knowledge for more accurate outputs. We utilize instruction tuning with Graph Retrieval-Augmented Code Generation (GRACG), allowing the proposed framework, which combines SLMs with the ReAct (Reason + Act) (Yao et al. 2022) prompting technique, to generate 'solution trajectories'—structured, step-by-step problem-solving sequences that break down complex tasks, integrate various tools, and produce coherent solutions. Unlike traditional RAG, which relies on linear text retrieval, GRACG techniques utilize knowledge graph databases that preserve graph topology. Graph-based representation of relationships and hierarchies between entities and concepts is more effective than flat text, providing richer contextual information, enhancing multi-hop reasoning, and reducing hallucinations. The solution trajectory consists of multiple steps, executed sequentially to systematically and incrementally solve complex chemical and process industry problems. Each step in a trajectory includes a high-level step description (a sub-task), a specific tool to use from a predefined set, and the tool-executed output reformulated in natural language. While this approach shows promise, a significant implementation challenge remains: no pre-existing curated dataset of tool-integrated solution trajectories exists for this domain that illustrates the comprehensive process of solving complex, multi-step reasoning tasks step-by-step through the integration of various tools.
Such a trajectory would provide a structured approach for instruction-tuning SLMs, enhancing domain-specific knowledge and computational tool usage to generate code for solving process engineering calculations. To overcome this limitation, we utilize a teacher-student learning paradigm (Kim et al. 2024) to adapt SLMs to domain-specific tasks with performance comparable to proprietary LLMs. A foundational LLM, such as GPT-4 (Omni), serves as a robust teacher (subject-matter expert) to generate high-quality, tool-integrated solution trajectories that act as synthetic instruction-tuning datasets demonstrating effective problem-solving strategies. The machine-generated datasets are used to develop a robust and customizable student model—PEOA—for solving process engineering calculations. The teacher model is prompted with few-shot examples to generate step-by-step solutions that involve calling specific tools to solve domain-specific tasks. Each solution trajectory consists of a sequence of steps (i.e., sub-tasks to perform), the corresponding tools to be used at each step, and outputs (i.e., the result of executing the tool on the given step) reformulated into natural language. Our method efficiently transfers knowledge from the large teacher model by distilling its advanced mathematical reasoning and problem-solving capabilities into a smaller student model. The student model learns effective strategies for performing complex multi-step reasoning, breaking down complex tasks into smaller, more manageable steps, integrating diverse tools, and producing coherent step-by-step solutions (tool-specific outputs). Tools are specialized components or services that enhance the capabilities of language models, enabling them to handle complex and diverse tasks. These include code generators for creating executable snippets, math problem solvers for mathematical reasoning, and vector-search retrieval on knowledge graph databases for structured information access. By integrating these tools, language models can expand their problem-solving abilities. Integrating external tools with automatic tool chaining (Shi et al. 2024) allows SLMs to execute tasks beyond their pre-trained knowledge, augmenting their problem-solving abilities. Tool learning involves four stages. It begins with task planning, where the SLM analyzes a user's query and decomposes it into sub-tasks using tuning-free methods such as few-shot prompting with ReAct techniques. Next, in tool selection, the SLM identifies the most appropriate tools for each sub-task. During tool calling, the SLM extracts and formats the necessary parameters from the user's query to invoke the selected tools. Finally, in response generation, the SLM synthesizes the tool outputs with its own pre-trained knowledge to provide a comprehensive and coherent response. Tool learning can follow two paradigms: one-step task solving, where SLMs plan sub-tasks upfront and generate responses without adjusting for errors, and iterative task solving, where SLMs interact with tools iteratively, correcting tool outputs based on feedback. In this work, we use iterative task solving to enable SLMs to handle complex queries more effectively by leveraging external tool chaining. A minimal sketch of the trajectory format described above follows.
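As a hedged illustration (the field names are our choice for this example, not a published schema), one tool-integrated solution trajectory could be represented as:

```python
# Minimal sketch of a trajectory record for instruction tuning; field names
# are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    description: str   # high-level sub-task, e.g. "Compute the Reynolds number"
    tool: str          # tool identifier, e.g. "code" or "math"
    output: str        # tool result reformulated in natural language

@dataclass
class SolutionTrajectory:
    question: str
    steps: list[TrajectoryStep] = field(default_factory=list)
    final_answer: str = ""
```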
Given a natural language query $Q$, we begin by decomposing it into smaller, manageable sub-tasks. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ be the set of sub-tasks derived from $Q$. The aim is to enable the proposed framework to use a sequence of tools from the set $\mathcal{T} = \{t_1, t_2, \ldots, t_{|\mathcal{T}|}\}$ to solve the task. For each sub-task $s_i$, the most appropriate tool $t_i$ is selected from the set of tools $\mathcal{T}$. The framework first determines whether tool usage is necessary to solve the sub-task; if so, it chains the selected tools together to complete the task. When tools are not required, the framework relies on its internal pre-trained knowledge. The tool protocols provide meta-information for understanding each tool's purpose and usage. The tool protocols $\mathcal{D} = \{d_1, d_2, \ldots, d_{|\mathcal{D}|}\}$ consist of documented protocols $d_i$ corresponding to each tool $t_i \in \mathcal{T}$. Each protocol $d_i \in \mathcal{D}$ offers detailed information about its associated tool $t_i$, including an overview of functionality and use cases, argument requirements specifying necessary inputs, and a response schema outlining the expected output structure and type. The detailed tool protocols allow the framework to learn tool usage, understand the input-output schema and capabilities of various tools, and manage data-flow dependencies, enabling it to chain together and utilize multiple tools to solve the end-user task. Tool learning is a crucial component of the proposed framework, supporting its core objective of streamlining workflows and automating complex problem-solving tasks in process engineering. The PEOA framework consists of a meta (top-level) agent orchestrating a specialized action generator ($\mathcal{A}$) and expert models (tools) ($\mathcal{M}_t$) realized by SLMs. The meta-agent delegates the input question and the solution history to the action generator, which predicts the next high-level sub-task and selects the appropriate tool to solve it; the expert models then execute the sub-task precisely, updating the solution state. The framework iterates over a two-stage pipeline to solve multi-step reasoning tasks using various expert models as tools, combining the generation of sub-tasks and tool selection with the invocation of the specialized expert models to efficiently address complex problems.
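As a hedged example of such a documented protocol $d_i$ (the keys and the "code" tool are illustrative assumptions):

```python
# Sketch of one documented tool protocol: overview, argument requirements,
# and response schema, as described above. Keys are assumptions.
CODE_TOOL_PROTOCOL = {
    "name": "code",
    "overview": "Generates and executes Python snippets for numerical "
                "process-engineering calculations.",
    "arguments": {
        "task": "natural-language description of the computation",
        "context": "solution history needed to parameterise the snippet",
    },
    "response_schema": {
        "stdout": "captured interpreter output",
        "result": "value(s) reformulated in natural language",
    },
}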
Tool-integrated solution trajectories generated by the teacher model fine-tune the action generator and expert models. The framework employs a diverse set of expert models to execute actions based on the tool chosen by the action generator. These models include $\mathcal{M}_c$ (DeepSeek-Coder-7B-Instruct (Guo et al. 2024)) for generating executable code snippets, and $\mathcal{M}_m$, a RAG-technique variant that combines the mathematical reasoning of DeepSeek-Math-7B-Instruct (Shao et al. 2024) with the computational power of Wolfram Alpha's API (Hindin 2010) for advanced problem-solving. Additionally, $\mathcal{M}_q$ (Meta Llama or Google Gemma) is used for crafting search queries, translating sub-tasks into understandable formats to retrieve information from web search engines like DuckDuckGo or Stack Overflow APIs, and parsing their outputs. Lastly, $\mathcal{M}_{KQ}$ (Meta Llama or Google Gemma), a Graph RAG variant, is employed for conducting structured information retrieval through similarity searches over knowledge graph databases of scholarly sources such as numerical libraries and code documentation. The action generator ($\mathcal{A}$) is realized with Meta Llama or Google Gemma. Finally, the top-level agent integrates the results, potentially with its own knowledge, to craft a coherent, human-friendly response that provides context, explanations, and insights, allowing the framework to tackle a wide range of complex tasks by leveraging specialized tools as needed. The action generator $\mathcal{A}$ takes the task instruction $x$ and the concatenated solution history $h_{i-1}$ up to the previous step and predicts the next step $s_i$ and the associated tool $t_i$ to solve the sub-task, as follows:

$$\mathcal{A}(I_a, x, h_{i-1}) = \mathcal{A}(I_a, x, [s_1 \,\|\, o_1 \,\|\, \ldots \,\|\, s_{i-1} \,\|\, o_{i-1}]) \rightarrow [t_i, s_i]$$

where $h_{i-1}$ is the solution history up to step $i-1$ and $I_a$ denotes a concise instruction prompt provided to the action generator to predict the next step $s_i$ and the tool $t_i$. The step $s_i$ is the high-level description of the action to be taken at each stage in the tool-integrated solution trajectory. The expert model $\mathcal{M}_{t_i}$ associated with the tool $t_i$ generates the output $o_i$ for the step $s_i$ as follows:

$$\mathcal{M}_{t_i}(I_m, x, h_{i-1}, s_i) \rightarrow o_i$$

where $\mathcal{M}_{t_i}$ is the expert model corresponding to the tool $t_i$, $o_i$ is the output of the current step, and $h_i = h_{i-1} \cup (s_i, o_i)$ is the updated solution history including $s_i$ and $o_i$. $I_m$ serves as a concise instruction prompt provided to the expert model to generate the output for a given step in the solution trajectory. The output $o_i$ is generated by executing the tool's action. For example, a code snippet $c_i$ generated by $\mathcal{M}_c$ is executed by a code interpreter to produce $o_i$. The iterative process continues until the action generator $\mathcal{A}$ identifies the final answer to $x$ in the solution history $h$. $\mathcal{A}$ and $\mathcal{M}_t$ are trained on tool-integrated solution trajectories generated by a teacher LM (GPT-4 (Omni)). At inference time, the proposed framework uses $\mathcal{A}$ to predict steps (sub-tasks) and tools, and $\mathcal{M}_t$ to execute these steps until it finds the final answer. Table 1 demonstrates the framework's ability to break down a complex problem into manageable steps, utilize appropriate tools (in this case, code execution), and provide a clear, step-by-step solution. Note: the example was chosen for simplicity and illustration.
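To make the two-stage loop concrete, the sketch below shows one way the iteration could be wired together. The `slm_generate` callable and the expert-model callables stand in for the instruction-tuned SLMs; the prompt format and the "FINAL" sentinel are assumptions for illustration, not our exact implementation.

```python
# Illustrative sketch of the iterative pipeline: A predicts (t_i, s_i) from
# the task x and history h_{i-1}; M_{t_i} produces o_i; the history grows
# until A signals the final answer.
def predict_next_action(slm_generate, instruction, task, history):
    h = "\n".join(f"Step: {s}\nOutput: {o}" for s, o in history)
    prompt = f"{instruction}\n\nTask: {task}\n\nHistory:\n{h}\n\nNext tool and step:"
    reply = slm_generate(prompt)       # e.g. "tool=code | step=Integrate the ODE"
    tool, step = (part.split("=", 1)[1].strip() for part in reply.split("|"))
    return tool, step

def solve(task, slm_generate, instruction, experts, max_steps=10):
    history = []                       # h_0 is empty
    for _ in range(max_steps):
        tool, step = predict_next_action(slm_generate, instruction, task, history)
        if tool == "FINAL":            # A found the answer in the history
            return step
        output = experts[tool](task, history, step)  # o_i = M_{t_i}(I_m, x, h, s_i)
        history.append((step, output))               # h_i = h_{i-1} ∪ (s_i, o_i)
    return history[-1][1] if history else ""
```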

The framework employs a sophisticated error-handling (code debugging) (Gou et al. 2023) and adaptive problem-solving mechanism, utilizing a dynamic interplay between an action generator $\mathcal{A}$ and specialized expert models $\mathcal{M}_t$, which work in tandem to decompose, execute, and refine multi-step problem-solving trajectories. When encountering a runtime error, the action generator $\mathcal{A}$ employs a reflection mechanism to identify both the faulty step $s^f_i$ and the associated tool $t^f_i$ as follows:

$$[s^f_i, t^f_i] = \mathcal{A}(I_e, x, h_{i-1}, o^f_i)$$

where $I_e$ represents the error identification instruction, $x$ denotes the original task, $h_{i-1}$ is the cumulative solution history up to the previous step, and $o^f_i$ is the faulty output that triggered the error. This error localization process leverages the system's understanding of tool protocols, input-output schemas, and the interdependencies between various computational tools and steps in the problem-solving sequence. Once the error is localized, the corresponding expert model $\mathcal{M}_{t_i}$ generates a revised step and output prediction as follows:

$$o_i = \mathcal{M}_{t_i}(I_r, h_{i-1}, s^f_i, o^f_i)$$

where $I_r$ is a specially crafted revision instruction. This revision process not only corrects the immediate error but also considers the broader context of the problem, ensuring that the revised step aligns with the overall solution strategy. This error-correction and refinement process iterates until successful execution is achieved or a predefined iteration limit is reached. With each iteration, the solution history is updated, creating a comprehensive record of the problem-solving trajectory, including both successful steps and addressed challenges. This iterative approach enables the framework to tackle increasingly complex tasks that may require multiple rounds of refinement. Combined with the framework's ability to chain multiple tools and parse their outputs, this approach significantly enhances its problem-solving capabilities. The framework's ability to dynamically generate, execute, and refine both individual steps and overarching action sequences is particularly noteworthy. In summary, the proposed framework, PEOA, is a novel approach for automating complex problem-solving in process engineering, enabling it to accelerate design cycles, optimize chemical processes, and support high-level decision-making. It operates in two intertwined phases analogous to localization and repair: error identification and solution revision. In the error identification phase, the framework leverages a reflection mechanism within its action generator ($\mathcal{A}$) to analyze runtime errors ($o^f_i$) and pinpoint faulty steps ($s^f_i$) and tools ($t^f_i$) within the tool-integrated solution trajectory. During solution revision, the corresponding expert model ($\mathcal{M}_{t_i}$), guided by a revision instruction ($I_r$), proposes a revised output ($o_i$), considering the error and solution history ($h_{i-1}$). This revised solution is integrated into the solution trajectory, and the process iterates until a satisfactory solution is achieved, enabling the framework to dynamically adapt its problem-solving strategy for complex calculations.
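A hedged sketch of this reflect-and-revise loop is given below. Here `localize_error` plays the role of $\mathcal{A}$ under the error-identification instruction $I_e$, and `experts[tool]` the revising expert model under $I_r$; both call signatures are assumptions for illustration.

```python
# Sketch of the iterative repair loop: attempt a step, localize the fault on
# failure, and ask the responsible expert model for a revised step.
def run_with_repair(execute, localize_error, experts,
                    task, history, step, max_retries=3):
    for _ in range(max_retries):
        try:
            return execute(step)              # attempt the current step
        except RuntimeError as err:
            faulty_output = str(err)
            # [s_i^f, t_i^f] = A(I_e, x, h_{i-1}, o_i^f)
            faulty_step, faulty_tool = localize_error(task, history, faulty_output)
            # o_i = M_{t_i}(I_r, h_{i-1}, s_i^f, o_i^f): revised step proposal
            step = experts[faulty_tool](history, faulty_step, faulty_output)
    raise RuntimeError("Exceeded the predefined repair-iteration limit.")
```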

Knowledge Modeling: Document Parsing/Indexing for Graph-Based Semantic Search and Retrieval

For graph retrieval-augmented code generation (GRACG), we perform document parsing (LlamaIndex 2023b) to extract structured information from unstructured PDFs. This involves reading the PDF, analyzing its structure, and extracting content such as text, images, tables, and code. We store and query this information using a production-grade graph database, such as Neo4j, which supports property graphs and vector searches. The graph database organizes parsed document elements and their metadata into nodes, relationships, and properties, preserving contextual relationships and enabling efficient, context-aware retrieval. Text chunking divides large texts into smaller, manageable segments (chunks) to preserve context, improve processing efficiency, and enhance indexing and retrieval in document-specific knowledge-graph search engines. We use a sliding-window technique, moving a fixed-size window across the text with a predefined stride, producing overlapping chunks that maintain contextual continuity. Text segments are stored as chunk nodes with metadata, including title, page numbers, summaries, and keywords. Text embedding models generate dense semantic vector representations of text segments, stored as additional metadata to enable semantic search and context-aware vector retrieval. Parsed text segments, stored as chunk nodes, are processed by an LLM such as GPT-4 (Omni), which infers and generates knowledge graph triples by identifying entities and relationships; it outputs single-hop paths in the format (subject(entity) — relation — object(entity)). This approach dynamically constructs an ontology—a formal representation of domain concepts (e.g., entities, attributes, and categories) and their relationships (e.g., associations and hierarchies)—while developing a schema that defines the database structure. Entity nodes represent specific concepts or objects mentioned in text chunks and link to the related chunk nodes via 'MENTIONS' relationships. In summary, each text chunk in the property graph store is associated with two node types, chunk nodes and entity nodes, capturing the various attributes and metadata of the text segment. The knowledge graph serves as both an ontology and a schema, providing a flexible, semantically rich framework for organizing and querying the extracted knowledge. The chunking and chunk-node construction steps are sketched below.
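The sliding-window chunking and chunk-node construction just described might look as follows; the window and stride sizes, the `embed` callable, and the node label and property names are illustrative assumptions, not our production schema.

```python
# Sketch: overlapping sliding-window chunking, then one Chunk node per
# segment in Neo4j with its text embedding stored as a node property.
from neo4j import GraphDatabase

def chunk_text(text: str, window: int = 512, stride: int = 384) -> list[str]:
    """Overlapping chunks: consecutive windows share (window - stride) chars."""
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break
    return chunks

def store_chunks(uri, auth, doc_title, text, embed):
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        for idx, chunk in enumerate(chunk_text(text)):
            session.run(
                "CREATE (c:Chunk {title: $title, seq: $seq, "
                "text: $text, embedding: $vec})",
                title=doc_title, seq=idx, text=chunk, vec=embed(chunk),
            )
    driver.close()
```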
Each table is represented as a node with metadata properties such as table ID, title, source page, and summary description. We use text embedding techniques to create vector representations of the table content, facilitating efficient similarity searches. Each row in a table is represented as a row node with properties corresponding to the column values; each row node is linked to its table node through a relationship type such as 'BELONGS', facilitating efficient querying. Similarly, for images, we store metadata related to scholarly image data as image nodes with properties such as page number, resolution, format, and summary descriptions generated by LLMs like GPT-4 (Omni); these descriptions provide high-level scene interpretation and content analysis. We also use CLIP embeddings to convert images into low-dimensional embeddings that capture their semantic content. These vector representations are stored as node metadata properties, enabling efficient similarity searches and the retrieval of semantically similar images from a local file system. Each image node is connected to its top-K visually similar nodes through visual-similarity relationships, enabling the retrieval of visually similar images. In summary, each image in the property graph store is represented by a single node that stores metadata and semantic content representations (generated by CLIP embeddings) and is connected to other nodes through visual-similarity relationships. We use a code hierarchy parser (LlamaIndex 2023a) to break long code files from GitHub repositories into manageable segments by creating a hierarchical structure. This process, called skeletonization (e.g., using abstract syntax trees), replaces code blocks with comments that reference specific nodes for detailed context. The parser organizes code into nodes based on scope (e.g., functions, methods, classes, modules) and links these nodes to their parent and child nodes, enhancing readability and accelerating KG vector retrieval. The parser also handles comments, import statements, and variable declarations. For metadata extraction, we gather information on project structure, dependencies, and versions. Finally, we perform entity de-duplication to address duplicate entities in the KG. This involves identifying similar nodes using cosine similarity and Levenshtein distance, merging overlapping groups of similar nodes, filtering subsets to retain the most comprehensive node groups, and ultimately merging the nodes within each group to discard redundancies while preserving the most descriptive identifiers. Entity de-duplication maintains graph accuracy, reduces noise, and ensures that searches and analyses operate on unique data. Graph retrieval involves selecting the top-$k$ entity nodes based on vector similarity to the user query and traversing the graph to retrieve adjacent triples (one-hop neighbors) and the corresponding parent nodes (see the sketch following this paragraph). In summary, we transform unstructured data into structured, searchable knowledge, covering the workflow from parsing PDFs to constructing and querying knowledge graphs. These graphs extract, organize, and utilize information from complex documents to assist with code generation tasks. This approach emphasizes LLMs (such as GPT-4 (Omni)) for dynamic ontology creation, graph databases for semantic searches, and context preservation for enhanced performance. The expert model (Google Gemma or Meta Llama) interprets the user's query, integrates information retrieved from the knowledge graph with its pre-existing knowledge, and generates a coherent, contextually appropriate response. This combination leverages the expert model's language understanding and generation capabilities while grounding its outputs in external, structured knowledge, resulting in more accurate and informative answers.
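As a sketch of the graph-retrieval step, assuming a Neo4j 5 vector index (the index name, labels, and property names here are illustrative):

```python
# Select the top-k entity nodes by vector similarity to the query embedding,
# then traverse one hop to collect adjacent triples and parent chunk nodes.
RETRIEVE_CYPHER = """
CALL db.index.vector.queryNodes('entity_embedding', $k, $query_vec)
YIELD node AS entity, score
MATCH (entity)-[r]-(neighbor)
OPTIONAL MATCH (chunk:Chunk)-[:MENTIONS]->(entity)
RETURN entity.name AS subject, type(r) AS relation,
       neighbor.name AS object, chunk.text AS parent_text, score
"""

def retrieve_context(session, query_vec, k=5):
    """Run the retrieval query in an open Neo4j session and collect records."""
    return [dict(record)
            for record in session.run(RETRIEVE_CYPHER, k=k, query_vec=query_vec)]
```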

Experiments

Benchmark Datasets:

We developed two custom benchmark datasets to train and evaluate our framework for solving complex chemical and process engineering problems: the mathematical and computational tuning (MathComp) dataset and the chemical process tuning (ChemProc) dataset. The MathComp dataset contains over 8,500 question-answer pairs focusing on mathematical modeling and numerical algorithms; it is designed to adapt the framework to use computational tools for tasks such as solving differential equations, linear algebra, and optimization. The ChemProc dataset includes over 7,000 question-answer pairs covering topics specific to chemical engineering, such as mass and energy balances, thermodynamics, heat transfer, reaction kinetics, fluid mechanics, separation processes, and process control. These high-quality datasets were essential for adapting the framework to specialized engineering problems, providing domain-specific knowledge and enabling it to leverage computational tools. We compiled the datasets from publicly available scholarly sources, including textbooks ranging from introductory to advanced levels, ensuring a comprehensive and diverse collection of problems and solutions. The datasets were divided into training (70%), validation (15%), and test (15%) sets to facilitate rigorous evaluation. In summary, these datasets provide the domain-specific knowledge, computational problem-solving skills, and rigorous evaluation framework absent from existing, more general datasets: MathComp focuses on mathematical modeling and numerical algorithms, while ChemProc covers core chemical and process engineering principles.

Experimental Settings:

In our experimental setup, we leveraged the custom MathComp and ChemProc datasets to train and evaluate the proposed framework. A key innovation in our approach was the implementation of a sophisticated knowledge modeling technique using property graphs. We developed a custom document parsing pipeline to extract structured information from complex, unstructured PDFs of scholarly articles. This process involved analyzing document structure, extracting various content types, and retrieving metadata. To store and query this information effectively, we utilized enterprise-level graph databases like Neo4j, allowing us to create a rich, interconnected representation of domain knowledge. We structured the data as a labeled property graph, with nodes representing different elements (text, images, tables, and code) and edges capturing the relationships. The resulting knowledge graph served dual purposes—as both an ontology and a schema—providing a flexible framework for organizing and querying the extracted knowledge. For benchmarking, we compared the framework against leading proprietary models like GPT-4, Claude-3 Opus, and Google Gemini Pro. We fine-tuned smaller language models (DeepSeek-Coder-7B-Instruct and DeepSeek-Math-7B-Instruct) using the Hugging Face PEFT library, employing techniques like QLoRA. Our hyperparameter configuration included a batch size of 24, a learning rate of 1e-4, and 50 training epochs, among other settings. Training was conducted on NVIDIA GPUs, with multiple independent runs to ensure robustness. We reported ensemble averages of the results to provide a comprehensive evaluation of the framework’s performance in handling complex chemical and process engineering tasks.
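A minimal sketch of this fine-tuning setup follows. The batch size, learning rate, and epoch count match the configuration above; the LoRA rank, alpha, target modules, and the Hugging Face repository identifier are illustrative assumptions.

```python
# Sketch of QLoRA fine-tuning with the Hugging Face PEFT library; pair the
# resulting model and arguments with a standard Trainer for training.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-7b-instruct-v1.5",  # assumed repo id
    quantization_config=bnb,
)
lora = LoraConfig(r=16, lora_alpha=32,       # rank/alpha are assumptions
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="peoa-expert",
    per_device_train_batch_size=24,          # batch size from the text
    learning_rate=1e-4,                      # learning rate from the text
    num_train_epochs=50,                     # epoch count from the text
)
```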

Evaluating Tool Proficiency:

Our study employs various evaluation metrics to assess the effectiveness of the proposed framework's tool learning (Qu et al. 2024) across different stages: task planning, tool selection, tool calling, and response generation. We evaluate the task planning capabilities of the framework through several key metrics: Tool Usage Awareness, Pass Rate, and Accuracy. Tool Usage Awareness measures the ability of the framework to correctly identify whether a query requires an external tool, expressed as $\text{Awareness} = \frac{\text{Number of Correct Identifications}}{\text{Total Number of Queries}}$. The Pass Rate assesses the effectiveness of the proposed task planning in addressing the query, calculated as $\text{Pass Rate} = \frac{\text{Number of Successfully Completed Tasks}}{\text{Total Number of Tasks}}$. Accuracy evaluates the precision of the plan generated by the framework against a gold-standard solution, calculated as $\text{Accuracy} = \frac{\text{Number of Correct Plans}}{\text{Total Number of Plans}}$. The values of these metrics range from 0 to 1, where 0 indicates the worst performance and 1 the best. The evaluation metrics used for tool selection include Recall, NDCG, and COMP. Recall@K measures the proportion of ground-truth tools that appear among the selected top-K tools, formulated as $\text{Recall@K} = \frac{1}{|Q|}\sum_{q=1}^{|Q|}\frac{|T^K_q \cap T^*_q|}{|T^*_q|}$, where $Q$ is the set of queries, $T^*_q$ is the set of relevant tools for query $q$, and $T^K_q$ is the set of top-K tools selected for query $q$ by the framework.
Normalized Discounted Cumulative Gain (NDCG@K) considers both the proportion and the positions of relevant tools. Discounted Cumulative Gain is calculated as $\text{DCG}_q@K = \sum_{i=1}^{K}\frac{2^{g_i}-1}{\log_2(i+1)}$, and $\text{NDCG@K} = \frac{1}{|Q|}\sum_{q=1}^{|Q|}\frac{\text{DCG}_q@K}{\text{IDCG}_q@K}$, where $g_i$ is the graded relevance score (assigned by human evaluators) at position $i$ and IDCG is the ideal DCG. '@K' indicates that the cumulative gain is computed up to the K-th item in the ranked list. COMP@K assesses whether the top-K selected tools form a complete set with respect to the ground-truth set, defined as $\text{COMP@K} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} I(\Phi_q \subseteq \Psi^K_q)$, where $\Phi_q$ is the ground-truth tool set for query $q$ and $\Psi^K_q$ is the set of top-K tools retrieved. The indicator function $I(\cdot)$ returns 1 if $\Phi_q \subseteq \Psi^K_q$ and 0 otherwise; the subset condition ensures that all relevant tools are included in the retrieved top-K results.
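For reference, the three tool-selection metrics can be computed per query directly from these formulas (corpus-level scores are means over queries); the ideal-DCG term is simplified here to sorting the retrieved gains.

```python
# Per-query reference implementations of Recall@K, NDCG@K, and COMP@K,
# written from the formulas above.
import math

def recall_at_k(selected_k: list, relevant: set) -> float:
    return len(set(selected_k) & relevant) / len(relevant)

def ndcg_at_k(gains: list) -> float:
    """`gains` holds graded relevance g_i of the top-K items, in rank order."""
    dcg = sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(gains, 1))
    ideal = sorted(gains, reverse=True)      # simplified ideal ranking
    idcg = sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(ideal, 1))
    return dcg / idcg if idcg > 0 else 0.0

def comp_at_k(selected_k: list, ground_truth: set) -> float:
    return 1.0 if ground_truth <= set(selected_k) else 0.0  # I(Φ ⊆ Ψ)
```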
The evaluation metrics for tool selection—Recall@K, NDCG@K, and COMP@K—each range from 0 to 1, with higher values indicating better performance. In evaluating tool calling, we assess the framework using three metrics: Consistency with Stipulations, Correctness of Parameter Extraction, and Error Handling. Consistency with Stipulations measures how well the provided parameters match the tool's documentation requirements, calculated as $\left(\frac{\text{Number of parameters consistent with the stipulations}}{\text{Total number of parameters required}}\right)\times 100\%$. Correctness of Parameter Extraction evaluates the accuracy of extracting the correct parameters from the user query, defined as $\left(\frac{\text{Number of correctly extracted parameters}}{\text{Total number of parameters}}\right)\times 100\%$. Error Handling assesses the system's ability to manage errors during tool calling, measured as $\left(\frac{\text{Number of errors handled successfully}}{\text{Total number of errors encountered}}\right)\times 100\%$. These metrics range from 0% (complete failure, e.g., no parameters meeting the stipulations, no parameters correctly extracted, or no errors managed) to 100% (perfect performance, e.g., all parameters meeting the stipulations, all parameters correctly extracted, or all errors handled effectively). The evaluation metrics used for response generation include BLEU, ROUGE-L, and Exact Match. BLEU (Bilingual Evaluation Understudy) is calculated as $\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $BP$ is the brevity penalty, $w_n$ is the weight for the $n$-gram precision, and $p_n$ is the modified $n$-gram precision.
ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation) focuses on the longest common subsequence (LCS): $\text{ROUGE-L} = F_\beta = \frac{(1+\beta^2)\cdot\text{LCS-precision}\cdot\text{LCS-recall}}{\text{LCS-precision} + \beta^2\cdot\text{LCS-recall}}$, where $\beta$ is usually set to 1.0. LCS-precision is the ratio of the LCS length to the total number of words in the candidate response, LCS-recall is the ratio of the LCS length to the total number of words in the reference response, and the F-measure balances the two via a weighted harmonic mean. Exact Match measures the percentage of responses that are identical to the reference answer: $\text{Exact Match} = \frac{\text{Number of Exact Matches}}{\text{Total Number of Responses}}$. BLEU, ROUGE-L, and Exact Match all range from 0 to 1 (or 0% to 100%), with 0 indicating no overlap with the reference response and 1 (or 100%) indicating a perfect match. These metrics provide a comprehensive evaluation of the quality of generated responses, assessed against machine-generated reference responses (from a gold LLM such as GPT-4 (Omni)) in terms of precision, recall, and exact match. In summary, evaluation metrics are crucial in tool learning to ensure the framework can effectively plan tasks, select and call tools, and generate accurate and useful responses. These metrics guide the instruction tuning of the action generator and expert models (tools) and help improve the framework's performance in handling complex tasks with the aid of external tools.
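A plain implementation of ROUGE-L (with β = 1) and Exact Match, written from the formulas above; in practice, library implementations of BLEU and ROUGE would typically be used.

```python
# Reference implementations of ROUGE-L (LCS-based F-measure) and Exact Match.
def _lcs_len(a: list, b: list) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if wa == wb else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    c, r = candidate.split(), reference.split()
    lcs = _lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (prec + beta**2 * rec)

def exact_match(candidates: list, references: list) -> float:
    hits = sum(c.strip() == r.strip() for c, r in zip(candidates, references))
    return hits / len(references)
```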

User-Centric Evaluation:

We present a comprehensive human evaluation approach for assessing the effectiveness of tool learning with the framework, going beyond automated metrics. Our approach covers eight aspects, all rated by human evaluators: user satisfaction, usability, task completion, response quality, context awareness, adaptability, error handling, and qualitative feedback. User satisfaction and usability are gauged through Likert-scale surveys, with scores ranging from 1 (minimum) to 5 (maximum). Task completion records whether specific tasks are successfully completed (Yes/No). Response quality is evaluated on four criteria (relevance, clarity, completeness, and accuracy), each scored from 1 to 5. Context awareness is assessed by presenting a series of related queries to check whether the framework maintains coherence, while adaptability is tested with varied query types; both are scored from 1 to 5. Error handling is examined by introducing deliberate errors to see how well the framework corrects itself, also scored from 1 to 5. Qualitative feedback is categorized as High, Medium-High, or Medium, providing deeper insight into user experience; one session's ratings can be captured in a record like the sketch below. Together, these facets give a thorough, human-centric picture of the framework's performance.
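For concreteness, this is a minimal record type for the rubric just described. The field names are ours, not the paper's; the 1-5 bounds, Yes/No completion flag, and feedback levels follow the rubric above.

```python
from dataclasses import dataclass

@dataclass
class HumanEvaluation:
    user_satisfaction: int      # Likert 1-5
    usability: int              # Likert 1-5
    task_completed: bool        # Yes/No
    relevance: int              # response-quality criteria, each 1-5
    clarity: int
    completeness: int
    accuracy: int
    context_awareness: int      # 1-5, coherence across related queries
    adaptability: int           # 1-5, robustness across query types
    error_handling: int         # 1-5, recovery from injected errors
    qualitative_feedback: str   # "High" | "Medium-High" | "Medium"

    def response_quality(self) -> float:
        """Mean of the four response-quality criteria."""
        return (self.relevance + self.clarity + self.completeness + self.accuracy) / 4
```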

Experimental Results

The experimental results evaluating the PEOA framework on task planning, tool selection, tool calling, and response generation are detailed in the tables below. For task planning, Table 2 compares state-of-the-art proprietary LLMs using Tool Usage Awareness (TUA), Pass Rate (PR), and Accuracy (Acc), all expressed as percentages from 0% (complete failure) to 100% (perfect performance). Table 3 reports tool selection using Recall@K (the proportion of relevant tools among the top-K selected, 0% to 100%), NDCG@K (ranking quality, 0 to 1), and COMP@K (whether the selected tools form a complete set, 0% to 100%). For tool calling, Table 4 employs Consistency with Stipulations (Cons), Correctness of Parameter Extraction (PE), and Error Handling (EH), each ranging from 0% (none meet requirements, none correct, or ineffective) to 100% (all meet requirements, all correct, or fully effective). Table 5 reports response generation using BLEU (n-gram precision, 0 to 1), ROUGE-L (longest common subsequence, 0 to 1), and Exact Match (EM; exact agreement between generated and reference responses, 0% to 100%).

Table 2: Task planning performance.
Dataset  | Algorithm           | TUA (%) | PR (%) | Acc (%)
MathComp | GPT-4 Turbo-preview | 87.54   | 82.80  | 84.67
         | GPT-4-1106-preview  | 76.65   | 72.77  | 74.91
         | Claude-3 Opus       | 85.83   | 80.37  | 82.31
         | Claude-3 Haiku      | 82.91   | 77.85  | 79.97
         | Claude-3 Sonnet     | 79.64   | 74.54  | 76.97
         | Google Gemini Pro   | 86.80   | 81.35  | 83.51
         | PEOA                | 78.87   | 73.83  | 75.94
ChemProc | GPT-4 Turbo-preview | 88.94   | 83.84  | 85.97
         | GPT-4-1106-preview  | 75.62   | 71.98  | 73.89
         | Claude-3 Opus       | 84.88   | 79.86  | 81.42
         | Claude-3 Haiku      | 81.83   | 76.88  | 78.68
         | Claude-3 Sonnet     | 78.71   | 73.79  | 75.90
         | Google Gemini Pro   | 85.79   | 80.74  | 82.78
         | PEOA                | 76.96   | 71.63  | 74.52

Table 3: Tool selection performance.
Dataset  | Algorithm           | Recall (%) | NDCG | COMP (%)
MathComp | GPT-4 Turbo-preview | 86.98      | 0.80 | 84.54
         | GPT-4-1106-preview  | 74.98      | 0.66 | 72.88
         | Claude-3 Opus       | 85.82      | 0.78 | 83.90
         | Claude-3 Haiku      | 82.44      | 0.75 | 80.69
         | Claude-3 Sonnet     | 79.45      | 0.71 | 77.71
         | Google Gemini Pro   | 87.56      | 0.82 | 85.74
         | PEOA                | 78.82      | 0.69 | 76.79
ChemProc | GPT-4 Turbo-preview | 87.87      | 0.81 | 85.82
         | GPT-4-1106-preview  | 75.87      | 0.67 | 73.82
         | Claude-3 Opus       | 86.37      | 0.79 | 85.24
         | Claude-3 Haiku      | 83.86      | 0.76 | 81.34
         | Claude-3 Sonnet     | 79.92      | 0.72 | 77.35
         | Google Gemini Pro   | 88.99      | 0.83 | 86.83
         | PEOA                | 77.77      | 0.68 | 75.55

Table 4: Tool calling performance.
Dataset  | Algorithm           | Cons (%) | PE (%) | EH (%)
MathComp | GPT-4 Turbo-preview | 87.73    | 85.25  | 84.34
         | GPT-4-1106-preview  | 71.69    | 68.74  | 67.86
         | Claude-3 Opus       | 86.56    | 83.91  | 82.81
         | Claude-3 Haiku      | 82.45    | 79.44  | 78.07
         | Claude-3 Sonnet     | 78.74    | 76.18  | 74.67
         | Google Gemini Pro   | 89.98    | 88.05  | 86.99
         | PEOA                | 80.41    | 78.66  | 77.05
ChemProc | GPT-4 Turbo-preview | 87.84    | 85.06  | 84.15
         | GPT-4-1106-preview  | 73.60    | 70.19  | 69.29
         | Claude-3 Opus       | 85.66    | 82.31  | 81.22
         | Claude-3 Haiku      | 81.81    | 78.38  | 77.19
         | Claude-3 Sonnet     | 76.74    | 74.35  | 72.89
         | Google Gemini Pro   | 88.98    | 87.12  | 85.86
         | PEOA                | 79.64    | 77.23  | 75.70

Table 5: Response generation performance.
Dataset  | Algorithm           | BLEU | ROUGE-L | EM (%)
MathComp | GPT-4 Turbo-preview | 0.80 | 0.78    | 83.61
         | GPT-4-1106-preview  | 0.74 | 0.72    | 78.64
         | Claude-3 Opus       | 0.77 | 0.75    | 81.75
         | Claude-3 Haiku      | 0.75 | 0.73    | 79.00
         | Claude-3 Sonnet     | 0.72 | 0.71    | 76.47
         | Google Gemini Pro   | 0.82 | 0.80    | 84.70
         | PEOA                | 0.68 | 0.66    | 73.68
ChemProc | GPT-4 Turbo-preview | 0.81 | 0.79    | 84.79
         | GPT-4-1106-preview  | 0.75 | 0.73    | 78.89
         | Claude-3 Opus       | 0.78 | 0.76    | 82.36
         | Claude-3 Haiku      | 0.76 | 0.74    | 80.61
         | Claude-3 Sonnet     | 0.74 | 0.72    | 78.15
         | Google Gemini Pro   | 0.83 | 0.81    | 84.90
         | PEOA                | 0.69 | 0.67    | 74.13

Table 6: Human evaluation: user satisfaction (US), usability, and task completion.
Dataset  | Algorithm           | US   | Usability | Task Completion
MathComp | GPT-4 Turbo-preview | 4.52 | 4.43      | 90.32%
         | GPT-4-1106-preview  | 4.13 | 4.01      | 85.27%
         | Claude-3 Opus       | 4.31 | 4.22      | 88.14%
         | Claude-3 Haiku      | 4.22 | 4.11      | 87.09%
         | Claude-3 Sonnet     | 4.04 | 3.92      | 82.16%
         | Google Gemini Pro   | 4.67 | 4.55      | 92.48%
         | PEOA                | 4.08 | 3.91      | 80.53%
ChemProc | GPT-4 Turbo-preview | 4.56 | 4.45      | 90.37%
         | GPT-4-1106-preview  | 4.24 | 4.12      | 86.15%
         | Claude-3 Opus       | 4.33 | 4.20      | 88.22%
         | Claude-3 Haiku      | 4.21 | 4.09      | 86.47%
         | Claude-3 Sonnet     | 4.12 | 4.01      | 83.04%
         | Google Gemini Pro   | 4.72 | 4.63      | 93.09%
         | PEOA                | 4.12 | 4.02      | 81.76%

The experimental results show that the proposed framework performs effectively across the various stages of evaluation, closely matching the performance of proprietary LLMs, though a slight performance gap remains. Tables 6 and 7 compare the PEOA framework with proprietary LLMs across five human-rated metrics: user satisfaction (US), usability, task completion, response quality, and context awareness. A 1-5 scale is used for all metrics except task completion, which is measured as a percentage. Table 8 compares the PEOA framework with proprietary LLMs on adaptability, error handling, and qualitative feedback. Our comprehensive human evaluation demonstrates that the proposed framework approaches the performance of proprietary language models across multiple aspects of tool learning effectiveness.

Table 7: Human evaluation: response quality and context awareness.
Dataset  | Algorithm           | Response Quality | Context Awareness
MathComp | GPT-4 Turbo-preview | 4.55             | 4.43
         | GPT-4-1106-preview  | 4.12             | 4.08
         | Claude-3 Opus       | 4.38             | 4.27
         | Claude-3 Haiku      | 4.22             | 4.16
         | Claude-3 Sonnet     | 4.08             | 4.03
         | Google Gemini Pro   | 4.64             | 4.52
         | PEOA                | 4.13             | 4.02
ChemProc | GPT-4 Turbo-preview | 4.57             | 4.42
         | GPT-4-1106-preview  | 4.18             | 4.09
         | Claude-3 Opus       | 4.35             | 4.30
         | Claude-3 Haiku      | 4.20             | 4.12
         | Claude-3 Sonnet     | 4.10             | 4.05
         | Google Gemini Pro   | 4.67             | 4.51
         | PEOA                | 4.15             | 4.03

Table 8: Human evaluation: adaptability, error handling (EH), and qualitative feedback.
Dataset  | Algorithm           | Adaptability | EH   | Feedback
MathComp | GPT-4 Turbo-preview | 4.42         | 4.53 | High
         | GPT-4-1106-preview  | 4.30         | 4.48 | High
         | Claude-3 Opus       | 4.28         | 4.39 | Medium-High
         | Claude-3 Haiku      | 4.25         | 4.35 | Medium-High
         | Claude-3 Sonnet     | 4.32         | 4.42 | Medium-High
         | Google Gemini Pro   | 4.47         | 4.50 | High
         | PEOA                | 4.05         | 4.12 | Medium
ChemProc | GPT-4 Turbo-preview | 4.45         | 4.52 | High
         | GPT-4-1106-preview  | 4.33         | 4.47 | High
         | Claude-3 Opus       | 4.31         | 4.41 | Medium-High
         | Claude-3 Haiku      | 4.28         | 4.37 | Medium-High
         | Claude-3 Sonnet     | 4.35         | 4.44 | Medium-High
         | Google Gemini Pro   | 4.50         | 4.53 | High
         | PEOA                | 4.07         | 4.15 | Medium

Ablation Studies:

We conducted several ablation studies to evaluate the contributions of the PEOA framework's major components, focusing on instruction-tuning, graph-based retrieval, and the iterative problem-solving mechanism for complex chemical and process engineering calculations. By systematically disabling key components, we can isolate their roles and optimize the framework for real-world process engineering applications. The study evaluates four variants ('W/o' stands for 'without'; 'W/' stands for 'with'). The first variant (W/o GRACG) retains instruction-tuning of the expert models (tools) but disables the GRACG component. The second variant (W/o GRACG W/ RAG) combines instruction-tuned expert models with traditional (naive) RAG. The third variant (W/o Instruction-Tuning) employs GRACG for enhanced retrieval and code generation without instruction-tuning of the expert models, isolating the benefit of graph-based context. The fourth variant (W/o Error-Handling) performs iterative problem-solving without the dynamic error-handling mechanism, probing its impact on accuracy and robustness. A hypothetical configuration of these toggles is sketched below; together, the studies reveal the contribution of each component to overall performance.
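The following sketch expresses the four variants as configuration toggles. The flag names and their mapping to variants are our inference from the variant descriptions above, not the framework's actual switches.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationConfig:
    gracg: bool              # graph-based retrieval for code generation (GRACG)
    naive_rag: bool          # traditional chunk-based RAG fallback
    instruction_tuning: bool # instruction-tuned expert models (tools)
    error_handling: bool     # iterative reflection on runtime errors

# One configuration per variant evaluated in the ablation tables below.
VARIANTS = {
    "PEOA (Baseline)":        AblationConfig(True,  False, True,  True),
    "W/o GRACG":              AblationConfig(False, False, True,  True),
    "W/o GRACG W/ RAG":       AblationConfig(False, True,  True,  True),
    "W/o Instruction-Tuning": AblationConfig(True,  False, False, True),
    "W/o Error-Handling":     AblationConfig(True,  False, True,  False),
}
```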

Ablation results: task planning.
Dataset  | Variant                | TUA (%) | PR (%) | Acc (%)
MathComp | PEOA (Baseline)        | 78.87   | 73.83  | 75.94
         | W/o GRACG              | 54.42   | 49.47  | 50.88
         | W/o GRACG W/ RAG       | 66.25   | 60.54  | 61.52
         | W/o Instruction-Tuning | 51.26   | 45.77  | 48.60
         | W/o Error-Handling     | 59.94   | 56.85  | 60.75
ChemProc | PEOA (Baseline)        | 76.96   | 71.63  | 74.52
         | W/o GRACG              | 53.10   | 47.99  | 49.93
         | W/o GRACG W/ RAG       | 64.65   | 58.74  | 60.36
         | W/o Instruction-Tuning | 50.02   | 44.41  | 47.69
         | W/o Error-Handling     | 58.49   | 55.15  | 59.62

Ablation results: tool selection.
Dataset  | Variant                | Recall (%) | NDCG | COMP (%)
MathComp | PEOA (Baseline)        | 78.82      | 0.69 | 76.79
         | W/o GRACG              | 55.97      | 0.50 | 53.75
         | W/o GRACG W/ RAG       | 63.84      | 0.57 | 64.50
         | W/o Instruction-Tuning | 51.23      | 0.43 | 46.07
         | W/o Error-Handling     | 61.48      | 0.55 | 58.36
ChemProc | PEOA (Baseline)        | 77.77      | 0.68 | 75.55
         | W/o GRACG              | 55.22      | 0.49 | 52.89
         | W/o GRACG W/ RAG       | 62.99      | 0.56 | 63.46
         | W/o Instruction-Tuning | 50.55      | 0.43 | 45.33
         | W/o Error-Handling     | 60.66      | 0.54 | 57.42

Ablation results: tool calling.
Dataset  | Variant                | Cons (%) | PE (%) | EH (%)
MathComp | PEOA (Baseline)        | 80.41    | 78.66  | 77.05
         | W/o GRACG              | 56.29    | 54.27  | 51.82
         | W/o GRACG W/ RAG       | 67.54    | 63.71  | 61.64
         | W/o Instruction-Tuning | 48.25    | 50.34  | 46.01
         | W/o Error-Handling     | 61.91    | 61.35  | 61.64
ChemProc | PEOA (Baseline)        | 79.64    | 77.23  | 75.70
         | W/o GRACG              | 55.75    | 53.29  | 50.72
         | W/o GRACG W/ RAG       | 66.90    | 62.56  | 60.56
         | W/o Instruction-Tuning | 47.78    | 49.43  | 46.18
         | W/o Error-Handling     | 61.52    | 60.24  | 60.56

Ablation results: response generation.
Dataset  | Variant                | BLEU | ROUGE-L | EM (%)
MathComp | PEOA (Baseline)        | 0.68 | 0.66    | 73.68
         | W/o GRACG              | 0.49 | 0.46    | 50.10
         | W/o GRACG W/ RAG       | 0.55 | 0.56    | 61.16
         | W/o Instruction-Tuning | 0.41 | 0.42    | 47.16
         | W/o Error-Handling     | 0.52 | 0.51    | 55.59
ChemProc | PEOA (Baseline)        | 0.69 | 0.67    | 74.13
         | W/o GRACG              | 0.50 | 0.46    | 50.41
         | W/o GRACG W/ RAG       | 0.56 | 0.57    | 61.53
         | W/o Instruction-Tuning | 0.41 | 0.42    | 47.44
         | W/o Error-Handling     | 0.53 | 0.52    | 56.34

The ablation study results clearly demonstrate that the complete PEOA framework (Baseline) outperforms the ablated variants across various metrics. This highlights the synergistic effect of the framework’s components and underscores the importance of incorporating all aspects for optimal performance in specialized engineering tasks.

Additional Experiments:

We performed additional experiments to verify the property graph construction of the proposed PEOA framework, which uses LlamaIndex integration with Neo4j (Neo4j 2024) and GPT-4 (Omni) to extract triplets. We compared this with two recent advanced approaches: Triplex (SciPhiAI 2024b,a), which offers significant cost savings and efficient knowledge graph construction, and Graph RAG (Edge et al. 2024), which provides superior summarization capabilities for complex tasks. Table 13 reports the comparison in terms of Exact Match (EM), the percentage of predictions that exactly match the ground-truth responses.

Table 13: Exact Match (%) for property graph construction approaches.
Algorithm                    | MathComp | ChemProc
Triplex (SciPhiAI 2024b,a)   | 59.50    | 63.20
Graph RAG (Edge et al. 2024) | 72.50    | 73.80
PEOA                         | 73.68    | 74.13
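A minimal sketch of the LlamaIndex-plus-Neo4j construction path described above is shown below, assuming a recent llama-index release with the llama-index-graph-stores-neo4j and llama-index-llms-openai packages installed; the directory path, credentials, and query are placeholders, and the exact extraction settings used in the paper may differ.

```python
from llama_index.core import PropertyGraphIndex, SimpleDirectoryReader
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
from llama_index.llms.openai import OpenAI

# Neo4j as the backing property graph store (placeholder credentials).
graph_store = Neo4jPropertyGraphStore(
    username="neo4j", password="<password>", url="bolt://localhost:7687"
)

# GPT-4 (Omni) extracts (subject, relation, object) triplets from the corpus.
llm = OpenAI(model="gpt-4o")

# Build the property graph from a directory of process engineering documents.
documents = SimpleDirectoryReader("./process_engineering_papers").load_data()
index = PropertyGraphIndex.from_documents(
    documents, llm=llm, property_graph_store=graph_store
)

# Graph-aware retrieval for downstream Q&A.
query_engine = index.as_query_engine()
print(query_engine.query("What are the operating constraints of a distillation column?"))
```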

We conducted further experiments evaluating only the Graph RAG techniques on property graphs constructed from custom datasets of scholarly articles in mathematics and in chemical and process engineering. We generated a test set of 1,000 Question-Context-Answer (QCA) triplets from the text of the scholarly articles used to construct the property graphs, using GPT-4 (Omni) as the baseline evaluator for the different techniques. In a Graph RAG approach, evaluation focuses on retrieval evaluation (the accuracy and relevance of retrieved information) and response evaluation (the quality and appropriateness of generated responses). The metrics are answer relevance (AnwRel), context relevance (ConRel), faithfulness (Faith), and correctness (Correct), which together check that responses are pertinent, contextually grounded, truthful, and accurate.

Graph RAG evaluation on scholarly-article property graphs.
Dataset | Algorithm | AnwRel | ConRel | Faith | Correct
Math    | Triplex   | 82.34  | 80.42  | 79.58 | 84.76
        | Graph RAG | 78.18  | 75.49  | 77.52 | 81.27
        | PEOA      | 83.47  | 82.03  | 81.79 | 85.65
Chem    | Triplex   | 81.23  | 79.86  | 78.79 | 83.12
        | Graph RAG | 78.27  | 75.89  | 76.73 | 80.19
        | PEOA      | 83.77  | 79.53  | 80.88 | 84.95
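The sketch below shows one way such a QCA evaluation loop can be structured. The prompt templates and 0-100 scale are our assumptions for illustration; the `judge` callable stands in for the GPT-4 (Omni) call used as the baseline evaluator.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class QCATriplet:
    question: str
    context: str    # retrieved (graph) context
    answer: str     # generated response
    reference: str  # gold answer from the test set

# Hypothetical prompt templates for the four judged dimensions; unused fields
# are ignored by str.format, so one call signature serves all four.
PROMPTS = {
    "AnwRel":  "Score 0-100: does the answer address the question?\nQ: {q}\nA: {a}",
    "ConRel":  "Score 0-100: is the context relevant to the question?\nQ: {q}\nContext: {c}",
    "Faith":   "Score 0-100: is the answer supported by the context?\nContext: {c}\nA: {a}",
    "Correct": "Score 0-100: does the answer agree with the reference?\nA: {a}\nRef: {r}",
}

def evaluate(triplets: list[QCATriplet], judge: Callable[[str], float]) -> dict[str, float]:
    """Average each judged dimension over the QCA test set; `judge` wraps an
    LLM call that maps a scoring prompt to a numeric score."""
    return {
        name: mean(
            judge(tpl.format(q=t.question, c=t.context, a=t.answer, r=t.reference))
            for t in triplets
        )
        for name, tpl in PROMPTS.items()
    }
```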

Conclusion

In conclusion, the proposed framework represents a significant advancement in the field of process engineering by automating complex problem-solving tasks. The experimental results demonstrate that the framework performs effectively across various evaluation stages, closely matching the performance of leading proprietary LLMs while offering a modular, adaptable approach suited to the specific needs of chemical and process engineering. Future work will focus on further refining the framework’s capabilities, expanding its application to other domains, and exploring additional enhancements in tool integration and knowledge modeling.

References

  • Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  • Baek, J.; Aji, A. F.; and Saffari, A. 2023. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136.
  • Ban, T.; Chen, L.; Wang, X.; and Chen, H. 2023. From query tools to causal architects: Harnessing large language models for advanced causal discovery from data. arXiv preprint arXiv:2310.05432.
  • Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; and Larson, J. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130.
  • Google. 2024. CodeGemma: Open code models based on Gemma. arXiv preprint arXiv:2406.11409.
  • Gou, Z.; Shao, Z.; Gong, Y.; Shen, Y.; Yang, Y.; Duan, N.; and Chen, W. 2023. CRITIC: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738.
  • Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.; et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196.
  • He, X.; Tian, Y.; Sun, Y.; Chawla, N. V.; Laurent, T.; LeCun, Y.; Bresson, X.; and Hooi, B. 2024. G-Retriever: Retrieval-augmented generation for textual graph understanding and question answering. arXiv preprint arXiv:2402.07630.
  • Hindin, H. J. 2010. Wolfram Alpha. Mathematics and Computer Education, 44(1): 77.
  • Hu, Y.; Lei, Z.; Zhang, Z.; Pan, B.; Ling, C.; and Zhao, L. 2024. Graph Retrieval-Augmented Generation (GRAG). arXiv preprint arXiv:2405.16506.
  • Kang, M.; Kwak, J. M.; Baek, J.; and Hwang, S. J. 2023. Knowledge graph-augmented language models for knowledge-grounded dialogue generation. arXiv preprint arXiv:2305.18846.
  • Kim, J.; Paranjape, B.; Khot, T.; and Hajishirzi, H. 2024. Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning. arXiv preprint arXiv:2406.06469.
  • LangChain. 2024. LangChain Graphs.
  • Lewis, P.; et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, 9459-9474.
  • LlamaIndex. 2023a. CodeHierarchyAgentPack. https://github.com/run-llama/llama_index/tree/main/llama-index-packs/llama-index-packs-code-hierarchy.
  • LlamaIndex. 2023b. LlamaParse. https://github.com/run-llama/llama_parse.
  • LlamaIndex. 2024. Knowledge Graph Index.
  • Meta AI. 2023. CodeLlama. https://github.com/meta-llama/codellama.
  • NebulaGraph. 2024. GraphRAG: Retrieval-Augmented Generation with LLM Based on Knowledge Graphs.
  • Neo4j. 2024. Project NaLLM.
  • Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; and Wen, J.-R. 2024. Tool Learning with Large Language Models: A Survey. arXiv preprint arXiv:2405.17935.
  • SciPhi AI. 2024a. SciPhi/Triplex. Accessed: 2024-07-27.
  • SciPhi AI. 2024b. Triplex: SOTA LLM for Knowledge Graph Construction. Accessed: 2024-07-27.
  • Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Zhang, M.; Li, Y.; Wu, Y.; and Guo, D. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
  • Shi, Z.; Gao, S.; Chen, X.; Feng, Y.; Yan, L.; Shi, H.; Yin, D.; Chen, Z.; Verberne, S.; and Ren, Z. 2024. Chain of Tools: Large Language Model is an Automatic Multi-tool Learner. arXiv preprint arXiv:2405.16533.
  • Trajanoska, M.; Stojanov, R.; and Trajanov, D. 2023. Enhancing knowledge graph construction using large language models. arXiv preprint arXiv:2305.04676.
  • Velickovic, P.; et al. 2018. Graph attention networks. In International Conference on Learning Representations.
  • Yao, L.; Peng, J.; Mao, C.; and Luo, Y. 2023. Exploring large language models for knowledge graph completion. arXiv preprint arXiv:2308.13916.
  • Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
  • Yasunaga, M.; et al. 2021. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In North American Chapter of the Association for Computational Linguistics (NAACL).
  • Hu, Y.; Zhang, Z.; and Zhao, L. 2023. Beyond text: A deep dive into large language models' ability on understanding graph data. NeurIPS 2023 Workshop: New Frontiers in Graph Learning.
  • Zhang, J. 2023. Graph-Toolformer: To empower LLMs with graph reasoning ability via prompt augmented by ChatGPT. arXiv preprint arXiv:2304.11116.
  • Zhang, T.; Patil, S. G.; Jain, N.; Shen, S.; Zaharia, M.; Stoica, I.; and Gonzalez, J. E. 2024a. RAFT: Adapting language model to domain specific RAG. arXiv preprint arXiv:2403.10131.
  • Zhang, Y.; Zhang, Y.; Gan, Y.; Yao, L.; and Wang, C. 2024b. Causal graph discovery with retrieval-augmented generation based large language models. arXiv preprint arXiv:2402.15301.