Furthermore, it is important to highlight that learning with counterexamples was possible with only three of the IRL tools, and of these, only two generated a hypothesis at all. Specifically, Popper learned a rule with variables, whereas Aleph returned a hypothesis consisting solely of the two provided positive examples and thus did not generalize to a rule with variables in this second setting. This variation in performance suggests that the IRL tools differ in their ability to handle the inclusion of counterexamples.
5.1. Results of Experiment 1
Although Aleph can learn from both positive and negative examples, the rules it learned in this scenario are not usable. Given positive and negative examples, Aleph merely returned the positive examples themselves instead of learning a rule that implies isAllowedToUse(A, B) with variables A and B and potentially further unbound variables. As a result, the hypothesis Aleph learned with negative examples did not meet our criteria for applicability.
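To illustrate the difference, the following Prolog-style sketch contrasts a hypothesis consisting only of ground facts with a parameterized rule of the desired form. The predicates follow our meta-model, but the package constants and the rule body shown here are illustrative placeholders, not Aleph's literal output.

```prolog
% Illustrative sketch only; constants are placeholders, not Aleph's literal output.
:- dynamic containsClass/2, importsClass/2.

% A hypothesis made of ground facts merely restates the positive examples:
isAllowedToUse(teammates_logic_api, teammates_logic_core).
isAllowedToUse(teammates_logic_core, teammates_logic_api).

% A parameterized rule of the desired form generalizes over the variables A and B
% (shown here with a body of the shape our hypothesis expects):
isAllowedToUse(A, B) :-
    containsClass(A, C),
    containsClass(B, D),
    importsClass(D, C).
```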
AMIE3 learns rules from a knowledge graph without the explicit provision of negative examples, so its rules are learned solely from positive examples. Note that the format of the provided examples differs for AMIE3, as well as for AnyBURL, which we discuss later.
AMIE3 learned two rules. The first rule implies isAllowedToUse(A, B) based on the existence of an entity G that contains both A and B. Since all selected examples are packages within teammates.logic, a valid assignment for G exists, and the rule covers the examples.
The second rule learned by AMIE3 has a unique characteristic that arises from translating our meta-model into triples, as required by AMIE3. This rule states that isAllowedToUse(A, B) is implied by both isA(A, H) and isA(B, H). In the knowledge graph, there are triples such as teammates_logic_core isA package, which leads to the conclusion that the only suitable assignment for H is package. Thus, the learned rule reflects that A and B are of the same type, namely, packages.
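Paraphrased in Prolog-style notation (AMIE3 itself learns and reports rules over triples, so its surface syntax differs), and assuming the containment in the first rule is expressed via the meta-model's containsPackage predicate, the two rules read roughly as follows:

```prolog
% Paraphrase of the two AMIE3 rules; not AMIE3's literal output syntax.
:- dynamic containsPackage/2, isA/2.

% Rule 1: some entity G (here the parent package) contains both A and B.
isAllowedToUse(A, B) :-
    containsPackage(G, A),
    containsPackage(G, B).

% Rule 2: A and B share a type H; given triples such as
% isA(teammates_logic_core, package), the only fitting assignment for H is
% 'package', so the rule merely states that A and B are both packages.
isAllowedToUse(A, B) :-
    isA(A, H),
    isA(B, H).
```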
However, these rules do not fully align with the essence of our stated hypothesis, as they do not consider the importsClass(X, Y) predicate and its connection to the head predicate, nor the involvement of the bound variables A and B.
Popper stands out among the tested IRL tools as the only one capable of learning a rule with variables in the learning task that includes counterexamples for the architecture rule isAllowedToUse(A, B).
The rule learned from positive examples alone differs from the rule learned using both examples and counterexamples. The rule derived from positive examples only implies isAllowedToUse(A, B) whenever both A and B are packages. Although this typing condition follows directly from the meta-model, the rule does not capture the essence of our hypothesis and is not considered a suitable abstraction of the provided architectural rule examples.
The rule learned using both positive and negative examples proves to be the most promising outcome of the entire experiment. This rule includes the predicate importsClass(D, C) with the free variables D and C in the rule body. The incorporation of this predicate and the presence of free variables enhance the rule’s potential to align with our intended hypothesis.
Compared to our hypothesis, the learned rule does not contain the first part stating that the bound variables A and B are packages, and it does not explicitly state that C and D need to exist (using the existential quantifier ∃). However, the rule learned by Popper still captures the essence of our hypothesis by connecting the variables C and D as classes contained in A and B, respectively, as well as incorporating the importsClass(D, C) predicate with the free variables D and C in the rule body.
Existential quantification is achieved implicitly by finding suitable assignments for the respective variables, and the unary typing predicates, which state whether a variable denotes a class or a package, are already determined by the simple meta-model and by the binary predicates in which the variables occur. Consequently, these aspects of our hypothesis hold implicitly in every example (and counterexample) and therefore do not contribute to distinguishing the provided examples from the provided counterexamples.
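In Prolog-style notation, the two Popper outcomes can be sketched as follows; this is a paraphrase of the rules as described above, not Popper's verbatim output:

```prolog
% Paraphrase of the two Popper results; not Popper's verbatim output.
:- dynamic package/1, containsClass/2, importsClass/2.

% Learned from positive examples only: A and B merely have to be packages.
isAllowedToUse(A, B) :-
    package(A),
    package(B).

% Learned from examples and counterexamples: A and B contain classes C and D
% that are connected via importsClass(D, C). The typing of A and B as packages
% and the existence of C and D are left implicit, as discussed above.
isAllowedToUse(A, B) :-
    containsClass(A, C),
    containsClass(B, D),
    importsClass(D, C).
```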
GPT-3.5 demonstrated its capacity to acquire rules in both modes of the experiment: (i) from positive examples alone and (ii) from positive examples combined with counterexamples. However, when learning exclusively from positive examples, it generated a rule that conflicts with our meta-model. Specifically, the induced rule posits that some C contains a class B, i.e., containsClass(C, B), implying an isAllowedToUse relationship between a package and a class. Contrary to this, our meta-model unambiguously stipulates that the isAllowedToUse dependency holds exclusively between packages.
Furthermore, the rule learned from both provided examples and counterexamples exhibits an anomalous component: it equates the predicate package(A) with the constant teammates_logic_core. This departure from conventional syntax is noteworthy, as it places a value, teammates_logic_core, where a predicate is expected to yield a truth value of true or false, raising questions about how such an assignment should be interpreted.
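Both problems can be made concrete in Prolog-style notation. The sketch below is a paraphrase of the GPT-3.5 output as described, not the chatbot's verbatim rules, and body literals other than the ones discussed are omitted.

```prolog
% Paraphrase of the problematic GPT-3.5 output; other body literals omitted.
:- dynamic package/1, containsClass/2.

% (i) Learned from positive examples only: containsClass(C, B) types B as a
% class, so the head would relate a package A to a class B, which the
% meta-model rules out (isAllowedToUse holds only between packages).
isAllowedToUse(A, B) :-
    containsClass(C, B).

% (ii) Learned with counterexamples: the anomalous component equates a unary
% predicate with a constant, roughly
%     package(A) = teammates_logic_core
% In standard Prolog such a unification always fails, because the compound
% term package(A) can never unify with an atom; a conventional encoding of
% the intended constraint would instead be
%     A = teammates_logic_core, package(A)
```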
While GPT-3.5 processed the experimental data and formulated rules suggesting the presence of isAllowedToUse(A, B), in neither setting did it induce a rule that covers all of the provided positive examples.
Unlike its predecessor, GPT-4 was able to learn rules both when presented with examples alone and when counterexamples were included from the outset. In both scenarios, the learned rules cover the given positive examples, yielding perfect accuracy when only positive examples are provided. However, when counterexamples are introduced, GPT-4 fails to formulate a rule that also excludes them, resulting in a rule set with an accuracy of 50 percent.
Comparing our initial hypothesis with the observed rules yields interesting insights. First, the rule formed solely from positive examples has nothing in common with our hypothesis: it only expects a shared entity Z that contains both A and B (containsPackage(Z, A) and containsPackage(Z, B)) and imposes no further conditions. In contrast, the rules derived from both examples and counterexamples align more closely with our original hypothesis. Both recognize A and B as packages and introduce X and Y as classes contained in A and B, respectively. However, a critical deviation emerges in the remainder of the learned rule: it erroneously asserts that neither importsClass(A, B) nor importsClass(B, A) holds. This is a fundamental misinterpretation, as importsClass is defined between classes, not packages. In addition, the learned rule does not contain the part of the hypothesis that relates the classes X and Y via importsClass(X, Y).
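Rendered in Prolog-style notation, the GPT-4 rule learned with counterexamples reads roughly as follows; this is a paraphrase of the description above, not the chatbot's verbatim output:

```prolog
% Paraphrase of the GPT-4 rule learned from examples and counterexamples.
:- dynamic package/1, containsClass/2, importsClass/2.

% The negated literals misuse importsClass, which the meta-model defines
% between classes, not packages. Note also that X and Y occur nowhere else
% in the body: the link importsClass(X, Y) from our hypothesis is missing.
isAllowedToUse(A, B) :-
    package(A),
    package(B),
    containsClass(A, X),
    containsClass(B, Y),
    \+ importsClass(A, B),
    \+ importsClass(B, A).
```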
The rule that Google Bard learned with counterexamples, however, was not as good as the one it learned from positive examples alone. It wrongly included the condition that B is a class, i.e., class(B), which does not match our meta-model: the architectural relationship under consideration, isAllowedToUse(A, B), holds only between packages. This error renders the rule incorrect for all provided positive examples.
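In Prolog-style notation, the offending condition can be sketched as follows; the remaining body literals of Bard's rule are omitted, and the paraphrase is ours:

```prolog
% Paraphrase of the flawed part of the Google Bard rule; other body literals omitted.
:- dynamic class/1.

% class(B) types B as a class, but every positive example binds B to a package,
% so this condition fails for all provided positive examples.
isAllowedToUse(A, B) :-
    class(B).
```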
Table 3. Learning duration and rule accuracy for each tool in Experiment 1.

| Tool | With Examples | | With Examples and Counterexamples | |
|---|---|---|---|---|
| | Learning Time (s) | Accuracy | Learning Time (s) | Accuracy |
| Aleph | 0.010 | 1.0 | 0.097 | – |
| AMIE3 | 0.150 | 1.0 | – | – |
| AnyBURL | <1.000 | 1.0 | – | – |
| ILASP | 4.922 | 1.0 | 1918.680 | – |
| Popper | <1.000 | 1.0 | <1.000 | 1.0 |
| | # User Messages | Accuracy | # User Messages | Accuracy |
| Google Bard | 6 | 1.0 | 44 | 0.0 |
| GPT-3.5 | 19 | 0.0 | 54 | 0.0 |
| GPT-4 | 22 | 1.0 | 44 | 0.5 |
Except for ILASP, all IRL tools had learning times below one second in Experiment 1. Aleph was the fastest at 0.010 s, while ILASP took 4.922 s. In the second run, Aleph and Popper again had low learning times, but ILASP took 1918 s. This long learning time can be attributed to ILASP exploring the entire hypothesis space, after which it was unable to learn a rule and declared the task unsatisfiable. All rules learned by the IRL tools achieved 100% accuracy, except in the Aleph setting where no parameterized rule was learned.
For the tested LLMs, the outcomes exhibit greater variability. GPT-3.5 struggled to consistently formulate precise rules. Google Bard, in contrast, succeeded in developing an accurate rule but only for positive examples. GPT-4 demonstrated a mixed performance, achieving perfect accuracy (1.0) when working with positive examples, but its accuracy dropped to 0.5 with the inclusion of counterexamples, due to the misclassification of all counterexamples.
Assessing the learning time in interactions with chatbot-like interfaces for LLMs is challenging, as it is not directly comparable to the learning time of IRL tools. We therefore report the number of messages that the user sent to the chatbot (# User Messages in the table). The total number of messages exchanged in these chats is approximately double this figure, owing to the back-and-forth dialogue between the user and the LLM during the rule-learning process. The duration of each LLM experiment, which includes retrieving relevant responses from the LLM, composing replies, and documenting the chat history for the reproduction dataset, consistently ranged from 10 to 15 min.
After conducting experiments with tools from the distinct ML domains of IRL and LLMs, it is enlightening to highlight some fundamental differences in the learning process’s dynamics and user experience.
With IRL tools, the learning process predominantly involves setting up the test data (KB and examples) for the task at hand, enabling the tools to autonomously navigate through the hypothesis space until a rule is identified. Once initiated, this task proceeds without requiring any further input from the user or expert.
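To make this concrete, the sketch below shows what a minimal Popper-style task setup for this rule could look like. The file layout (bias.pl, bk.pl, exs.pl) follows Popper's conventions, while the facts, class names, and examples are illustrative placeholders rather than our actual experiment data.

```prolog
% bias.pl: declares the target predicate and the predicates allowed in rule bodies.
head_pred(isAllowedToUse,2).
body_pred(package,1).
body_pred(containsClass,2).
body_pred(importsClass,2).

% bk.pl: background knowledge extracted from the code base (illustrative subset).
package(teammates_logic_api).
package(teammates_logic_core).
package(teammates_storage_entity).
containsClass(teammates_logic_core, accountsLogic).   % hypothetical class fact
containsClass(teammates_logic_api, gateKeeper).       % hypothetical class fact
importsClass(gateKeeper, accountsLogic).              % hypothetical import fact

% exs.pl: positive and negative examples of the architectural rule.
pos(isAllowedToUse(teammates_logic_core, teammates_logic_api)).
neg(isAllowedToUse(teammates_logic_core, teammates_storage_entity)).
```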
In contrast, LLMs, particularly when employing the flipped interaction mode in our experiments, initiate the learning process with an initial prompt, followed by an iterative exchange of messages between the LLM and the user. This interactive approach not only accommodates a diverse range of messages from both parties but also supports adaptive responses during the rule-learning phase. We recognize significant potential in further exploring these interactive possibilities, as they have a profound impact on both the learning journey and its outcomes.
Regarding RQ1, our findings indicate that Popper, using examples and counterexamples, learned a rule closely aligned with the proposed hypothesis. The other IRL tools learned rules that did not fully capture the essence of the architectural rule but still covered the presented examples. Notably, these tools generated rules almost instantly for the small-scale experiments and achieved perfect accuracy. In the context of learning software architectural rules, the LLMs exhibited significant limitations: despite being trained on vast datasets, they frequently struggled to learn rules that closely align with our hypothesis. Only in one experimental setup (Google Bard, without counterexamples) did the essence of our hypothesis become evident in the learned rule. Overall, we observed ML techniques from both IRL and LLMs that showed promise in learning effective rules in our experiments. However, many of the tested approaches tended towards simpler explanations for the provided examples, which resulted in basic rules that ultimately fell short of our expectations.
5.2. Results of Experiment 2
Table 4. Experiment 2: Popper's learned rule bodies for the rule head isAllowedToUse(A, B) and their amount for different example subsets.

| Learned Rule Body | # of Subsets |
|---|---|
| | 21 |
| | 2 |
| | 2 |
| | 2 |
For the remaining five cases where different rules were learned, we examined the examples within each subset to determine the reasons for the variation. In two instances, Popper derived isAllowedToUse(A, B) from a free variable C that contains both packages A and B. These instances correspond to accurate rules, as all examples within each subset belonged to the same parent package (either teammates.logic or teammates.storage), which was not the case for the counterexamples; this difference drove the variation in the learned rules.
In the case of the other two learned rules (learned twice and once, respectively), we noticed a common counterexample, not(isAllowedToBeUsed(teammates_logic_core, teammates_storage_entity)), that appeared in all three subsets. However, we believe that this particular negative example may be inaccurately labeled or may represent an inconsistency between the documented architecture and the implementation.
We identified a class (FeedBackResponseLogic) within the package teammates.logic.core that imports and interacts with objects of the class FeedbackResponse, which is located in the package teammates.storage.entity. This violates the architectural rule, since the documented architecture does not allow objects from the other package to be used. In the implementation, however, classes from both packages are used together, contradicting this restriction. Due to this inconsistency, the expected rule cannot distinguish the provided examples from the counterexamples. The violation casts doubt on the accuracy of the labeling and highlights a potential discrepancy between the documented architecture and its actual implementation.
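In Prolog-style terms, the inconsistency can be encoded as follows; the identifiers mirror the classes and packages named above, and the encoding itself is illustrative:

```prolog
% Facts reflecting the observed implementation (illustrative encoding).
package(teammates_logic_core).
package(teammates_storage_entity).
containsClass(teammates_logic_core, 'FeedBackResponseLogic').
containsClass(teammates_storage_entity, 'FeedbackResponse').
importsClass('FeedBackResponseLogic', 'FeedbackResponse').

% The documented architecture labels this pair of packages as a counterexample,
% yet the facts above exhibit exactly the containsClass/importsClass pattern
% that the positive examples exhibit, so no rule over these predicates can
% cover the examples while excluding this counterexample.
```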
In response to RQ2, our findings demonstrate that Popper is capable of inducing rules from various subsets of examples within the same system, suggesting a degree of generalizability across different architectural contexts. However, the specific subset employed can influence the learned rules and, consequently, their generalizability. For instance, if the IRL tool can find a simpler explanation, such as "all positive examples involve packages located in the same parent package", it prioritizes this shorter explanation over more nuanced rules.