Paper Review: Automatically Generating Precise Oracles from Structured Natural Language Specifications



Swami is an automated technique that extracts test oracles and generates executable tests from structured natural language specifications. It focuses on exceptional behavior and boundary conditions (IMPORTANT).

Generated test = driver script + assertions, parameterized with inputs.

Our technique, Swami, consists of three steps: identifying parts of the documentation relevant to the implementation to be tested, extracting test templates from those parts of the documentation, and generating executable tests from those templates. We now illustrate each of these steps on the Array(len) constructor specification from ECMA-262 (Figure 1).
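To make the pipeline concrete, here is a hypothetical sketch (mine, not the paper's actual output) of the kind of executable test a Swami template could instantiate for the Array(len) exceptional behavior: per ECMA-262, if len is a Number and ToUint32(len) !== len, the constructor must throw a RangeError.

```javascript
// Hypothetical sketch of a generated exceptional-condition test for the
// Array(len) constructor (ECMA-262): if len is a Number and
// ToUint32(len) !== len, the constructor must throw a RangeError.
function testArrayLenBoundary(len) {
  try {
    new Array(len);
  } catch (e) {
    // Pass only if the spec-mandated exception type is raised.
    return e instanceof RangeError;
  }
  // No exception was thrown: the test fails for invalid lengths.
  return false;
}

// Parameterized with boundary inputs, as the driver script would be:
testArrayLenBoundary(-1);          // -1 !== ToUint32(-1)
testArrayLenBoundary(4294967296);  // 2^32 wraps to 0 under ToUint32
```

The function name and structure are illustrative; the paper's templates are instantiated from rule-extracted text, not handwritten.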


Evaluated on the ECMA-262 JavaScript spec. Of the tests Swami generates, 60.3% are innocuous (they can never fail). Of the remaining tests, 98.4% are precise to the specification and only 1.6% are flawed and might raise false alarms.

Comment: is the high innocuous percentage a problem?
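For intuition, a hypothetical example (mine, not from the paper) of what an innocuous test looks like: the template's guard filters out every input that could make the assertion fail, so the test passes vacuously on all inputs.

```javascript
// Illustration of an "innocuous" test: the guard rejects every input
// for which Array(len) could throw, so the oracle can never fail.
function testInnocuous(len) {
  // Guard: only proceed for valid uint32 lengths...
  if (!(Number.isInteger(len) && len >= 0 && len < 2 ** 32)) {
    return true; // ...everything else passes vacuously,
  }
  try {
    new Array(len); // ...and valid lengths never throw,
    return true;    // so the test passes on every input.
  } catch (e) {
    return false;   // unreachable in practice
  }
}
```

Such a test compiles and runs, which is why a compile-time filter alone cannot catch it.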

Discovered 1 previously unknown defect (suspicious) and 15 missing JavaScript features in Rhino, 1 previously unknown defect in Node.js, and 18 semantic ambiguities in the ECMA-262 specification.

"good tests that correctly encode the specification and would catch some improper implementation".

Hmm, how do we verify that it correctly encodes the specification?

The specification often says two values should be equal without specifying which equality operator should be used.

Well — that’s why we use NL!
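The ambiguity is easy to demonstrate: JavaScript's three equality notions disagree on edge cases, so when the spec merely says two values "are equal", a generated oracle still has to pick an operator.

```javascript
// The three JavaScript equality notions diverge on edge cases,
// so "the two values should be equal" underspecifies the oracle.
NaN == NaN;          // false
NaN === NaN;         // false
Object.is(NaN, NaN); // true

0 === -0;            // true
Object.is(0, -0);    // false

"1" == 1;            // true  (loose equality coerces)
"1" === 1;           // false
```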


"regular-expression-based approach": I actually quite appreciate the simplicity here.

Practical Value

What can you learn from this to make your research better?

Details and Problems

From the presenter's point of view, what questions might the audience ask?

Have authors reported the detected defects to the maintainers?

"We have submitted a bug report for the new defect [1] and a missing feature request for the 3 features not covered by existing requests [2]."

[1] Weird: Number doesn't have a prototype, so it's undefined. Looks like the spec is under-specified.

[2] Doesn't implement three APIs.

In a study of ten popular, well-tested, open-source projects, the coverage of exception handling statements lagged significantly behind overall statement coverage.

The spec might be outdated as well. The implementation is available.

Symbol resolution / correspondence problem?

Does it operate over plain text? What about the existing hypertext structure (why convert PDF to text)?

"…uses the abstract operation ToUint32(len), specified in a different part of the specification document (Figure 2), and Swami needs an executable method that encodes that operation. For JavaScript, we found that implementing 10 abstract operations, totaling 82 lines of code, was sufficient for our purposes."
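A minimal sketch (assuming the ECMA-262 definition; this is not the paper's actual code) of what one such hand-implemented abstract operation looks like, ToUint32: truncate to an integer, then wrap into [0, 2^32).

```javascript
// Sketch of the ToUint32 abstract operation from ECMA-262, one of the
// ~10 operations the authors implemented so generated tests can call
// them. Not Swami's actual code.
function toUint32(value) {
  const n = Number(value);
  if (!Number.isFinite(n) || n === 0) return 0; // NaN, ±Infinity, ±0 map to 0
  const int = Math.trunc(n);                    // drop the fractional part
  return ((int % 2 ** 32) + 2 ** 32) % 2 ** 32; // wrap into [0, 2^32)
}

toUint32(-1);         // wraps to 4294967295
toUint32(4294967296); // 2^32 wraps to 0
toUint32("3.7");      // coerces and truncates to 3
```

In JavaScript the built-in `>>> 0` performs the same conversion, which is a handy cross-check.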

For all sections, how many of them are mapped to one/many classes? What is the number of extracted test templates in each of them?

"Swami generates two types of tests: boundary condition and exceptional condition tests." That seems a fairly restrictive condition. Why not value tests?

Maybe not really NLP or structural.

How long is the typical generated test?


Applicability: what if we apply Swami on less well-maintained spec? How will it help small teams?


What is the “recall”? How many manual tests can be replaced? Why?

How is Okapi used? Why is it optional?

"Swami's regular expression approach to Section Identification is precise: in our evaluation, 100% of the specification sections identified encoded testable behavior." But it requires a specific specification structure; without that structure, Swami falls back on its Okapi-model approach. Which one is used in the evaluation, the regex approach or Okapi? "When the documentation is not as clearly delineated, Swami can still identify which sections are relevant, but it requires access to the source code." And elsewhere: "…whereas Swami can generate black-box tests entirely from the specification document, without needing the source code."
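For reference, a toy sketch of Okapi BM25 relevance scoring, the IR model behind the fallback approach (parameter defaults k1 = 1.2, b = 0.75 are the conventional ones; this is illustrative only, not the paper's implementation).

```javascript
// Toy BM25 scorer: rank tokenized "sections" against a query.
// Illustrative sketch, not Swami's implementation.
function bm25Score(queryTerms, doc, docs, k1 = 1.2, b = 0.75) {
  const avgdl = docs.reduce((s, d) => s + d.length, 0) / docs.length;
  let score = 0;
  for (const term of queryTerms) {
    const f = doc.filter((w) => w === term).length;        // term frequency in doc
    const n = docs.filter((d) => d.includes(term)).length; // docs containing term
    const idf = Math.log((docs.length - n + 0.5) / (n + 0.5) + 1);
    score += (idf * f * (k1 + 1)) /
             (f + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}

// Hypothetical tokenized sections and query:
const sections = [
  ["array", "length", "constructor", "range"],
  ["string", "prototype", "slice"],
];
const q = ["array", "length"];
// The first section scores higher for this query; the second scores 0.
```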

However, of the irrelevant specifications, 45.3% do not satisfy the Template Initialization rule, and 27.3% do not satisfy the Conditional Identification rule. All the test templates generated from the remaining 27.4% fail to compile, so Swami removes them, resulting in an effective precision of 100%.

Eaddy et al. [16] have constructed a ground-truth benchmark by manually mapping parts of the Rhino source code (v1.5R6) to the relevant concerns from ECMA-262 (v3). Research on information retrieval in software engineering uses this benchmark extensively [17], [25]. The benchmark consists of 480 specifications and 140 Rhino classes. On this benchmark, Swami's precision was 79.0% and recall was 98.9%, suggesting that Swami will attempt to generate tests from nearly all relevant specifications, and that 21.0% of the specifications Swami may consider generating tests from may not be relevant.

What does the benchmark number mean?

This looks ridiculous.

I think a more formal description of the steps will be useful.


The Node executable swallows exceptions.

83: number of test templates generated; 1000: total number of tests.

Why 1000 tests?