GitHub - delph-in/erg: English Resource Grammar


Release notes for stable version "ERG 2023"

Highlights:
Improved overall syntactic coverage on Redwoods profiles to 93.77% on 100K items.
Improved parse selection by about 1% using new redwoods.mem model.
Improved overall parsing efficiency by about 20%.

2021-12-14 - Added files for Singlish dialect, authored by Siew Yeng Chow based
on her Master's thesis at NTU.

2022-07 - Incorporated changes to enable chart-mapping in LKB-FOS, thanks to
John Carroll.

2022-10 - Adopted Emerson-Turing construction types for appending SLASH, with
thanks to Guy Emerson and John Carroll.

2022-11 - Improved Version.lsp, METADATA, and grammar-loading files for better
interface with LTDB, thanks to Francis Bond.

Because the erg.hds file is now generated each time the grammar is loaded into
the LKB, erg/etc/rules.hds has been discarded.
------------------------------------------------------------------------------
Release notes for stable version "ERG 2020"

Punctuation marks now separate tokens
- Revised syntactic analysis to treat all punctuation marks as separate tokens
  instead of as affixes.  Syntactic rules therefore combine a punctuation
  token with either the immediately preceding or the immediately following
  token, except for the possessive apostrophe, which attaches to the
  preceding NP.  Thanks to Stephan Oepen for
  motivation, assistance, and guidance in making this conversion, enabling
  better consistency of ERG output with that of other NLP tools and conventions.
  Also thanks to Woodley Packard for engineering support to accommodate
  treebanking updates in the face of near-universal changes in token counts.
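
As a rough illustration of why token counts changed almost everywhere, the
following sketch contrasts punctuation kept as affixes with punctuation split
off into separate tokens.  It is only a toy under stated assumptions: the
function names are made up for this example, the regular expression is not the
ERG's own preprocessing, and special cases such as the possessive apostrophe
are not modeled.

```python
import re


def affix_tokens(text):
    """Whitespace tokens, with punctuation left attached as affixes."""
    return text.split()


def separate_tokens(text):
    """Split every punctuation mark off as its own token (toy version)."""
    # \w+ grabs a run of word characters; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Kim arrived, and Sandy left."
print(affix_tokens(sentence))
# ['Kim', 'arrived,', 'and', 'Sandy', 'left.']
print(separate_tokens(sentence))
# ['Kim', 'arrived', ',', 'and', 'Sandy', 'left', '.']
```

Even this short sentence goes from five tokens to seven, which is the kind of
shift the treebanking updates had to accommodate.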

Full Redwoods treebank update
- All of the usual treebanked profiles, totaling 1.5M tokens, have now been
  updated using the full-forest treebanking tool fftb, and reflecting the
  changed analysis of punctuation.  An additional 1000 items from WSJ section
  23 have also been treebanked after the release was stable, to provide a new
  set of annotations for evaluation.

Documentation strings throughout the grammar
- Both ACE and the LKB, along with PyDelphin, now fully support the use of
  triple-quote-marked documentation strings on types and instances, so these 
  have been added to most instances of leaf lexical types, constructions, and 
  lexical rules in the ERG. Thanks to Francis Bond for pushing this cause 
  forward, and to developers for accommodating the necessary formalism changes.
  Alas, PET does not yet provide full support, so this release of the grammar 
  includes variants of several grammar files ("...for-pet.tdl") where the doc 
  strings have been deleted.  For now, compile and run PET using these variant 
  files as follows:
    flop english-for-pet
    cheap -cm -repp -default-les=all -packing -verb=4 english
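
For readers who maintain their own TDL files and want a PET-friendly copy, one
possible way to delete triple-quoted documentation strings is sketched below.
This is an assumption-laden illustration, not a description of how the
"...for-pet.tdl" files in this release were produced: the function name and the
sample type are invented, and a plain regular expression like this one ignores
TDL comments and ordinary strings.

```python
import re

# A triple-quoted documentation string, plus any trailing blanks and newline.
DOCSTRING = re.compile(r'"""(?:.*?)"""[ \t]*\n?', re.DOTALL)


def strip_docstrings(tdl_text):
    """Return tdl_text with every triple-quoted doc string removed."""
    return DOCSTRING.sub("", tdl_text)


# A made-up type definition, used only to exercise the function.
sample = (
    "toy_type := toy_parent &\n"
    '"""\n'
    "A made-up documentation string.\n"
    '"""\n'
    "  [ TOY_FEATURE toy_value ].\n"
)

print(strip_docstrings(sample), end="")
# toy_type := toy_parent &
#   [ TOY_FEATURE toy_value ].
```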

------------------------------------------------------------------------------
In trunk, as an interim update,
- refreshed support for openproof generation
- expanded coverage with mal-rules and types
- added full-forest treebank profiles for wsj06-09

------------------------------------------------------------------------------
Release notes for stable version "ERG 2018"

Annotations
- Supplied full-forest treebanks for Redwoods profiles, including the first five
  sections of the WSJ.
- Added profiles for WeSearch user-generated content (wlb03, wnb03), and for
  a Sherlock Holmes story (sh-spec).
- Improved the well-formedness and consistency of the MRSs, aiming for more
  consistency with an updated version of the semantic algebra.

Token mapping
- Upgraded to use GML 1.0, particularly relevant for the WeScience corpus.
- Improved support for both `strong' brackets (manually inserted) and `weak' 
  delimiters (motivated by, for example, hyphens) to signal phrase boundaries
  that should not be crossed.

Syntax
- Enabled extraction from within NPs.
- Changed the attachment order of pre- and post-nominal modifiers, so now the 
  pre-modifiers attach before post-modifiers.
- Added two new types of modifiers of verbal projections: indefinite NPs, as in
  |She walked out of the casino a rich woman.|; and gapped clauses of saying, as
  in |They will, we suspect, leave early.|
- Added the `do-be' construction as in |the only thing we said she had to do was
  finish the assignment|

Semantics
- Moved information-structure constraints from RELS to ICONS in MRSs, including 
  for focus movement (`topicalization') and passivization.
- Simplified the inventory of role labels, notably for conjunction relations, 
  where ARG1 and ARG2 replace the old L-HNDL, R-HNDL, L-INDEX, R-INDEX roles.
- Improved SEM-I consistency.

Platforms and applications
- Added support for ubertagging with PET and ACE, thanks to Rebecca Dridan.
- Added robust parsing mode for ACE using csaw PCFG, thanks to Woodley Packard.
- Added support for `agree' parser/generator
- Expanded the inventory of mal-rules for grammar-checking.
- Added `transfer' rules and support for generation from first-order logic.
- Added support for manually inserted `strong' brackets which force phrasal
  boundaries, as in |we ⌊(⌋ saw a man ⌊)⌋ with a telescope|.
- Changed RNAME value on rules from string to type, to allow `weak' bracket
  rules to constrain which rules must apply, as for named entities such as
  |New York Stock Exchange|.
- Added support for robust `bridging' rules (disabled by default).

------------------------------------------------------------------------------
Release notes for trunk version 2016-09-27

Full-forest treebanking of the Redwoods profiles, and eventually WSJ as well,
is now underway, with minor grammar corrections being made along the way.

------------------------------------------------------------------------------
Release notes for trunk version 2015-06-19

[After a long hiatus, returning to commenting on trunk version changes.]

Tuned paraphrase rules both for educ and for openproof.  The educ set are
mostly for generating variant correct answers for the new Reading composition
exercises in the Redbird Language Arts course.  The openproof modifications
are aimed at reducing the remaining ambiguity in the generated English outputs.

------------------------------------------------------------------------------
Release notes for trunk version 2013-03-19

Added two constructions motivated by Sherlock Holmes corpus: 
(1) adverbial clauses with gaps and verbs of saying, as in 
    |You have, I presume, considered this.|.
(2) adverbial indefinite NPs as VP modifiers, as in
    |He arrived a hero and departed a villain|
Also improved treatment of present participles as adjectives, employing verb
predications for semantics.

------------------------------------------------------------------------------
Inflectional rules: instances made one-to-one with types

------------------------------------------------------------------------------
Release notes for version "ERG (1212)"

Stable tagged release, including updates of all tsdb/gold profiles.  This
release is also used for the treebanked profiles of DeepBank 1.0, the Wall
Street Journal corpus included in the Penn Treebank.

Details and an online demo can be found at
www.delph-in.net/erg

------------------------------------------------------------------------------
Release notes for version "ERG (1111)"

Stable tagged release, including updates of all tsdb/gold profiles, plus
the addition of two new profiles from the Tanaka corpus: rtc000 and rtc001.
Details on ERG coverage of all gold profiles can be found on the 
Redwoods web page: http://www.delph-in.net/redwoods.

------------------------------------------------------------------------------
Update of `trunk' version as of August 2011:

Added coverage for the following phenomena:
- pre-determiner adjective phrases, as in 
  |too tall a building|.
  |too strong an opponent to overcome|
- `enough' + VP/NP complement, as in
  |We met a tall enough player to hire.|
- sentence-initial indefinite NP `depictives', as in
  |A happy cat, she purred.|
- extraposed relative clauses, as in
  |A cat appeared suddenly which had no tail.|
- gapping constructions, where the second head in a conjoined VP or S is missing
  |He persuades Kim to sing and Abrams to act.|
- `do-be' construction as in 
  |The only thing we didn't expect him to do was give himself a raise.|
- conditional inversion, as in
  |Were we to visit Paris, we would be happy.|
- more freedom in ordering of complements
  |The book was given to Kim by Sandy.|
  |The book was given by Sandy to Kim.|

Also made minor improvements for generation, including corrected trigger rules.

Gold profile updates are included only for csli, mrs, hike, cb, and jh1.

------------------------------------------------------------------------------
Release notes for version "ERG (1010)"

Stable tagged release with full (manual) updates of all gold profiles 
including LOGON, WeScience, and (after a long hiatus) the Verbmobil and
ecommerce treebanks, along with the newly added SemCor (semantically
tagged portion of the Brown corpus - the first 3100 items so far).  Details
on current ERG coverage of these profiles can be found on the Redwoods 
web page: http://www.delph-in.net/redwoods.

------------------------------------------------------------------------------
Release notes for version "ERG (1007)"

Minor improvements for better coverage of WSJ corpus and of the education and
speech application corpora.

------------------------------------------------------------------------------
Release notes for version "ERG (1004)"

This is intended as a `stable' release, accompanied by a full manual update
of the `gold' treebanked profiles, and parse-ranking models trained on them.

------------------------------------------------------------------------------
Release notes for version "ERG (1003)"

- This release is essentially a pre-release of a proposed stable release
  next month ("ERG (1004)"), and will serve as the basis for final tuning,
  debugging, and updating of the various treebanks.

- At long last, the rule names have all been converted to conform to the
  naming scheme proposed in 2008, and described at 
  http://wiki.delph-in.net/moin/ErgTop

  PLEASE NOTE that all pre-existing treebanks constructed using the ERG
  will have to be converted before they can be used for treebank updates 
  with this new grammar version.  See
  http://wiki.delph-in.net/moin/ErgRules 
  for instructions to effect this conversion automatically.

- This release also includes adaptation of the arboretum files for current
  use in the EPGY grammar-checking application.

- The chart-mapping machinery includes a revised treatment of quote marks
  in their splendid variety, aiming for more normalization in preprocessing,
  thanks to Stephan Oepen.

- The grammar currently includes some temporary patches to support generation
  using unknown words, most recently for experiments in generating from
  DMRSs.  While largely functional, this should be considered work still in 
  progress, since at least the mechanism for assigning semantic predicate
  names to unknown words is far from ideal.

- The `gold' directory now contains an additional profile `petet' for the
  Evaluation by Textual Entailment trial data set.  In addition to this
  new profile, the following usual three `gold' profiles have been updated
  using this version of the grammar: csli, mrs, hike.  Expectations are
  that the full `gold' collection of profiles will be updated by April.

------------------------------------------------------------------------------
Release notes for version "ERG (1002)"

- Re-working of arboretum files to apply to error analysis in grammar checking

- In preprocessor, factoring out of treatment of quote marks.

- Better interim accommodation for unknown words in generation, consistent
  with current naming convention for unknown predicates (see erg/tmr/pos.tdl).

------------------------------------------------------------------------------
Release notes for version "ERG (0909)"

- Addition of EPGY-specific types and lexical entry constraints

------------------------------------------------------------------------------
Release notes for version "ERG (0907)" (the Barcelona release)

- Note that the attribute STEM has been renamed to ORTH, for more clarity.
  For those few who use the lexical database in connection with the ERG, it 
  will be necessary to reload the database, using the revised table 
  definitions in this version of the grammar.

- Improved coverage to admit some VP-modifying relative clauses, as in
  'Abrams hired us, which bothers Browne.'

- Further stabilizing of the chart-mapping machinery for preprocessing and
  for accommodation of unknown words.

- Extended support for generation with unknown words

- Experimental support for paraphrasing as an external LOGON MT-like task 
  which uses the external SEM-I (semantic interface) specification.  See 
  the file in $LOGONROOT/uio/enen/README for a quick introduction.

------------------------------------------------------------------------------
- Added treebanks for WeScience profiles 3 and 4.
- Added more support for generation with unknown words

------------------------------------------------------------------------------
Release notes for version "ERG (0902)"

- First version making use of the new chart-mapping machinery
  - please note that you will need a correspondingly new version of the LKB
    and PET (no older than 22-Feb-09; PET compiled off its `cm' branch).
  - for the LOGON tree, please use the `trunk' version and select appropriate 
    PET binaries (from the `cm' branch) as `flop -t' and `cheap -t'.
- Added the first four treebanked profiles in the WeScience corpus
- Updated other profiles in 'gold' subdirectory 
  - but note that a few still await updating, including the SensEval and
    SemCor profiles.

------------------------------------------------------------------------------
Release notes for version "LinGO (July-08)"

- Elaborated the chart-mapping rules to accommodate the existing treebanked
  corpora, including more systematic treatment of POS-driven unknown word
  handling.
- Added additional treebanks for some corpus data from Senseval, SemCor, and
  ILIAD (Melbourne).
- Added syntactic coverage for some additional sentential modifier phrases
  ('as'+passiveVP, and NP predicatives like "His project fully funded, Abrams
  celebrated."), and for marked word order with PPs appearing before some VPs,
  and before some complement NPs.

---------------------------------------------------------------------------
Release notes for version "LinGO (Apr-08)"

- Include experimental chart-mapping preprocessor rules in inpmap-rules.tdl
  and lexmap-rules.tdl.
- Enriched the hierarchy of semantic predicates to support underspecification
  in translation, including abstract predicates for locative 'in, on, at'
  and for 'the, a, udef' quantifiers.
- Tuned lexicon for SemCor data to support treebanking.
- Added syntactic coverage for 'small clause' predicatives such as
  "The dog barked, its heart beating wildly."

---------------------------------------------------------------------------
Release notes for version "LinGO (26-Jan-08)"

Final tuning for SciBorg's first treebank of six abstracts
Final tuning for LOGON/HandOn treebank update
---------------------------------------------------------------------------
Release notes for version "LinGO (24-Jan-08)"

A few corrections to lexical entries based on most recent HandOn fan-outs
---------------------------------------------------------------------------
Release notes for version "LinGO (23-Jan-08)"

Added a few missing lexical entries for degree specifiers
---------------------------------------------------------------------------
Release notes for version "LinGO (21-Jan-08)"

And still more tuning - maybe the final round - for HandOn

------------------------------------------------------------------------------
Release notes for version "LinGO (20-Jan-08)"

1. More tuning for HandOn driven by 'sti' and 'vei' fan-out logs

------------------------------------------------------------------------------
Release notes for version "LinGO (17-Jan-08)"

1. Minor adjustments to lexicon, grammar, and trigger rules for fine-tuning
   of HandOn system.

------------------------------------------------------------------------------
Release notes for version "LinGO (15-Jan-08)"

1. Added vocabulary for HandOn based on missing predicates from NoEn
2. Completed tuning of lexicon and preprocessing for HandOn English data
3. One recent change that may affect transfer:
   Decomposition of N-V compounds like "snow-covered" and "T-marked"
   - used to be multi-words with single predicate, but are now constructed
     via compound rule, with the two component EPs and an additional 
     linking EP with PRED |argument_rel| similar to |compound_rel|

------------------------------------------------------------------------------
Release notes for version "LinGO (Nov-07)"

1. Treebanks 
  - Updated all treebanks in erg/gold, but have not yet rebuilt jhpstg.mem file
2. MRS quality improvements / harmonization
  - Added type constraints on ARG1s for several classes of modifiers
  - Corrected missing semantic link in P-PP construction "from behind the hill"
  - Removed spurious pron_rel from infinitival subordinate constructions like
      "Kim sang to impress Sandy."
  - Made minor changes to title construction:
    - changed pred name for post-head titles to be consistent with pre-head one
    - corrected rule for number-headed phrases like "page 3"
  
------------------------------------------------------------------------------
Release notes for version "LinGO (Oct-07)"

Added lexical coverage for vocabulary in the English data for the HandOn
project, in this case keeping the large number of domain-specific proper 
names in a separate file 'handon-propers.tdl'.  Also made some repairs to
remaining inconsistencies in MRSs in the message-free universe.

In addition, did several bits of minor tuning of syntactic constructions 
in support of the DFKI Checkpoint project, and added first version of the 
token-mapping rules for PET's emerging support for this functionality.
This release also includes an additional settings file for PET, 'mrs.set',
to support development of generation capability for PET.

Note that only three of the 'gold' profiles (csli, hike, and mrs) have been
updated in this release; the rest will follow shortly.

------------------------------------------------------------------------------
Release notes for version "LinGO (Jul-07)"

Added lexical coverage for several additional treebanked data sets, including
Senseval 2-4, FraCaS, SciBorg, and Acrolinx (though the latter two data sets
are not distributable).  Also updated the full set of 'gold' profiles for
the existing data sets.

PLEASE NOTE that this version requires an up-to-date version of the LKB to
get correct behavior with the treebanked data in 'gold', since the derivation
trees are now augmented with a specification of which root constraint was
used to admit each tree.
------------------------------------------------------------------------------
Release notes for version "LinGO (21-Mar-07)"

The most significant change in this version of the ERG is the complete removal
of messages, as announced at the Fefor DELPH-IN meeting to follow the
completion of the LOGON demonstrator.  This version is a nearly exact non-msg
equivalent of the final LOGON version "LinGO (17-Mar-07)", so it should be
straightforward to compare and contrast the two variants.  In brief, the
distinction among propositions, questions, and commands is now made via the
value of the attribute SF ('sentence force', i.e., illocutionary force), a
property of events.  This attribute and its values are also used in the
most recent release of the Grammar Matrix.

In addition, this release contains the following modifications/improvements,
the first of which is also included in the final LOGON version:
- Adoption of Stephan Oepen's proposal for a more uniform treatment of
  properties of MRS events and indices
- Adoption of Berthold Crysmann's proposal for the full cross-product of
  subtypes for encoding person-number
- Addition of missing vocabulary for the Senseval 2 test data
- Addition of pragmatic EPs to encode focus (formerly referred to as 
  'topicalization') and promoted arguments in passive constructions.

------------------------------------------------------------------------------
Release notes for version "LinGO (17-Mar-07)"
(Final version with messages)

Added missing lexical entries for the known-vocabulary held-out portion of
the LOGON corpus (43 proper names and 5 common nouns)
------------------------------------------------------------------------------
Release notes for version "LinGO (20-Dec-06)" (Final LOGON version)

- A few more corrections for mixed case and preprocessor

------------------------------------------------------------------------------
Release notes for version "LinGO (19-Dec-06)"

- Added feature on 'index' called IND for 'individuated', to enable distinction
  in SEM-I among count nouns, mass nouns, and mass-or-count nouns.
- Added further improvements to mixed case orthography, here primarily for
  country-related adjectives and nouns like "Englishman" and "Norwegian"

------------------------------------------------------------------------------
Release notes for version "LinGO (15-Dec-06)"

- Corrected preprocessor to preserve mixed case for Norwegian 'special'
  characters - they were being downcased because PET doesn't like them for
  interactive parsing, but now that the lexical entries require mixed case
  for proper names, we have to keep it for batch processing, which works okay.
- Modified lexical entries per latest transfer requests: 
   - deleted bogus adjective entries for "U-shaped"
   - corrected "T-marked" to also work in predicative position
   - added entry for "noticeable that S"

------------------------------------------------------------------------------
Release notes for version "LinGO (14-Dec-06)"

- Added more consistent capitalization in orthography (and CARG) for ERG
  lexicon, to enable higher quality generation.
- Added spelling variants for two lexical entries used in training corpus:
  'Hedmarker/Hedemarker' and 'El Dorado/Eldorado'
- Added entries as requested by transfer: 'mackerel', 'arboretum', "vitamin C"

------------------------------------------------------------------------------
Release notes for version "LinGO (13-Dec-06)"

More adjustments for final LOGON integration:
- Added and corrected lexical entries as requested by Transfer
- Corrected generator 'black hole' errors so generator will always terminate
  at least on the usual test suite data.
- Incorporated new LNK feature which replaces old WLINK, for mapping from
  MRS relations to their corresponding surface form positions [oe]

------------------------------------------------------------------------------
Release notes for version "LinGO (01-Dec-06)"

Minor additions for final LOGON integration:
- Added remaining missing lexical entries for known-vocabulary held-out data
- Improved efficiency for generation with coordination, by collapsing
  near-duplicate lexical entries for conjunctions
- Made minor corrections guided by LOGON fan-out logs, to improve both
  coverage and quality

------------------------------------------------------------------------------
Release notes for version "LinGO (Nov-06)"

NOTE: Users of this version of the ERG are strongly encouraged to also 
      obtain a current version of the LKB and [incr tsdb()], in order to 
      benefit fully from recent enhancements.

- Since the last public release in July, the ERG's lexicon has been expanded
  to include about 3000 additional nouns and adjectives that occur with high 
  frequency in the British National Corpus (100 times or more).  
- Some additional technical vocabulary was added to accommodate a sample
  of data from the Cambridge SciBorg project; these lexical entries are
  also tagged with "SciBorg" in the lexical database.
- The remaining changes have focused on tuning the grammar and SEM-I for 
  generation in the near-final LOGON demonstrator.
- Updated treebank summary for LOGON data, in erg/gold.  Note that this
  version of the treebank benefitted from the welcome addition to PET by 
  Yi Zhang enabling the selective unpacking strategy used in the LKB.

        Profile   Items    Parsed  Treebank
        -----------------------------------
        JH0         261       248       226 
        JH1        1353      1302      1221 
        JH2        1307      1154      1058 
        JH3        1443      1367      1230 
        JH4        1603      1505      1416 
        JH5         464       420       398 
        PS          965       908       860 
        TG         2014      1875      1735 
        ROND       1290      1203      1133 
        -----     -----     -----     -----
        Totals    10700      9982      9277
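
The totals and the implied coverage rates can be rechecked with a few lines of
Python.  This is only a convenience sketch: the numbers are copied from the
table above, the helper name is made up, and it assumes the Treebank column
counts the items that ended up with a treebanked analysis.

```python
# (items, parsed, treebanked) per profile, copied from the table above.
PROFILES = {
    "JH0": (261, 248, 226),
    "JH1": (1353, 1302, 1221),
    "JH2": (1307, 1154, 1058),
    "JH3": (1443, 1367, 1230),
    "JH4": (1603, 1505, 1416),
    "JH5": (464, 420, 398),
    "PS": (965, 908, 860),
    "TG": (2014, 1875, 1735),
    "ROND": (1290, 1203, 1133),
}


def totals(profiles):
    """Sum each column across all profiles."""
    return tuple(sum(column) for column in zip(*profiles.values()))


items, parsed, treebanked = totals(PROFILES)
print(items, parsed, treebanked)
# 10700 9982 9277
print(f"parsed: {parsed / items:.1%}, treebanked: {treebanked / items:.1%}")
# parsed: 93.3%, treebanked: 86.7%
```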

-------------------------------------------------------------------------------
Release notes for version "LinGO (13-Oct-06)"

Added entries for digit-orthography cardinal adjectives to help generator.

-------------------------------------------------------------------------------
Release notes for version "LinGO (12-Oct-06)"

Maybe the final round of tuning for this integration:

1. Merged falsely ambiguous lexical predicates:

   NEW                    OLD
   "_fine_a_for_rel"      "_fine_a_1_rel" 
   "_good_a_at-for_rel"   "_good_a_for_rel"
   "_good_a_at-for_rel"   "_good_a_at_rel"
   "_good_a_at-for_rel"   "_good_a_1_rel"
   "_understand_v_by_rel" "_understand_v_1_rel"

2. Added missing trigger rules for it-cleft construction.

3. Corrected a few minor errors in grammar rules.
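
Anyone holding data that still uses the old predicate names from item 1 could
apply the merge with a simple lookup table, roughly as sketched here.  The
mapping is taken from the NEW/OLD list above; the function name and the sample
predicate list are placeholders for illustration, and nothing here describes
how other resources were actually updated.

```python
# Old predicate name -> merged new name, from the NEW/OLD list in item 1.
OLD_TO_NEW = {
    "_fine_a_1_rel": "_fine_a_for_rel",
    "_good_a_for_rel": "_good_a_at-for_rel",
    "_good_a_at_rel": "_good_a_at-for_rel",
    "_good_a_1_rel": "_good_a_at-for_rel",
    "_understand_v_1_rel": "_understand_v_by_rel",
}


def update_predicates(preds):
    """Replace merged predicate names; leave all other names untouched."""
    return [OLD_TO_NEW.get(pred, pred) for pred in preds]


# "_dog_n_1_rel" is just a placeholder for a predicate that is not affected.
print(update_predicates(["_good_a_at_rel", "_dog_n_1_rel"]))
# ['_good_a_at-for_rel', '_dog_n_1_rel']
```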

------------------------------------------------------------------------------
Release notes for version "LinGO (10-Oct-06)"

Still more minor tuning

1. Corrected entry for "_guess_" unknown-noun lex entry to work in compounds
2. Corrected NP fragment rules to allow fragments that are conjoined NPs
3. Enabled entry for prep "to" to also modify proper names.

------------------------------------------------------------------------------
Release notes for version "LinGO (09-Oct-06)"

More minor tuning for impending LOGON release:

1.  Fixed spelling of 'considerred' for dative passive form
2.  Enabled generation of implicit NP coordination
3.  Corrected lexical entry's PRED name for 'edge'
4.  Corrected modification of imperatives
5.  Added missing topmost message for sentence-initial conjunction
6.  Allowed Adj-N as title
7.  Added missing entries for 'follow' and 'transport': NP+PP-dir
8.  Corrected multiple SEM-I entries for 'choose' verb
9.  Added missing analysis of NP's + PP construction
10. Added lexical entry for 'mountain pasture' as title
11. Renamed inconsistent degree adverb preds:
        "_a+little_x_deg_rel"
        "_steeply_x_deg_rel"
        "_directly_x_deg_rel"
        "_shortly_x_deg_rel"
12. Added entry for adj 'so' ('true') with expl-it subj: "It is so that ..."

Note that only the gold profiles for 'csli', 'mrs', and 'hike' have been
updated for this release.

------------------------------------------------------------------------------
Release notes for version "LinGO (05-Oct-06)"

Minor tuning for upcoming LOGON release:
1. Corrected PRED name for "downstairs"
        Old SEM-I entry:
        "_downstairs_a_1_rel" : ARG0 e, ARG1 u.
        New:
        _downstairs_p_rel : ARG0 e, ARG1 u.          
2. Corrected nbar-fragment rule to also analyze measure-nouns like "centimeter"

------------------------------------------------------------------------------
Release notes for version "LinGO (27-Sept-06)"

- Further LOGON tuning for better harmonization with NorGram
- Used the newly available selective unpacking in PET to create the treebanks
  in the 'gold' subdirectory.
- Updated the treebanks for the full set of profiles in 'gold'

------------------------------------------------------------------------------
Release notes for version "LinGO (11-Sept-06)"

- Tuning for improved LOGON generation for 'vei' development corpus
- Added several thousand lexical entries based on frequency of use in
  the BNC, guided by the unigram and bigram error mining analysis of
  Yi Zhang.  In particular, added entries for (i) words that were entirely
  missing and have a BNC frequency of 100 or more; and for words with at
  least one entry already in the ERG, but with (ii) a unigram error score
  of 0.00, or (iii) a bigram score of 0.00.

------------------------------------------------------------------------------

Release notes for version "LinGO (18-Jul-06)"

- Minor tuning to improve coverage on LOGON 'vei' items

------------------------------------------------------------------------------
Release notes for version "LinGO (Jul-06)"

- Improvements in semantic composition (assisted by useful error analysis
  in utool) and additional lexical entries, as noted in internal LOGON
  release notes below, since last public release of January 2006.
- Converted leaf lexical type names to conform to new naming conventions,
  with mapping from old to new names provided in file "new-le-types.txt".
  See wiki.delph-in.net/erg for documentation of new LE types.
- Adopted use of new variable property mappings given in file "semi.vpm".
- Updated treebank summary for LOGON data, in erg/gold (these profiles
  were used to retrain the parse selection model in "jh.mem"):

        Profile   Items    Parsed  Treebank
        -----------------------------------
        JH0         261       233       197
        JH1        1254      1132      1043
        JH2        1185      1047       908
        JH3        1311      1197      1057
        JH4        1454      1336      1214
        JH5         464       408       371
        PS          965       892       833
        TG         2014      1831      1656
        ROND       1290      1196      1072
        -----     -----     -----     -----
        Totals    10198      9272      8351
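
As a cross-check of the table above, the following Python sketch recomputes
the totals and derives the overall proportions.  The numbers are copied from
the table; the variable names are chosen here for illustration only.

```python
# Per-profile counts from the table: (items, parsed, treebanked).
rows = {
    "JH0": (261, 233, 197),
    "JH1": (1254, 1132, 1043),
    "JH2": (1185, 1047, 908),
    "JH3": (1311, 1197, 1057),
    "JH4": (1454, 1336, 1214),
    "JH5": (464, 408, 371),
    "PS": (965, 892, 833),
    "TG": (2014, 1831, 1656),
    "ROND": (1290, 1196, 1072),
}

# Column-wise sums reproduce the "Totals" row.
totals = tuple(sum(col) for col in zip(*rows.values()))
items, parsed, treebanked = totals

parsed_pct = 100 * parsed / items          # share of items with a parse
treebanked_pct = 100 * treebanked / items  # share of items in the treebank

print(f"totals: {totals}")
print(f"parsed: {parsed_pct:.1f}%  treebanked: {treebanked_pct:.1f}%")
```

On these figures, about 90.9% of the items receive a parse and about 81.9%
are treebanked.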

------------------------------------------------------------------------------
Internal release notes for version "LinGO (08-Jun-06)"

Small corrections to semantics of title nouns, both alone and in compounds.
Note that only the 'gold' profiles for csli, mrs, and hike have been updated.

------------------------------------------------------------------------------
Internal release notes for version "LinGO (24-May-06)"

Added lexical entries needed for remaining LOGON development corpus (Turglede
and Preikestolen texts).
Made semantics for comparatives, superlatives, and much/many more consistent.
Reduced generation output of variants with commas for modification & coord.
NP-coord - Corrected semantics, adding qeq (more consistent, and more scopes)
Free rels - Made embedded message be prpstn_m_rel, not underspecified.
Corrected semantics errors throughout, using Utool
Added treebank profiles for ps (Preikestolen) and tg (Turglede) data.

------------------------------------------------------------------------------
Internal release notes for version "LinGO (13-Feb-06)"

Corrected semantics for quantifiers 'most' and 'the most', dropping the
predicate 'most_q_rel' in favor of decomposed semantics using the usual
"many-much_a_rel".

------------------------------------------------------------------------------
Internal release notes for version "LinGO (09-Feb-06)"

Minor improvements in SEM-I content, and correction of an item in gold MRS.

------------------------------------------------------------------------------
Internal release notes for version "LinGO (06-Feb-06)"

More harmony for depictives, now with same semantics as other subordinate
clauses.  Also corrections to SEM-I for directional PP verbs.

------------------------------------------------------------------------------
Internal release notes for version "LinGO (03-Feb-06)"

Improved harmony:
 - Comparative and superlative determiners now have decomposed semantics
   analogous to corresponding adjectives, consistent with NorGram
 - Comparative and superlative adjectives now present the ARG0 of the
   comp_rel/superl_rel as their INDEX, with one benefit being a better
   MRS for measured comparatives, as in 'Dogs are 5 cm taller than horses.'
 - Free relatives, like ordinary relatives, no longer introduce a TPC
   value for their message.   

Consistency:
 - Lexical entry type for named years ('2004') is now treated more like
   other named entities, undergoing a bare-NP rule to project a full NP.
 - Title compounds as in 'project manager Abrams' now have the compound
   relation take two ref-inds as arguments, like one would expect.

------------------------------------------------------------------------------
Release notes for version "LinGO (Jan-06)"

PLEASE NOTE: This version of the ERG requires up-to-date versions of both
the LKB and PET, since it takes advantage of improvements in the treatment
of morphology in the LKB, and also depends on a consistent treatment of
special characters like \?, \(, and \".

This version includes minor tuning adjustments to the lexicon and grammar,
to improve overall precision and coverage on the data sets included in the
Redwoods 6 (Norwegian Growth) treebank, which has been expanded to include 
about 5000 items from the LOGON development corpus on Norwegian back-country 
tourism.  The single-best-parse profiles for this additional data appear as
usual in the subdirectory 'gold', in the six directories jh0 - jh5.

In addition, the grammar now includes a semantic interface file 'erg.smi'
which currently specifies the minimal properties of each lexical predicate,
including its name and its arguments, their types, and their optionality.
This file should soon also include the grammar predicates (those
introduced by rules rather than by lexical entries), as well as the set
of abstract predicates which are intended as part of the external interface
to the grammar.

------------------------------------------------------------------------------
Release notes for version "LinGO (05-Dec-05)"

1. Punctuation - Eliminated the duplication in files that was formerly
   needed for minor differences between the LKB and PET, now resolved.
2. Lexicon - Added vocabulary needed for the LOGON development corpus
   on tourism in the Norwegian mountains.
3. Generation - Tuned the trigger rules for introducing semantically
   empty lexical entries, for improved efficiency.
4. Treebanks - There are now additional profiles jh* in the directory gold, 
   for several segments of the LOGON development corpus for the Jotenheimen 
   region.  In this release, only jh1 is updated; the other five sections
   will follow soon.  The other (non-LOGON) profiles are all up to date.

------------------------------------------------------------------------------
Release notes for version "LinGO (23-Nov-05)"

1. Corrected lexical entries for "write" and "unevaluated", as well as
   the preprocessor-related "twodigitdomersatz".  Also added entry for
   "untrafficked".
2. Repaired error in comma punctuation which was causing overgeneration.
3. Corrected error in lexical types for day-of-month entries which was
   producing ill-formed MRSs.

------------------------------------------------------------------------------
Release notes for version "LinGO (15-Nov-05)"

1. Added and corrected lexical entries and SEM-I
   - Most interestingly, added some entries for 'kind' readings, as for
     the noun "bear" in "they hunted bear."  The predicate names are
     distinct, since presumably these would be derived from some lexical
     rule producing a distinct sense, and take the form "_<noun>_n_kind_rel"
   - Changed the single entry for the adjective "born" so it is treated
     semantically more like the passive participle it once was, and now
     introduces the predicate "_bear_v_2_rel" with a distinct sense of
     the verb "bear" from that in "Kim can't bear to lose"
   - Made changes in response to requests from JTL for transfer.

2. Tuned grammar in minor respects to improve consistency in treebanking
   the JH corpus.

------------------------------------------------------------------------------
Release notes for version "LinGO (10-Nov-05)"

1. Corrected SEM-I and lexicon errors noted by JTL, and improved constraints 
   on lexical types with handle arguments so the SEM-I reflects these 
   (introducing e.g. [ ARG3 h ] instead of formerly [ ARG3 u ]).
2. Added a few more lexical entries needed for JH, and some minor syntactic
   additions for constructions like "Try it yourself" and "Kvame became
   sole owner".

------------------------------------------------------------------------------
Release notes for version "LinGO (05-Nov-05)"

Quick additional release to make improvements for treebanking Jotenheim

1. Punctuation - Cleaned out a few more temporary patches in preprocessor
   and lexicon, especially for |"|, |(|, |)| which had had substitutions.
2. Preprocessing - Added a few more cases revealed by Jotenheim data.
3. Lexicon - Added a few missing multi-words that emerged from initial
   treebanking, and changed a few more formerly relational nouns to just
   ordinary nouns, to avoid spurious ambiguity
      'top, bottom, side, front, back'
   Also (finally) corrected the pred names for "anybody", "someone", etc.
   to now use _any_q_rel rather than any_q_rel, and same for _some_q_rel.
4. Fixed TPC assignments in relative clauses and for 'wonder'.
5. Corrected nominalization, which became too constrained in an attempt to 
   avoid spurious ambiguity.

------------------------------------------------------------------------------
Release notes for version "LinGO (01-Nov-05)"

1. Tuned generation trigger rules to reduce overgeneration, improve efficiency.
   Also attempted to make more consistent use of TPC, PSV, allowing underspec.
2. Revised morphology to benefit from improvements in LKB and later in PET,
   now that irregularly inflected words can co-exist with punctuation suffixes
   (so eliminated files inflr-pet.tdl, inflr-pnct-pet.tdl, robust.tdl, and
   robust-pnct.tdl).
3. Reduced inventory of scopal adverbs, and improved consistency for adverbs.
   Note in particular that most so-called discourse adverbs have been
   converted to scopal adverbs, and the conjunctions 'and, or, but' are now
   treated as such even when they are sentence-initial.
4. Corrected some errors in lexical types and in syntactic rules; in particular
   fixed type for mass_ppcomp, which was broken, and improved nbar-coordination
   whose semantics was not ideal.
5. Some other lexical changes:
   - 'both' determiner is now logically equivalent to "the two".
   - 'respect (for)' wasn't entered as a mass noun, now is.
   - 'cross_over_v1, _v2' removed from lexicon (now done compositionally)
   - various entries for cardinal "one" had CARG "01", now just CARG "1".

------------------------------------------------------------------------------
Release notes for version "LinGO (09-Sep-05)"

1. Repaired punctuation overgeneration for non-WH topicalization, by removing
   a licensing for constructions like "Who won? asked Kim."  (not frequent in
   our data set, though seen in Rondane).
2. Removed STATIVE from grammar, since no longer used
3. Removed spurious fragment rules only used for parsing dictionary definitions
4. Corrected lexical predicates in SEM-I
    _have_v_to_rel   => "_have_v_to_rel"   (from type to string)
    "_fail1_v_1_rel" => "_fail_v_1_rel"    (misspelling)
5. Added missing lexical entry for unaccusative (intransitive) "weaken"
6. Added lexical entries for "move" and "drive" analogous to "put", still
   using the same inventory of predicates in the SEM-I.
7. Split the lexical rule for prenominal verbal modifiers into two rules,
   one for present participles and one for passives, to avoid spurious
   verb-particle entries which should be disallowed as modifiers (since
   the particle can't be present).
8. Modified the types for raising verbs taking an infinitival VP complement 
   so they uniformly combine with the infinitival "to" which introduces a 
   message.
9. Added reentrancies for TPC and PSV so the appropriate values appear
   on messages in embedded clauses.
10. Improved generator efficiency by adding grammar-internal feature --TPC
   which new generator compliance rules assign a value based on the public
   feature TPC.
11. Also further refined trigger rules, and exploited the newly invented
   compliance rules which adjust the input MRS to comply with grammar-internal
   constraints (so far restricted to assigning value for --TPC based on TPC).
12. Again for efficiency, added constraints on events introduced by adverbs
   and degree specifiers so they will not trigger lexical entries in
   generation.
13. Once again corrected the reported failure to generate some examples like
   "Abrams could." which made use of ellipsis_rel as underspecification of
   ellipsis_ref_rel.

------------------------------------------------------------------------------
Release notes for version "LinGO (05-Sep-05)"

Improved generation with punctuation and fragments.  Updated Verbmobil 
section of Redwoods treebank, and filled in missing gold profiles.

------------------------------------------------------------------------------
Release notes for version "LinGO (02-Sep-05)"

Minor update: Modified trigger rules to use unification rather than 
subsumption, and added some abstractions over trigger rules, in mtr.tdl.
Further reduced spurious commas preceding modifiers in generation.
Punctuation rules now compatible with current LKB morphology.  Infinitival
subjects no longer introduce nominalization (as in "To err is human.")

------------------------------------------------------------------------------
Release notes for version "LinGO (15-Aug-05)"

Minor update: The usual normalizing of predicate names, this time mostly for
expletive-it-taking predicates.  Also some further tuning of trigger rules,
and change to verb_synsem to make sure uninflected lexical entries already
identify their INDEX and KEYREL.ARG0, for better generator initialization.

------------------------------------------------------------------------------
Release notes for version "LinGO (09-Aug-05)"

Minor update for yet more consistency in predicate names, especially for
relational nouns and adjectives, respectively, to get their related entries 
to match in predicate names.  Also corrected ordering error in prp_infl_rule
and added a few additional lexical entries for the LOGON development corpus.

------------------------------------------------------------------------------
Release notes for version "LinGO (05-Aug-05)"

Minor update to improve consistency in predicate naming conventions, and
to restore the 'chunking' roots in roots.tdl which are used experimentally
in trying to generate from fragmented MRSs.

Note that in this release, only the 'gold' profiles for 'csli', 'mrs',
and 'hike' have been updated.

------------------------------------------------------------------------------
Release notes for version "LinGO (Jul-05)"

This release incorporates several significant changes to the previous 
release, but at long last also includes a first step at documenting an
external semantic interface for the grammar.  The changes will soon be
described in a little more detail on the ERG Wiki, but in summary:

1. Punctuation as affixation

   Previous versions of the grammar implemented a treatment of punctuation
   adopting a standard but linguistically dubious strategy of using a
   preprocessor to make all punctuation marks distinct tokens, adding
   spaces around each one.  This version implements an analysis which
   leaves the input string unchanged with respect to punctuation (except
   for apostrophes), and treats the punctuation marks as spell-changing
   affixes.  This change creates backward incompatibilities with earlier
   treebanks because the tokenization for each sentence is now different.
   A few infelicities remain from making this change, including
     - minor inconsistencies in the readers of affixation rules for the
       LKB and PET (and even for previous and current versions of the LKB)
     - imperfect interaction of irregular inflected forms and punctuation
     - imperfect interaction of multi-words and punctuation
   There are work-arounds for some of these, awaiting better resolution.

2. Semantics
  
   a. Semantically empty prepositions no longer introduce an EP (they
      used to add an EP whose predicate name ended in "_sel_rel", for
      lexically 'selected').  So the generator trigger rules have been
      augmented to automatically introduce the necessary lexical entries
      for generation, currently based on predicate-naming conventions
      for the lexical entries that select empty prepositions.
   b. Messages now introduce an additional attribute, ARG0, whose value
      is the event of the highest-scoping verbal EP within the scope of
      the message.  The main motivation is to make it simpler for
      applications to identify the relevant event properties of a
      clause's semantics without looking 'inside' the clause's MRS.
   c. All lexical predicates now have some value in the 'sense' field
      of the predicate name (Background: by convention in the ERG, each
      lexical predicate name has the following form: _ORTH_POS_SENSE_rel
      where ORTH is the lexeme's orthography, POS is a coarse-grained
      sense distinction drawing from the vocabulary [v n a p x q c], and
      SENSE is an arbitrary sequence of characters (excluding |_|), and
      where each of the fields is separated by an underscore.  Earlier,
      the sense field could have been left empty.)  The default value for
      the sense field is now '1'.
   d. Relational nouns now specify in their sense field the orthography 
      of the preposition marking their oblique complement (usually 'of').
   e. Tag questions previously discarded the semantics of the tag phrase,
      contrary to the monotonicity assumption in the ERG.  This is now
      corrected, with the result that the semantics of sentences with
      tag questions is now rather more baroque.  The main benefit of the
      reanalysis is that lexical rules now properly always preserve the
      semantics of their input lexemes.
   f. Sentential subjects were previously analyzed via a nominalization
      rule.  This simplified the syntactic analysis of "That Abrams
      arrived annoyed Browne" since the "annoy" lexeme could always
      unify its ARG1 value with the semantic index of its subject.  But
      the resulting asymmetry for the 'extraposed' and non-extraposed
      variants of lexemes like 'annoy' was annoying.  This version of
      the grammar now provides the same MRS for both variants ('It
      annoyed Browne that Abrams arrived' and the above example), via
      a syntactic variant of an 'it-extraposition' lexical rule, with
      thanks to Ann Copestake for the suggested implementation.  One
      consequence is that the earlier treatment of examples like "The
      problem was that Abrams arrived" no longer works, since the 
      identity copula was being used, and requires its complement to
      supply a referential index.  So there is also yet another entry
      for the verb 'be', which supplies an EP similar to the identity 
      'be'.
   g. Verbal modifiers of nouns were being given an inconsistent
      semantics, with postnominal modifiers as in 'people singing arias'
      supplying a message for the modifier phrase, but with prenominal
      modifiers as in 'the singing people' not contributing a message.
      In this version of the grammar, verbal projections now always
      supply a message, making the world a little more consistent, but
      leaving a sharper contrast now between "the singing children"
      and "the interesting children" where 'interesting' is analyzed
      as an adjective and hence does not supply a message.
      
3. Lexicon

   New lexical entries have been added drawn from the Norwegian tourism
   domain of the LOGON development corpus, bringing the current number
   of lexemes to 22,750 for this release, of which about 2700 are proper
   names.

4. SEM-I

   A first draft of the semantic interface for the grammar is now 
   presented in the file erg-full.smi, including the predicate names and
   semantic arguments of all predicates introduced either by lexical
   entries or by the grammar (either via lexical/syntactic rules or via
   abstractions over more specific predicates).  Documentation of this
   file is under active development.

5. Naming conventions

   The feature name DIVISIBLE on referential indices has been shortened
   to DIV for better readability of MRSs.

6. LKB warnings on grammar loading

   The LKB's new and improved treatment of morphology offers several
   advantages, and the current version of the grammar benefits from
   these, but still results in some warning messages when loading.  
   Users can ignore these messages for now, while the developers resolve
   the underlying causes.  The first is about the 'punct_bang_rule',
   and the others warn of lexical rules that can feed themselves.
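
The predicate naming convention described in item 2c above
(_ORTH_POS_SENSE_rel) can be checked mechanically.  The Python sketch below
is an illustration only, not part of the grammar or of the SEM-I tools.  It
makes two simplifying assumptions that go beyond the text: ORTH contains no
underscore, and a name with no sense field is still accepted (with the sense
reported as None) so that older names can be recognized.

```python
import re
from typing import NamedTuple, Optional

# _ORTH_POS_SENSE_rel, with POS drawn from [v n a p x q c].
# Assumption: ORTH has no underscore; the SENSE field is optional here.
PRED_RE = re.compile(
    r"^_(?P<orth>[^_]+)_(?P<pos>[vnapxqc])(?:_(?P<sense>[^_]+))?_rel$"
)


class LexicalPred(NamedTuple):
    orth: str
    pos: str
    sense: Optional[str]


def parse_lexical_pred(name: str) -> Optional[LexicalPred]:
    """Split a lexical predicate name into its fields, or return None."""
    m = PRED_RE.match(name)
    if m is None:
        return None
    return LexicalPred(m["orth"], m["pos"], m["sense"])


# Predicate names that appear elsewhere in these release notes.
for name in ["_bear_v_2_rel", "_have_v_to_rel", "_any_q_rel", "poss_rel"]:
    print(name, "->", parse_lexical_pred(name))
```

Names without the leading underscore, such as poss_rel, are not lexical
predicates under this convention, so the function returns None for them.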

------------------------------------------------------------------------------
Release notes for version "LinGO (30-Apr-05)"

This is a minor update to the Apr-05 version, including some lexical 
additions, adjustments to the semantic predicate hierarchy, and tuning
of syntactic analyses, all designed to improve end-to-end translation
for LOGON.  The only substantive difference is in the analysis of 
possessive constructions, where the grammar now produces nearly
identical MRSs for the two noun phrases "our book" and "a book of ours",
using a new lexical entry for "ours" distinct from the ordinary "ours"
of "ours are not ready".  One consequence of this reanalysis, which
unifies the treatment of the two possessive constructions, is that
the two arguments in the old 'poss_rel' EP have been reversed: what was 
the ARG1 is now ARG2, and vice versa.
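
Applications that consume MRSs from earlier versions may need to adapt to
the reversed arguments.  The Python sketch below shows one way to do so,
assuming a hypothetical representation of an EP as a dictionary with a
"pred" key and role keys such as "ARG1" and "ARG2"; this layout is an
assumption for illustration, not an interface provided by the LKB or PET.

```python
def migrate_poss_ep(ep: dict) -> dict:
    """Return a copy of the EP with ARG1 and ARG2 swapped for poss_rel.

    EPs with any other predicate are copied unchanged.  The dictionary
    layout ("pred", "ARG0", "ARG1", "ARG2") is hypothetical.
    """
    new_ep = dict(ep)
    if ep.get("pred") == "poss_rel":
        new_ep["ARG1"], new_ep["ARG2"] = ep["ARG2"], ep["ARG1"]
    return new_ep


# Made-up variable names, for illustration only.
old_ep = {"pred": "poss_rel", "ARG0": "e2", "ARG1": "x5", "ARG2": "x9"}
new_ep = migrate_poss_ep(old_ep)
print(new_ep)
```

Since the change is a plain swap, applying the function twice restores the
original EP.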

------------------------------------------------------------------------------
Release notes for version "LinGO (Apr-05)"

Overview of changes:

 - Lexicon size increased to 21000 entries
 - MRS quality improved
 - Unicode now used for lexicon: foreign proper names, archaic spellings
 - Coverage added for fragments, locative inversion, 'free' parentheticals
 - Changed analyses to allow PP-modif of PPs, APs; adverb-modif of APs
 - Support for new domains: 'shanghai', 'gcide'

--Lexicon--

BNC - Based on months of hard labor by former Stanford students Hansook Lee
and Mike Orme (with help from Ara Kim), the lexicon now contains all verb
subcat entries for the 2000 most frequent verb stems in the British
National Corpus. This should enable some interesting experimentation in
automated lexical acquisition, since there are fewer lexical types that
need to be hypothesized for non-verbs.

GCIDE - The lexicon now also contains entries for all words observed in the
first 10,000 definition 'sentences' in the GNU Contemporary International
Dictionary of English (GCIDE), to enable more precise evaluation of
syntactic coverage of these definitions.

Shanghai - Based on some 1500 entries constructed by Yi Zhang at CoLi in
Saarbruecken, the lexicon now also contains entries for most of the words
found in a Web-derived corpus on tourism in Shanghai, analogous to the
Rondane corpus built by Becky Neil for the LOGON project in Norway.

--MRS quality--

Based on a substantial implementation effort by Stefan Thater and
colleagues at CoLi, Saarbruecken, to check for well-formedness of MRSs
produced by the grammar for the Redwoods and Rondane corpora, many errors
were identified, enabling improvements in MRS construction in the ERG.
Further improvements were enabled by the systematic use of existing
capabilities in the LKB for diagnosing MRS errors in ERG analyses.  While
the current release still produces some flawed MRSs for these data sets,
they are largely confined to a small inventory of known and somewhat
problematic minor phenomena.

--Unicode--

Drawing on the combined expertise of Stephan Oepen and Francis Bond, the
ERG is now fully Unicode-compliant, including the PSQL database.  This
enables proper representation in the lexicon for orthography of non-English
proper names such as "østerbø", and archaic English spellings such as
"coöperation".  The necessary infrastructure for Unicode is admirably and
demonstrably in place in the LKB, PET, [incr tsdb()], and PostgreSQL.
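
One practical consequence for code that looks up such orthographies is
that the same visible spelling can be encoded in more than one way.  The
Python sketch below uses the standard unicodedata module to normalize
strings before comparison.  It is an illustration only: the function name
lexicon_key is hypothetical, and nothing here describes how the LKB, PET,
or the lexical database handle normalization internally.

```python
import unicodedata


def lexicon_key(orth: str) -> str:
    """Normalize to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", orth)


precomposed = "co\u00f6peration"   # 'ö' as a single code point
decomposed = "coo\u0308peration"   # 'o' followed by a combining diaeresis

print(precomposed == decomposed)                            # raw comparison
print(lexicon_key(precomposed) == lexicon_key(decomposed))  # normalized
```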

--Coverage--

Fragments - Further work on the treatment of fragments has been motivated
largely by the effort to parse the definition sentences in GCIDE, and to
give them a consistent semantic representation.  New fragment types now
licensed include VPs and PPs with NP gaps, as in "To devour." or
"Relying on.".

Locative inversion - The grammar now analyzes some locative inversion
phenomena, currently restricted to sentences headed by the finite copula
'be' as in "Near the park is a large dog" but not (yet) "Near the park
stood a large tree".  These appear with some frequency in the Rondane data,
and have also been waiting patiently for twenty years in the CSLI test
suite.

'Free' parentheticals - Sentences containing some classes of parenthetical
material (which would not survive in situ without the parentheses) will now
be analyzed, though further work will be needed in designing the target
semantics.  Example now covered: "That dog (you should see its owner!)
barked."

--Changed analyses--

Modification - Based on more systematic analysis of phenomena found in the
Rondane corpus, and corroborated in the Shanghai corpus, the ERG now
permits more interesting modification structures.  Prepositional phrases,
formerly restricted to modifying only VPs and nominal phrases, can now also
modify adjective phrases and other PPs.  Similarly, adverbs can now also
modify adjective phrases, as in "the wildly happy dog barked", freeing the
grammar from its former requirement that duplicate degree-specifier lexical
entries be added for many adverbs.

--New domains--

The GCIDE corpus has been taken from the GCIDE web site, and carefully
prepared by Eric Nichols at NTT in collaboration with Francis Bond,
including identification of sentence breaks, normalization, and formatting,
all of which are now automated via Perl scripts converting the original
GCIDE data into, among other things, an 'item' file format for use with the
fine system.

The Shanghai corpus is being collected by Yi Zhang in Saarbruecken as part
of his thesis work, and consists of text on tourism in Shanghai, written in
English and mostly but not entirely by native English speakers.  The corpus
may still be revised, so a profile of this data is not (yet) being
distributed with the ERG.