msagca/syntax-scraper

Syntax Scraper

This tool converts the formal syntax section of an IEEE language standard document (PDF) into an ANTLR4 grammar file (.g4).

Installation

pip install git+https://github.com/msagca/syntax-scraper.git

Usage

scrape-ieee [-h] -n grammar_name [-s start_page] [-e end_page] [--split] input_file

positional arguments:
  input_file       IEEE language standard document (format: PDF)

options:
  -h, --help       show this help message and exit
  -n grammar_name  ANTLR4 grammar name
  -s start_page    formal syntax start page (default: first page)
  -e end_page      formal syntax end page (default: last page)
  --split          create a split grammar

Example

scrape-ieee -n SystemVerilog -s 1136 -e 1180 1800-2017.pdf

To complete the resulting grammar, please follow these steps:

  • Open the generated .g4 file in Visual Studio Code.
  • Install the ANTLR4 extension for Visual Studio Code.
  • Address the highlighted issues or errors indicated by the ANTLR4 extension.
  • Search for every occurrence of 'Error!' and manually define the corresponding rules.
  • Remove any section title text, such as 'A.8.7' 'Numbers', from the rules.
  • Remove trailing numbers from rule identifiers (for example, change time_literal44 to time_literal), unless the spec specifically defines such a rule.
  • Identify the rules that span multiple pages in the spec, and manually add any missing parts if necessary.
  • Make sure to append EOF to the start rule(s), such as library_text and source_text, to mark the end of input.
  • Locate the rules that describe lexical tokens like white space, comments, identifiers, numbers, etc., and convert them to lexer rules for proper tokenization.
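Some of the checks above can be partly automated. The following is a minimal sketch, not part of syntax-scraper: the function names are hypothetical, and it assumes that each rule definition in the generated .g4 file begins at the start of a line and that the 'Error!' marker appears verbatim. It only reports candidates; whether a trailing number really should be removed still has to be checked against the spec.

```python
import re

# Hypothetical helpers, not part of syntax-scraper. They flag places in a
# generated grammar that the checklist above says need manual attention.
ERROR_MARKER = "Error!"

# Assumption: a rule definition starts a line, e.g. "time_literal44 :".
# Group 1 is the name without digits, group 2 is the trailing number.
NUMBERED_RULE = re.compile(r"^\s*([a-z_][A-Za-z_]*)(\d+)\s*:", re.MULTILINE)


def find_error_markers(grammar_text):
    """Return the 1-based line numbers that contain the 'Error!' marker."""
    return [
        number
        for number, line in enumerate(grammar_text.splitlines(), start=1)
        if ERROR_MARKER in line
    ]


def find_numbered_rules(grammar_text):
    """Return (identifier, suggested_name) pairs for rules ending in digits."""
    return [
        (match.group(1) + match.group(2), match.group(1))
        for match in NUMBERED_RULE.finditer(grammar_text)
    ]
```

Reading the grammar with `open(path, encoding="utf-8").read()` and passing the text to both functions gives a short to-do list to work through in the editor.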

If the --split option is specified, the tool will create a lexer rule for each keyword or punctuation symbol encountered during the parse tree walk. It will also generate rules for common lexical structures such as identifiers, whitespace, and comments. These rules can be extended by the user later. However, due to limitations in the tool or parse errors, some of these automatically generated rules may be incorrect or invalid. For example, most tokens that begin with a capital letter need to be removed or modified. Additionally, certain symbols highlighted in bold in the specification document may not directly correspond to lexical tokens and require manual handling.

For instance, in the autogenerated lexer grammar for SystemVerilog, the rule LPASRP: '(*)'; matches only the exact character sequence (*) and does not accept whitespace between the symbols. However, the input ( * ) should also be valid, since the parentheses act as argument delimiters in this context. To address this, LPASRP should be replaced with three separate rules: LP: '(';, AS: '*';, and RP: ')';. Consequently, occurrences of the character sequence (*) in the parser grammar should be changed to '(' '*' ')' to match the updated lexer grammar.
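The parser-side change can be applied with a simple text substitution. This is a minimal sketch, not part of syntax-scraper: the function name is hypothetical, and it assumes the parser grammar spells the sequence as the single quoted literal '(*)'. The lexer rules LP, AS, and RP still have to be added by hand.

```python
# Hypothetical helper, not part of syntax-scraper.
# Assumption: the parser grammar contains the fused quoted literal '(*)'.
FUSED_LITERAL = "'(*)'"
SPLIT_LITERALS = "'(' '*' ')'"


def split_fused_literal(parser_grammar_text):
    """Replace every '(*)' literal with the three literals '(' '*' ')'."""
    return parser_grammar_text.replace(FUSED_LITERAL, SPLIT_LITERALS)
```

With the literals separated, whitespace that the lexer skips between tokens no longer prevents the rule from matching, so both (*) and ( * ) can be accepted.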