Fix CFamilyLexer preprocessor tokenization errors #1830
Merged
CFamilyLexer fails to tokenize preprocessor macros when they are preceded by a line break surrounded by spaces. This happens because the preprocessor regex rule expects to start at the beginning of a line, but the whitespace regex rule also matches the whitespace after the line break. The whitespace rule has now been refined so that it no longer matches the line break. As a result, the preprocessor regex rule correctly matches preprocessor tokens even when they are preceded by spaces, at the cost of emitting a few more tokens into the token stream in some cases.
The main change is in `pygments/lexers/c_cpp.py`. The generic whitespace rule `\s+` has been changed to `[^\S\n]` to avoid matching line breaks. As a consequence, many files under `tests/examplefiles` changed. All of the changes appear to follow the same pattern: a single whitespace token spanning a line break is now split into separate tokens.
In addition, the PR adds three new tests under `tests/snippets` which exercise the preprocessor tokenizer in different situations. The test `tests/snippets/c/test_preproc_file5.txt` may be controversial, as it covers a situation where the code is invalid and the output therefore contains an error token. I'll let the maintainers decide whether it should be kept or removed.

Fixes #1820.