Supporting non-ASCII characters as markers #435

fabswt · 2016-10-06T13:54:25Z

Hi there,
Thank you for the library. I was able to add support for custom Markdown in a matter of minutes.

However, when trying to use curly quotes “” as markers, I get the following error:

Notice: Undefined index: � in /Users/fabien/Dropbox/AppData/XAMPP/web/bilingueanglais/public_html/blog/wp-content/plugins/WP-Fab-MarkDown/parsedown/Parsedown.php on line 1004
Warning: Invalid argument supplied for foreach() in /Users/fabien/Dropbox/AppData/XAMPP/web/bilingueanglais/public_html/blog/wp-content/plugins/WP-Fab-MarkDown/parsedown/Parsedown.php on line 1004

If I use ASCII characters for the marker (e.g.: the @ sign), it works just fine. Here is the code with both variants:

function __construct()
{
    $this->InlineTypes['@'] []= 'EnglishSpanAt';
    $this->inlineMarkerList .= '@';

    $this->InlineTypes['“'] []= 'EnglishSpan';
    $this->inlineMarkerList .= '“';
}


protected function inlineEnglishSpanAt($Excerpt)
{
    if ( ! isset($Excerpt['text'][0]))
    {
        return;
    }

    if ($Excerpt['text'][0] === '@' and preg_match('/^@(?=\S)(.+?)(?<=\S)@/', $Excerpt['text'], $matches))
    {
        return array(
            'extent' => strlen($matches[0]),
            'element' => array(
                'name' => 'span',
                'text' => $matches[1],
                'handler' => 'line',
                'attributes' => array(
                    'lang' => 'en',
                ),
            ),
        );
    }
}

protected function inlineEnglishSpan($Excerpt)
{
    if ( ! isset($Excerpt['text'][0]))
    {
        return;
    }

    if ($Excerpt['text'][0] === '“' and preg_match('/^“(?=\S)(.+?)(?<=\S)”/', $Excerpt['text'], $matches))
    {
        return array(
            'extent' => strlen($matches[0]),
            'element' => array(
                'name' => 'span',
                'text' => $matches[1],
                'handler' => 'line',
                'attributes' => array(
                    'lang' => 'en',
                ),
            ),
        );
    }
}

In other words, it seems non-ASCII characters are not supported as markers.

Is it a PHP issue?

My understanding is that the preg_* functions do support UTF-8 and that it's fine to use UTF-8 strings as array keys. Indeed, this produces the expected result:

$foo['“'] = "bar";
var_dump( $foo ); # array(1) {["“"]=> string(3) "bar"}

Inside Parsedown's line method, before the foreach, var_dump( $this->InlineTypes['“'] ); works fine (the curly quote is displayed properly in the dump), but var_dump( $marker ); produces mojibake instead of the expected curly quote character.

So correct me if I'm wrong but it seems to me this is an issue with the library itself.

Inside the library

What can we do to support UTF8?

I tried to look at the code.

Before the foreach (L. 1004), I could use $marker = mb_substr($excerpt, 0, 1, 'UTF8'); instead of $marker = $excerpt[0];, and $markerPosition = mb_strpos($text, $marker); instead of $markerPosition = strpos($text, $marker);.

But that's not enough and the Undefined index on the foreach persists.

It's not clear to me whether strpbrk() supports UTF8. I'm stuck there.
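For what it's worth, strpbrk() and string indexing in PHP both work on bytes, not characters, which would explain the mojibake in $marker. Below is a minimal sketch that reproduces the lookup outside Parsedown. The variables $inlineTypes and $markerList are stand-ins for $this->InlineTypes and $this->inlineMarkerList, and the file is assumed to be saved as UTF-8.

```php
<?php
// Stand-ins for Parsedown's properties (assumption: file is saved as UTF-8).
$inlineTypes = array('“' => array('EnglishSpan'));
$markerList  = '“';

// strlen() counts bytes: the curly quote is 3 bytes (E2 80 9C).
$markerBytes = strlen($markerList);

// strpbrk() treats the list as three separate bytes and stops at the first hit.
$excerpt = strpbrk('Say “hello” now', $markerList);

// Indexing a string yields one byte, so the "marker" is only the lead byte.
$marker    = $excerpt[0];
$markerHex = bin2hex($marker);

// The array key is the full 3-byte string, so the 1-byte lookup fails.
$found = isset($inlineTypes[$marker]);

// The closing quote ” (E2 80 9D) shares bytes with the opening one, so
// strpbrk() reports a hit even though “ does not occur in this text.
$falseHit = strpbrk('only ” here', $markerList);

var_dump($markerBytes, $markerHex, $found, $falseHit !== false);
```

The one-byte $marker is not a key of the array, which matches the "Undefined index" notice above.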

Can you please advise?


Dave-Morton · 2016-12-25T00:21:55Z

I'm new to Parsedown, so this may well be off the mark, but is the HTML interface you're using encoded as UTF-8, some other encoding, or none at all? I've found in the past that if the HTML interface isn't set to UTF-8, the text sent to PHP is also not encoded to that character set when sent via either GET or POST. Something to look into, I think, no?

fabswt · 2016-12-25T10:02:42Z

Sorry, I meant to follow up on this as I ended up with a nice fix.

Several PHP string functions lack UTF-8 support, which, as I understand it, means core Parsedown functions would need to be rewritten to handle UTF-8.
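As a rough illustration of what such a rewrite might involve, here is a sketch of a UTF-8-aware marker lookup built on preg_match() with the /u modifier. The function findFirstMarker() is hypothetical and not part of Parsedown. It only shows one way the strpbrk() step could be replaced, under the assumption that markers are kept as a list of whole strings rather than one concatenated string.

```php
<?php
// Hypothetical helper, not part of Parsedown: finds the first occurrence of
// any marker, treating each marker as a whole UTF-8 string.
function findFirstMarker(string $text, array $markers): ?array
{
    if ($markers === array()) {
        return null;
    }

    // Escape each marker and join them into one alternation.
    $alternatives = array_map(
        function ($candidate) { return preg_quote($candidate, '/'); },
        $markers
    );
    $pattern = '/' . implode('|', $alternatives) . '/u';

    if (preg_match($pattern, $text, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    // With PREG_OFFSET_CAPTURE the offset is in bytes, so substr() is safe here.
    list($marker, $position) = $match[0];

    return array(
        'marker'   => $marker,
        'position' => $position,
        'excerpt'  => substr($text, $position),
    );
}

$hit = findFirstMarker('Say “hello” now', array('@', '“'));
var_dump($hit);
```

The returned marker is the complete character, so it can be used directly as an array key.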

I got a reply from Emanuil suggesting a workaround: alias the UTF-8 characters with ASCII ones. That is, replace the UTF-8 tokens in the input with ASCII characters, then use those ASCII characters as the markers in the Parsedown definitions.

I needed to support curly quotes as tokens. It looks like this:

class FabParsedownExtension extends Parsedown {
	function __construct() {

		# "\x1F" is ASCII character 31 (unit separator), written as an escape
		$this->InlineTypes["\x1F"] []= 'EnglishSpan';
		$this->inlineMarkerList .= "\x1F";

	}


	/** Hijacking the original method. i.e.: borrowed from the Parsedown class, and extended. **/
	function text($text) {

		/*** Support non-ASCII characters via replacement of them with other, ASCII, characters ***/
		$text = str_replace( array( '“', '”'), "\x1F", $text );


		/*** BELOW: copy of original method ***/

		# make sure no definitions are set
		$this->DefinitionData = array();

		# standardize line breaks
		$text = str_replace(array("\r\n", "\r"), "\n", $text);

		# remove surrounding line breaks
		$text = trim($text, "\n");

		# split text into lines
		$lines = explode("\n", $text);

		# iterate through lines to identify blocks
		$markup = $this->lines($lines);

		# trim line breaks
		$markup = trim($markup, "\n");

		return $markup;
	}


	protected function inlineEnglishSpan($Excerpt) {
		if ( ! isset($Excerpt['text'][0])) {
			return;
		}

		if ($Excerpt['text'][0] === "\x1F" and preg_match('/^\x1F(?=\S)(.+?)(?<=\S)\x1F/', $Excerpt['text'], $matches)) {
			return array(
				'extent' => strlen($matches[0]),
				'element' => array(
					'name' => 'span',
					'text' => $matches[1],
					'handler' => 'line',
					'attributes' => array(
						'lang' => 'en'
					)
				)
			);
		}
	}
}

?>

Note that I opted to use a non-visible ASCII character (in my case character 31) to save me the hassle of figuring out what would happen if anyone used them -- no one will.

(A list of non-visible (control) characters is available here https://en.wikipedia.org/wiki/ASCII#Control_characters though inserting them may prove a bit more tricky.)
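One way to avoid typing the control character literally is PHP's "\x1F" escape in a double-quoted string (PCRE also understands \x1F inside a pattern). The sketch below shows the aliasing step on its own. The function aliasCurlyQuotes() and the constant MARKER_ALIAS are made-up names for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// ASCII 31 (unit separator), written as an escape instead of a literal byte.
const MARKER_ALIAS = "\x1F";

// Hypothetical helper: swaps both curly quotes for the single-byte alias.
function aliasCurlyQuotes(string $text): string
{
    return str_replace(array('“', '”'), MARKER_ALIAS, $text);
}

$aliased = aliasCurlyQuotes('Say “hello” now');

// The alias is one byte, so the byte-wise strpbrk() and [0] lookups agree.
$excerpt  = strpbrk($aliased, MARKER_ALIAS);
$isMarker = ($excerpt[0] === MARKER_ALIAS);

// Same pattern shape as inlineEnglishSpan, with \x1F as both delimiters.
$matched = preg_match('/^\x1F(?=\S)(.+?)(?<=\S)\x1F/', $excerpt, $matches);

var_dump($isMarker, $matched, $matches[1]);
```

Because the alias replaces both the opening and the closing quote, the pattern uses the same byte on both sides.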

I hope this helps.

Daniel-KM · 2017-06-23T07:29:31Z

May be fixed by #513.

aidantwoods mentioned this issue Feb 27, 2018

Outstanding Issues aidanwoods/parsedown#10

Closed

aidantwoods added the bug label Mar 28, 2018
