Supporting non-ASCII characters as markers #435

Open
fabswt opened this issue Oct 6, 2016 · 3 comments

fabswt commented Oct 6, 2016

Hi there,
Thank you for the library. I was able to add support for custom Markdown in a matter of minutes.

However, when trying to use curly quotes “” as markers, I get the following error:

Notice: Undefined index: � in /Users/fabien/Dropbox/AppData/XAMPP/web/bilingueanglais/public_html/blog/wp-content/plugins/WP-Fab-MarkDown/parsedown/Parsedown.php on line 1004
Warning: Invalid argument supplied for foreach() in /Users/fabien/Dropbox/AppData/XAMPP/web/bilingueanglais/public_html/blog/wp-content/plugins/WP-Fab-MarkDown/parsedown/Parsedown.php on line 1004

If I use ASCII characters for the marker (e.g.: the @ sign), it works just fine. Here is the code with both variants:

function __construct()
{
    $this->InlineTypes['@'] []= 'EnglishSpanAt';
    $this->inlineMarkerList .= '@';

    $this->InlineTypes['“'] []= 'EnglishSpan';
    $this->inlineMarkerList .= '“';
}


protected function inlineEnglishSpanAt($Excerpt)
{
    if ( ! isset($Excerpt['text'][0]))
    {
        return;
    }

    if ($Excerpt['text'][0] === '@' and preg_match('/^@(?=\S)(.+?)(?<=\S)@/', $Excerpt['text'], $matches))
    {
        return array(
            'extent' => strlen($matches[0]),
            'element' => array(
                'name' => 'span',
                'text' => $matches[1],
                'handler' => 'line',
                'attributes' => array(
                    'lang' => 'en',
                ),
            ),
        );
    }
}

protected function inlineEnglishSpan($Excerpt)
{
    if ( ! isset($Excerpt['text'][0]))
    {
        return;
    }

    if ($Excerpt['text'][0] === '“' and preg_match('/^“(?=\S)(.+?)(?<=\S)”/', $Excerpt['text'], $matches))
    {
        return array(
            'extent' => strlen($matches[0]),
            'element' => array(
                'name' => 'span',
                'text' => $matches[1],
                'handler' => 'line',
                'attributes' => array(
                    'lang' => 'en',
                ),
            ),
        );
    }
}

In other words, it seems non-ASCII characters are not supported as markers.

Is it a PHP issue?

My understanding is that the preg_* functions do support UTF-8 and that it's fine to use UTF-8 strings as array keys. Indeed, this produces the expected result:

$foo['“'] = "bar";
var_dump( $foo ); # array(1) {["“"]=> string(3) "bar"}

Inside Parsedown's line method, before the foreach, var_dump( $this->InlineTypes['“'] ); works fine (the curly quote is displayed properly in the dump), but var_dump( $marker ); produces mojibake instead of the expected curly quote character.

So correct me if I'm wrong but it seems to me this is an issue with the library itself.
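
For reference, the mojibake can be reproduced outside Parsedown. The snippet below is a minimal sketch, assuming the file is saved as UTF-8; the variable names are illustrative and not taken from the library. It shows that reading a string offset such as `$excerpt[0]` yields one byte, so for a curly quote the result is only the first of its three bytes and cannot match the array key.

```php
<?php
// Assumption: this file and its string literals are UTF-8 encoded.
$quote = '“';                  // U+201C, encoded as the bytes E2 80 9C

$byteLength = strlen($quote);  // strlen() counts bytes, not characters
$firstByte  = $quote[0];       // string offsets address single bytes too

// Same shape as the InlineTypes registration in the constructor above.
$types = array('“' => array('EnglishSpan'));

var_dump($byteLength);                 // int(3)
var_dump(bin2hex($quote));             // string(6) "e2809c"
var_dump(bin2hex($firstByte));         // string(2) "e2"
var_dump(isset($types[$quote]));       // bool(true)
var_dump(isset($types[$firstByte]));   // bool(false): the "Undefined index" case
```

The lone byte E2 is not valid UTF-8 on its own, which is why it is printed as a replacement character in the notice.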

Inside the library

What can we do to support UTF8?

I tried to look at the code.

Before the foreach (line 1004), I could use $marker = mb_substr( $excerpt, 0, 1, 'UTF8'); instead of $marker = $excerpt[0];, and $markerPosition = mb_strpos ($text, $marker); instead of $markerPosition = strpos($text, $marker);.

But that's not enough and the Undefined index on the foreach persists.

It's not clear to me whether strpbrk() supports UTF8. I'm stuck there.
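
On the strpbrk() question: it treats its character list as a set of single bytes, so a multibyte marker in the list matches any character that shares one of its bytes. The sketch below demonstrates that, then shows one possible character-aware replacement built on preg_match() with the /u modifier. `findFirstMarker` is a hypothetical helper written for this illustration, not a Parsedown function, and whether it would slot into the line method without further changes is untested.

```php
<?php
// Assumption: UTF-8 source. The list mirrors the inlineMarkerList additions above.
$markerList = '@“';

// "…" (U+2026) is E2 80 A6, so it shares its first byte with "“" (E2 80 9C).
$text = 'wait… then “quote”';

// strpbrk() stops at the ellipsis, not at the curly quote.
$excerpt = strpbrk($text, $markerList);
var_dump($excerpt === '… then “quote”');   // bool(true)
var_dump(bin2hex($excerpt[0]));             // string(2) "e2"

// Hypothetical helper: find the first whole marker, character-aware.
// Expects a non-empty list of marker strings.
function findFirstMarker(string $text, array $markers)
{
    $quoted = array_map(function ($marker) {
        return preg_quote($marker, '/');
    }, $markers);

    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/' . implode('|', $quoted) . '/u';

    if (preg_match($pattern, $text, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    // Offsets reported by preg_match() are byte offsets, so substr() is correct here.
    list($marker, $byteOffset) = $match[0];

    return array('marker' => $marker, 'excerpt' => substr($text, $byteOffset));
}

$found = findFirstMarker($text, array('@', '“'));
var_dump($found['marker'] === '“');          // bool(true)
var_dump($found['excerpt'] === '“quote”');   // bool(true)
```

This also suggests why swapping in mb_substr() alone did not help: strpbrk() has already stopped at the wrong position, so the first character of the excerpt may be one that was never registered in InlineTypes.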

Can you please advise?

@Dave-Morton

I'm new to Parsedown, so this may well be off the mark, but is the HTML interface you're using encoded as UTF-8, some other encoding, or none at all? I've found in the past that if the HTML interface isn't set to UTF-8, the text sent to PHP is also not encoded to that character set when sent via either GET or POST. Something to look into, I think, no?

fabswt commented Dec 25, 2016

Sorry, I meant to follow up on this as I ended up with a nice fix.

Several of the PHP string functions that Parsedown relies on lack UTF-8 support, so, as I understand it, supporting UTF-8 markers would mean rewriting core Parsedown functions.

I got a reply from Emanuil suggesting a workaround: alias the UTF-8 characters with ASCII ones. That is, replace the UTF-8 tokens with ASCII ones before parsing, then simply use those ASCII characters as markers inside the Parsedown definitions.

I needed to support curly quotes as tokens. It looks like this:

class FabParsedownExtension extends Parsedown {
	function __construct() {

		# "\x1F" is ASCII character 31 (unit separator), written as an escape so it stays visible
		$this->InlineTypes["\x1F"] []= 'EnglishSpan';
		$this->inlineMarkerList .= "\x1F";

	}


	/** Hijacking the original method. i.e.: borrowed from the Parsedown class, and extended. **/
	function text($text) {

		/*** Support non-ASCII characters via replacement of them with other, ASCII, characters ***/
		$text = str_replace( array( '“', '”'), "\x1F", $text );


		/*** BELOW: copy of original method ***/

		# make sure no definitions are set
		$this->DefinitionData = array();

		# standardize line breaks
		$text = str_replace(array("\r\n", "\r"), "\n", $text);

		# remove surrounding line breaks
		$text = trim($text, "\n");

		# split text into lines
		$lines = explode("\n", $text);

		# iterate through lines to identify blocks
		$markup = $this->lines($lines);

		# trim line breaks
		$markup = trim($markup, "\n");

		return $markup;
	}


	protected function inlineEnglishSpan($Excerpt) {
		if ( ! isset($Excerpt['text'][0])) {
			return;
		}

		# \x1F inside the pattern is resolved by PCRE itself, so single quotes are fine here
		if ($Excerpt['text'][0] === "\x1F" and preg_match('/^\x1F(?=\S)(.+?)(?<=\S)\x1F/', $Excerpt['text'], $matches)) {
			return array(
				'extent' => strlen($matches[0]),
				'element' => array(
					'name' => 'span',
					'text' => $matches[1],
					'handler' => 'line',
					'attributes' => array(
						'lang' => 'en'
					)
				)
			);
		}
	}
}

?>

Note that I opted for a non-visible ASCII character as the marker (in my case character 31) to save me the hassle of figuring out what would happen if anyone used it in their own text -- no one will.

(A list of non-visible (control) characters is available here https://en.wikipedia.org/wiki/ASCII#Control_characters though inserting them may prove a bit more tricky.)
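
One way around the insertion problem, sketched here as a suggestion rather than as part of the original fix, is to write the control character as an escape sequence instead of pasting it into the source. `aliasCurlyQuotes` is a hypothetical helper name used only for this example; it isolates the replacement step and checks that the span pattern matches the aliased text.

```php
<?php
// Assumption: UTF-8 source. "\x1F" needs double quotes so PHP expands the escape.
$marker = "\x1F";

// Hypothetical helper: alias both curly quotes to a single ASCII control character.
function aliasCurlyQuotes(string $text, string $marker = "\x1F"): string
{
    return str_replace(array('“', '”'), $marker, $text);
}

$aliased = aliasCurlyQuotes('Il dit “hello” et part.');

var_dump($marker === chr(31));                           // bool(true)
var_dump(strlen($marker));                               // int(1)
var_dump($aliased === "Il dit \x1Fhello\x1F et part.");  // bool(true)

// PCRE understands \x1F on its own, so the pattern can stay single-quoted.
$matched = preg_match('/^\x1F(?=\S)(.+?)(?<=\S)\x1F/', "\x1Fhello\x1F et part.", $matches);
var_dump($matched);      // int(1)
var_dump($matches[1]);   // string(5) "hello"
```

Because the marker is a single byte, the byte-oriented functions inside Parsedown handle it the same way they handle the @ sign.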

I hope this helps.

@Daniel-KM
Contributor

May be fixed by #513.
