Bug 551622 - don't assert when formatting php code with inline html code
Summary: don't assert when formatting php code with inline html code
Status: CLOSED FIXED
Alias: None
Product: z_Archived
Classification: Eclipse Foundation
Component: PDT (show other bugs)
Version: unspecified   Edit
Hardware: PC Windows 10
: P3 normal (vote)
Target Milestone: ---   Edit
Assignee: PHP Core CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-30 05:31 EDT by Thierry BLIND CLA
Modified: 2020-05-14 11:24 EDT (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Thierry BLIND CLA 2019-09-30 05:31:29 EDT
Hi,
while working on the formatter JUnit tests for PHP 7.4, I've seen that 2 tests always failed with following errors:


java.lang.AssertionError
	at org.eclipse.php.formatter.core.CodeFormatterVisitor.visit(CodeFormatterVisitor.java:2772)
	...

The 2 files are org.eclipse.php.formatter.core.tests/workspace/formatter/php5/default_formatter_configuration/phpWithHtml1.pdtt and org.eclipse.php.formatter.core.tests/workspace/formatter/php5/new_default_formatter_configuration/phpWithHtml1.pdtt

Strangely this (long standing) bug was never problematic for our PDT gerrit builds, so it stayed unnoticed...
Also an "assert" will not be "seen" by the users, so it was never reported.

The reason that we hit "assert false;" in class CodeFormatterVisitor is pretty simple to understand: between 2 PHP code regions we must/should only have one inline-html region. We should never have multiple successive InLineHtml objects, 
since we don't analyze (and split) HTML content, except with class PHPTokenizer that isolates PHP code sections inside HTML files.

So why do we have multiple successive inline-html regions?
It comes from 2 rules inside the ast_scanner.flex files:

<YYINITIAL>(([^<]|"<"[^?%s<])+)|"<s"|"<" {
	...
}
<YYINITIAL>"<?"|"<script"{WHITESPACES}"language"{WHITESPACES}?"="{WHITESPACES}?("php"|"\"php\""|"\'php\'"){WHITESPACES}?">" {
	...
}

The lexers are trying to detect if inline-html regions contain "<script language='php'>" text. In this case it will considered to be a php region, otherwise we create a new T_INLINE_HTML symbol each time we find "<s" text !!!
This "bad" side-effect is pretty easy to reproduce since we have some very common html tags starting with "<s", like <span>, <select>, <section> or <strong> tags.

Here is an example taken from the phpWithHtml1.pdtt file that will show the problem (in debug mode):
<?php if (isset($this->previous)):  ?>
<?php else :    ?>
  <span class="disabled">&lt; previous</span> |
<?php endif;   ?>

This 2 rules seem clumsy. We only handle <?, <?=, <?php (and obsolete <%, <%=) in class PHPTokenizer as PHP region delimiters, but we additionaly handle "<script language='php'>" tags in the PHPAstLexer classes. We anyway don't look for closing "</script>" tags, so I'm not sure those script tags are correctly handled...

For sure, PHP code inside "<script language='php'>" tags is badly highlighted and recognized by PDT. And this syntax is deprecated since PHP 7.0 :

https://www.php.net/manual/en/language.basic-syntax.phptags.php

I don't want to write specific code everywhere in PDT to correctly handle those deprecated script tags (that are removed since PHP 7.0) but do a minimal fix:
- remove those "<script language='php'>" rules from the ast_scanner.flex files starting from PHP 7.0
- add a workaround in class CodeFormatterVisitor to gracefully handle multiple successive InLineHtml objects for PHP < 7.0 code

Maybe we should even totally remove "<script language='php'>" from all ast_scanner.flex files, since we never really supported it, it would avoid clumsy workarounds...

@Dawid, what do you think?

Thierry.
Comment 1 Dawid Pakula CLA 2019-09-30 06:12:30 EDT
See also: https://bugs.eclipse.org/bugs/show_bug.cgi?id=196562 ;)

We decided to forgot about <script> tags, and stop fixing them. For me we can <script> can be removed.

BTW as far as I remember in theory you can even mix syntaxes. 
For example:
<script language='php'> echo 'something'; ?></body> 
Is perfectly valid on PHP 5.6 :P
Comment 2 Thierry BLIND CLA 2019-09-30 06:20:11 EDT
(In reply to Dawid Pakula from comment #1)
> See also: https://bugs.eclipse.org/bugs/show_bug.cgi?id=196562 ;)
> 
> We decided to forgot about <script> tags, and stop fixing them. For me we
> can <script> can be removed.
> 
> BTW as far as I remember in theory you can even mix syntaxes. 
> For example:
> <script language='php'> echo 'something'; ?></body> 
> Is perfectly valid on PHP 5.6 :P

Ok, great! It will then be an easy fix :)
Comment 3 Eclipse Genie CLA 2019-09-30 08:41:30 EDT
New Gerrit change created: https://git.eclipse.org/r/150373
Comment 5 Thierry BLIND CLA 2019-09-30 09:52:24 EDT
Fixed.

I also update my initial bug report:
even after removing "<script language='php'>" support, it's still possible to get multiple successive inline-html regions, so we still need to fix class CodeFormatterVisitor.

This will happen in very rare cases, when html code contains opening PHP short tags or opening asp tags (<?, <?=, <% or <%=) and the user don't want them as valid PHP code delimiters. In this case these "undesired" tags will be represented as individual InLineHtml statements, this is due to how the ast_scanner.flex files actually work.

Thierry.
Comment 6 Eclipse Genie CLA 2019-10-10 13:01:33 EDT
New Gerrit change created: https://git.eclipse.org/r/150896