Skip to content

Fix language-pack generation in lancheck.php (v2.3.5+ regressions)#5703

Open
Kanonimpresor wants to merge 5 commits into
e107inc:masterfrom
Kanonimpresor:fix/lancheck-language-pack-generation
Open

Fix language-pack generation in lancheck.php (v2.3.5+ regressions)#5703
Kanonimpresor wants to merge 5 commits into
e107inc:masterfrom
Kanonimpresor:fix/lancheck-language-pack-generation

Conversation

@Kanonimpresor

@Kanonimpresor Kanonimpresor commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Closes #5652

What this does

Admin → Tools → Language Check → Generate Pack could not produce a usable
language pack on installs that use the new $config = [...] format or modern
language files (const / return [...]). This PR repairs the generator.

Per the discussion in #5652, bug 8 (ver.php / online packs) is intentionally
out of scope
— it was already fixed server-side by @Deltik (languagepacks.xml
now matches by major.minor with a latest-tag fallback), so the client-side
fallback is no longer needed.

Commits (semantic, in the order agreed on the issue)

  1. Blockers / criticalFix(lancheck): repair language-pack generation blockers on v2.3.5+

    • Bug 7 pre-flight path check: fall back to $e107_paths when the
      $LANGUAGES/$THEMES/$HELP_DIRECTORY globals are empty under the new
      $config = [...] format (was aborting with the LANG_LAN_26 label).
    • Bug 9 XML manifest temp path: write to e_SYSTEM."temp/lancheck/"
      (created on demand) instead of the removed e_FILE."public/", so the
      <lang>/<lang>.xml manifest actually ships in the ZIP.
    • Bug 5 XML escaping: htmlspecialchars(..., ENT_QUOTES|ENT_XML1) on the
      name/email/url attributes — a quote in USERNAME/USEREMAIL no longer
      produces invalid XML the importer rejects (and no longer poisons the
      on-disk <lang>.xml).
    • Bug 10 duplicate manifest: exclude the on-disk <lang>.xml from the file
      list before adding the freshly generated manifest.
    • Bug 6 modern-syntax parsing: fill_phrases_array() now also recognises
      const KEY = '...'; and return [ 'KEY' => '...' ]; in addition to
      define(), so modern packs are no longer reported as File missing! /
      n missing phrases.
  2. Minor correctnessFix(lancheck): minor correctness fixes in language-pack flow

    • Bug 1 fclose($fp) instead of the undefined $writeit.
    • Bug 2 show $status['message'] instead of the boolean $status['error']
      (user was seeing a literal 1).
    • Bug 3 include const and = in findLocale()'s strip list so
      const CORE_LC = "es"; resolves correctly.
  3. UXImprove(lancheck): actionable error breakdown for translators

    • Bug 4 explicit breakdown of missing files / phrases / BOM / non-UTF8
      instead of a bare Errors found: [x] missing, plus a hint pointing to the
      Verify tab and the E107_DEBUG_LEVEL bypass.

Validation

  • php -l e107_admin/lancheck.php clean on every commit.
  • Generated Spanish and Portuguese packs on a v2.3.5 install with the
    new $config format: each ZIP contains a single, well-formed
    <lang>/<lang>.xml; the importer accepts them.
  • All simplexml_load_string / file_put_contents warnings gone.
  • Parser tested against legacy define()-only packs and fully modernised
    (return [...]) packs.

Scope

  • Single file: e107_admin/lancheck.php (+83 / −11).
  • No public API change, no new DB columns, no new language strings.
  • Branch based on current upstream/master.

Admin > Tools > Language Check > Generate Pack could not produce a usable
language pack on installs using the new $config = [...] format or modern
language files. This addresses the blocking/critical regressions:

- Pre-flight path check (e107inc#7): fall back to $e107_paths when the
  $LANGUAGES/$THEMES/$HELP_DIRECTORY globals are empty (new config format),
  instead of aborting with the LANG_LAN_26 preference label.
- XML manifest temp path (e107inc#9): write to e_SYSTEM."temp/lancheck/" (created
  on demand) instead of the removed e_FILE."public/" location, so the
  <lang>/<lang>.xml manifest is actually shipped in the ZIP.
- XML escaping (e107inc#5): htmlspecialchars(..., ENT_QUOTES|ENT_XML1) the
  name/email/url attributes so a quote in USERNAME/USEREMAIL no longer
  produces invalid XML that the importer rejects.
- Duplicate manifest (e107inc#10): exclude the on-disk <lang>.xml from the file
  list before adding the freshly-generated manifest, avoiding two entries
  for the same archive path.
- Modern syntax parsing (e107inc#6): fill_phrases_array() now also recognises
  `const KEY = '...';` and `return [ 'KEY' => '...' ];` in addition to
  define(), so byte-perfect modern packs are no longer reported as
  "File missing!" / "n missing phrases".

Refs e107inc#5652
- fclose() (e107inc#1): close the actual open handle ($fp) instead of the
  undefined $writeit variable.
- Error message (e107inc#2): show $status['message'] rather than the boolean
  $status['error'] flag, so the user no longer sees a literal "1".
- Locale parsing (e107inc#3): include `const` and `=` in findLocale()'s strip
  list so modern `const CORE_LC = "es";` declarations resolve to the
  correct locale instead of an empty string.

Refs e107inc#5652
)

When a pack fails validation the message was just "Errors found: [x] missing",
giving no indication of what to fix. Build an explicit breakdown of missing
files, missing/invalid phrases, files with BOM/illegal characters and
non-UTF8 phrases, plus a hint pointing to the Verify tab and the
E107_DEBUG_LEVEL bypass.

Refs e107inc#5652
Comment thread e107_admin/lancheck.php Outdated
Comment thread e107_admin/lancheck.php Outdated

@e107help e107help Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Olá de novo, Martin, and thanks for the clean follow-through on #5652.

I traced this against the live code and tested the parser branches locally (no live install, so "traced" not "reproduced"). Tidy scope: one file, +83/-11, the commit split matches what we agreed, and php -l is clean on 8.5.

What checks out:

  • Parsers (bug 6): I ran the three branches against real core LAN files. Legacy define() still parses exactly (forum English_global.php -> 9, siteinfo -> 11), so the back-compat path survives. The new branches pick up the modern files: lan_user.php -> 107 keys (the commented-out define()s inside the /* */ block are stripped first), admin/lan_admin.php -> 420, plus const CORE_LC/CORE_LC2. The $retloc[$type] = array() init is what stops a phrase-less modern file reading as File missing!.
  • bug 3 resolves const CORE_LC = "es" -> es_ES; bug 4 reads the real session keys; bug 9 writes under e_SYSTEM."temp/lancheck/" with a recursive mkdir; bug 10 filters the on-disk <lang>.xml before re-adding the fresh manifest.

I've left one inline note on the bug 7 path resolution. @Deltik has already flagged the regex (tokenizer) and the attribute building (toAttributes()), and I agree on both. Worth knowing: the toAttributes() switch also absorbs the bug 5 escaping, so you won't need the manual htmlspecialchars(..., ENT_XML1) there once it's in.

One minor, non-blocking thing: the error breakdown and the Tip: line are hardcoded English, in a tool that's all about translation. Shipping no new strings is a fair call for a hotfix, but a follow-up to fold them into LANG_LAN_* would let translators have them too. :-D

Comment thread e107_admin/lancheck.php Outdated
Comment on lines +671 to +678
$paths = is_array($e107_paths) ? $e107_paths : array();
$handlersDir = $HANDLERS_DIRECTORY ?: ($paths['handlers'] ?? ($paths['HANDLERS_DIRECTORY'] ?? 'e107_handlers/'));
$languagesDir = $LANGUAGES_DIRECTORY ?: ($paths['languages'] ?? ($paths['LANGUAGES_DIRECTORY'] ?? 'e107_languages/'));
$themesDir = $THEMES_DIRECTORY ?: ($paths['themes'] ?? ($paths['THEMES_DIRECTORY'] ?? 'e107_themes/'));
$helpDir = $HELP_DIRECTORY ?: ($paths['help'] ?? ($paths['HELP_DIRECTORY'] ?? 'e107_docs/help/'));
$pluginsDir = $PLUGINS_DIRECTORY ?: ($paths['plugins'] ?? ($paths['PLUGINS_DIRECTORY'] ?? 'e107_plugins/'));

if(($HANDLERS_DIRECTORY != "e107_handlers/") || ( $LANGUAGES_DIRECTORY != "e107_languages/") || ($THEMES_DIRECTORY != "e107_themes/") || ($HELP_DIRECTORY != "e107_docs/help/") || ($PLUGINS_DIRECTORY != "e107_plugins/"))
if(($handlersDir != "e107_handlers/") || ($languagesDir != "e107_languages/") || ($themesDir != "e107_themes/") || ($helpDir != "e107_docs/help/") || ($pluginsDir != "e107_plugins/"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This $e107_paths fallback never actually fires. class2.php unsets $e107_paths immediately after initCore() (class2.php#L239), so on any admin page global $e107_paths is null and $paths becomes array(). Every lookup here then collapses to the hardcoded literal.

On a default install that still lands on the right answer, so the reported bug is fixed. But on the new $config format the non-standard-layout guard just below can no longer fire: the themes/languages/help globals are empty and $e107_paths is gone, so all three resolve to the defaults regardless of what the site actually uses.

getFileList() already does this the canonical way a few lines down, via e107::getFolder() (L712-714). That reads e107_dirs, which is populated on both config formats:

$handlersDir  = e107::getFolder('handlers');
$languagesDir = e107::getFolder('languages');
$themesDir    = e107::getFolder('themes');
$helpDir      = e107::getFolder('help');
$pluginsDir   = e107::getFolder('plugins');

That drops the dead $e107_paths branch and keeps the guard meaningful on both formats.

@Kanonimpresor Kanonimpresor requested a review from Deltik June 4, 2026 12:21
- Bug 7: resolve language/theme/help paths via e107::getFolder() instead of the dead $e107_paths fallback (class2.php unsets it after initCore()), so the non-standard-layout guard works on both the flat and $config=[...] formats.

- Bug 5: build the XML manifest attributes with e_parse::toAttributes(), which handles escaping internally and drops the manual htmlspecialchars(ENT_XML1) concatenation.

- Bug 6: parse locale files with the PHP tokenizer (token_get_all) instead of preg_match_all(); recognises define()/const/return[] plus setlocale(), ignores commented-out code natively, and adds decodeStringToken() for quote/escape handling.
Adds cases for const, return-array (long + short syntax with escapes), commented-out statements that must be ignored, and phrase-less files that must still register their type. Validates the new token_get_all()-based parser requested in the e107inc#5703 review.
Comment thread e107_admin/lancheck.php
$pluginsDir = $PLUGINS_DIRECTORY ?: ($paths['plugins'] ?? ($paths['PLUGINS_DIRECTORY'] ?? 'e107_plugins/'));
// Patch (lancheck-config-paths): resolve directory names via e107::getFolder(),
// which reads e107_dirs and is populated on BOTH the legacy flat config and the
// new $config = [...] format. The previous $e107_paths fallback was dead code:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment does not make sense in context, since whoever reads this code won't know what "previous $e107_paths" were. Please remove or trim the comments to the bare necessities.

@Kanonimpresor

Kanonimpresor commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review, @Deltik and @e107help! I've pushed two commits
addressing all three blocking points.

09cae28 — refactor(lancheck): address review on #5703

  • Bug 7 (dead $e107_paths fallback): you're right, class2.php unsets
    $e107_paths right after initCore(), so the guard could never fire on the
    $config = [...] format. Switched to the canonical e107::getFolder() for all
    five paths (handlers/languages/themes/help/plugins), which reads
    e107_dirs and is populated on both config formats. The dead branch is gone.
  • Bug 5 (XML attributes): now built with e_parse::toAttributes() instead of
    manual concatenation. As you noted, this absorbs the escaping too, so the
    htmlspecialchars(..., ENT_XML1) calls are removed.
  • Bug 6 (regex → tokenizer): fill_phrases_array() no longer uses
    preg_match_all(). It now walks token_get_all() output and recognises
    define(), const KEY = '...', 'KEY' => '...' (array form), and
    setlocale(). The tokenizer ignores commented-out statements natively, so the
    separate /* */ pre-strip pass is gone. Added a small decodeStringToken()
    helper for single/double-quote escape handling.

bccf93b — test(lancheck): expand fill_phrases_array() coverage

Expanded lancheckTest::testFill_phrases_array() per your suggestion, with cases
for const, return-array (long and short [] syntax, including escaped quotes),
commented-out statements that must be ignored, and a phrase-less file that must
still register its type (so it isn't misreported as File missing!).

I validated the parser branches locally against real core LAN files and the new
fixtures — define() counts are unchanged (back-compat preserved) and the modern
branches resolve as expected. php -l is clean on both files.

On the non-blocking note about the hardcoded English error breakdown / Tip:
line: agreed — happy to follow up by folding those into LANG_LAN_* so
translators get them too, either in this PR or a separate one, whichever you
prefer.


Disclosure: drafted with AI assistance (Claude Opus 4.8 by VSCode) (per CONTRIBUTING.md). I read, ran, and
validated every change myself — the tokenizer branches were exercised against a
standalone harness and the new test fixtures before pushing.

@e107help e107help Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ficou bem mais limpo, Martin. I re-traced both new commits (09cae28, bccf93b) and ran the new tokenizer against the real core LAN files and your fixtures. The headline: this isn't just regression-free, it's more correct than the regex it replaces.

Parser parity, with two latent bugs quietly fixed. I diffed the harvested key set (old regex vs new tokenizer) across 14 core LAN files. 13 come out identical. The one that differs is English.php, and it differs in the right direction: the old regex reported 130 keys, the tokenizer 126. The four it stops emitting are CHARSET, TEXTDIRECTION, CORE_LAN2, CORE_LAN3, every one of them commented out with // (English.php#L18-L22). The old /* */-only pre-strip let those phantom phrases leak into the generated pack; token_get_all() drops them for free. Your ignoresComments fixture pins exactly that.

The second fix is in the values: phrases now decode to their real runtime form. CORE_LAN1 used to come through as raw source with doubled backslashes (...missing.\\n\\n...); decodeStringToken() returns the true \n. The pack was subtly wrong there before too.

The define() counts on the legacy files are unchanged (lan_admin 420, lan_signup 109, lan_user 107) and all five fixtures pass when I exercise the function directly, so the back-compat path is solid.

Two inline notes below, neither a merge blocker: an optional $pure polish on the toAttributes() switch, and a test fixture that opens with return array( and closes with ];. That one is a parse error as real PHP and only goes green because token_get_all() is lenient (no TOKEN_PARSE flag), so it's worth tightening to valid syntax.

Minor heads-up while I had the harness out: the setlocale() capture changed shape as a side effect. it stored the raw argument list before, now it keeps the first decoded locale (LC_ALL => 'en_GB.UTF-8'). Harmless for a language pack, just flagging it in case it surprises you.

@Deltik already caught the leftover $e107_paths comment, so I'll leave that lane to him. Bom trabalho on the turnaround. ;-)

Comment thread e107_admin/lancheck.php
Comment on lines +729 to +738
$langAttributes = $tp->toAttributes(array(
'name' => $language,
'compatibility' => $ver,
'date' => date("Y-m-d"),
));
$authorAttributes = $tp->toAttributes(array(
'name' => USERNAME,
'email' => USEREMAIL,
'url' => SITEURL,
));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right call, and it absorbs the bug 5 escaping cleanly: toAttribute() runs htmlspecialchars(..., ENT_QUOTES), whose output (&#039; and friends) is well-formed XML, so dropping the manual ENT_XML1 pass is correct.

One defensive tweak worth considering: pass $pure = true. By default toAttributes() sends each value through replaceConstants() (e_parse_class.php#L1999-L2002), and these are literal manifest values (SITEURL, USEREMAIL, the language name) that have no business being path-substituted. It happens to be a no-op today (with the empty $mode, replaceConstants() only forward-replaces literal {e_XXX} tokens, none of which appear in a URL or an email), so this is about stating intent rather than fixing a live bug.

Same helper, one behavioural nuance to keep in mind: it skips empty-valued attributes (L2031). So if USEREMAIL is ever blank, the old email="" now drops out of the tag entirely instead of emitting empty. The reader tolerates a missing attribute, so this is a heads-up, not a blocker.

Comment on lines +90 to +93
'return array('."\n".
"\t'LAN1' => 'Főadminisztrátor',"."\n".
"\t'LAN2' => 'Hői',"."\n".
'];';

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixture is mismatched: it opens with return array( but closes with ];, which is a parse error as real PHP (Unclosed '(' ... does not match ']'). It only goes green because fill_phrases_array() calls token_get_all() without the TOKEN_PARSE flag, so the tokenizer never checks that the brackets balance. it emits the 'KEY' => 'value' tokens regardless of the closing bracket.

Since this is the case meant to exercise the long array() form (the next test covers short []), close it with ) so the fixture is valid PHP:

		$strings =
			'return array('."\n".
			"\t'LAN1' => 'Főadminisztrátor',"."\n".
			"\t'LAN2' => 'Hői',"."\n".
			');';

Worth a quick guard in general: a fixture that the parser would reject is only proving the tokenizer's leniency, not the syntax it claims to cover.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Language Pack generator is broken on v2.3.5+ (10 related bugs in lancheck.php)

2 participants