We'll process HTML files to:
- Add missing meta tags
- Improve image accessibility
- Standardize heading structure
- Inject analytics scripts
python3 -m alterx.html [-h] [--depth-first] [--follow-symlinks] [--exclude GLOB] [--include GLOB] [--sizes min..max]
[--depth min..max] [--paths-from FILE] [--pretty | --no-pretty] [--ns-clean] [--recover] [--strip-ws]
[--strip-comments] [--strip-pi] [--xml-declaration | --no-xml-declaration] [-m] [-d NAME=VALUE] [-x SCRIPT]
[-o FILE] [--encoding USE_ENCODING] [-n]
[PATH ...]
positional arguments:
PATH
options:
-h, --help show this help message and exit
--pretty, --no-pretty
Save pretty formated
--ns-clean Try to clean up redundant namespace declarations
--recover Try hard to parse through broken XML
--strip-ws Discard blank text nodes between tags
--strip-comments Discard comments
--strip-pi Discard processing instructions
--xml-declaration, --no-xml-declaration
Add xml declaration
-m Modify flag
-d NAME=VALUE Define some variable
-x SCRIPT Extension script
-o FILE Output to FILE
--encoding USE_ENCODING
Encoding to use when saving
-n No modifiaction will happend
Traversal:
--depth-first Process each directory's contents before the directory itself
--follow-symlinks Follow symbolic links
--exclude GLOB exclude matching GLOB
--include GLOB include matching GLOB
--sizes min..max Filter sizes: 1k.., 4g, ..2mb
--depth min..max Check for depth: 2.., 4, ..3
--paths-from FILE read list of source-file names from FILE

mkdir -p website
cat > website/index.html <<EOF
<!DOCTYPE html>
<html>
<head>
<title>Home Page</title>
</head>
<body>
<h2>Welcome</h2>
<img src="logo.png">
<div class="content">
<p>Main page content</p>
</div>
</body>
</html>
EOF
cat > website/about.html <<EOF
<!DOCTYPE html>
<html>
<head>
<title>About Us</title>
<meta name="description" content="Learn about our company">
</head>
<body>
<h3>Our Story</h3>
<img src="team.jpg" width="300">
</body>
</html>
EOF

# html_optimizer.py: extension script passed to alterx via -x
from lxml import html


def init(app):
    # Configuration parameters
    app.defs.update({
        'SITE_NAME': 'My Website',
        'ANALYTICS_ID': 'UA-1234567-1',
        'DEFAULT_META_DESC': 'Default description for pages without one'
    })


def process(doc, stat, app):
    root = doc.getroot()
    # Ensure proper HTML structure
    if root.tag != 'html':
        return False

    # HEAD section processing
    head = root.find('head')
    if head is not None:
        # Add missing charset meta (".//" keeps the search inside <head>)
        if not head.xpath('.//meta[@charset]'):
            meta = html.Element('meta', charset='UTF-8')
            head.insert(0, meta)
        # Add missing viewport meta
        if not head.xpath('.//meta[@name="viewport"]'):
            meta = html.Element('meta', name='viewport',
                                content='width=device-width, initial-scale=1')
            head.insert(1, meta)
        # Add default description if missing
        if not head.xpath('.//meta[@name="description"]'):
            meta = html.Element('meta', name='description',
                                content=app.defs['DEFAULT_META_DESC'])
            head.append(meta)
        # Add canonical link if missing
        if not head.xpath('.//link[@rel="canonical"]'):
            link = html.Element('link', rel='canonical',
                                href=f"https://example.com/{stat.path.name}")
            head.append(link)

    # BODY section processing
    body = root.find('body')
    if body is not None:
        # Add an empty alt attribute to images that lack one
        for img in body.xpath('.//img[not(@alt)]'):
            img.set('alt', '')
        # Convert width/height attributes to CSS (values are treated as pixels)
        for img in body.xpath('.//img[@width or @height]'):
            style = img.get('style', '')
            if img.get('width'):
                style += f"width: {img.get('width')}px;"
                del img.attrib['width']
            if img.get('height'):
                style += f"height: {img.get('height')}px;"
                del img.attrib['height']
            img.set('style', style)
        # Standardize heading hierarchy: promote the first heading to h1.
        # Compare against None; an element without children is falsy in lxml.
        first_h = next((e for e in body.iter() if e.tag in ('h1','h2','h3','h4','h5','h6')), None)
        if first_h is not None and first_h.tag != 'h1':
            first_h.tag = 'h1'
        # Inject analytics before closing body. The check looks for the
        # injected snippet itself, so repeated runs do not add it twice.
        if not root.xpath('//script[contains(., "window.ga")]'):
            script = html.Element('script')
            script.text = f"""
window.ga=window.ga||function(){{(ga.q=ga.q||[]).push(arguments)}};
ga('create', '{app.defs['ANALYTICS_ID']}', 'auto');
ga('send', 'pageview');
"""
            body.append(script)


def end(app):
    print(f"Optimized {app.total.Altered}/{app.total.Files} HTML files")
    # process() above never increments MetaTags or Images, so these two
    # lines report 0 unless such counters are added.
    print(f"Added {getattr(app.total, 'MetaTags', 0)} meta tags")
    print(f"Fixed {getattr(app.total, 'Images', 0)} images")

# With HTML pretty printing
python -m alterx.html -mm -x html_optimizer.py website
# Alternative with more aggressive HTML cleaning
python -m alterx.html -mm --strip-comments --strip-pi -x html_optimizer.py website

index.html:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Home Page</title>
<meta
name="description"
content="Default description for pages without one"
/>
<link rel="canonical" href="https://example.com/index.html" />
</head>
<body>
<h1>Welcome</h1>
<img src="logo.png" alt="" />
<div class="content">
<p>Main page content</p>
</div>
<script>
window.ga =
window.ga ||
function () {
(ga.q = ga.q || []).push(arguments);
};
ga("create", "UA-1234567-1", "auto");
ga("send", "pageview");
</script>
</body>
</html>

about.html:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>About Us</title>
<meta name="description" content="Learn about our company" />
<link rel="canonical" href="https://example.com/about.html" />
</head>
<body>
<h1>Our Story</h1>
<img src="team.jpg" style="width: 300px;" alt="" />
<script>
window.ga =
window.ga ||
function () {
(ga.q = ga.q || []).push(arguments);
};
ga("create", "UA-1234567-1", "auto");
ga("send", "pageview");
</script>
</body>
</html>

- HTML5 Structure: Ensures proper document structure
- SEO Optimization: Adds meta tags and canonical links
- Accessibility: Handles image alt text and heading hierarchy
- Modern Practices: Moves image width/height attributes into inline CSS
- Analytics Injection: Adds tracking scripts non-invasively
- Change Tracking: Prints a summary of altered files when the run ends
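
The width/height conversion listed above is easier to test when it is pulled out as a pure function. The following is a minimal sketch and not part of alterx: the helper name dims_to_style is hypothetical, and it assumes that bare numbers are pixel counts while values that already carry a unit (such as 50%) should pass through unchanged.

```python
import re

# Hypothetical helper (not part of alterx): fold width/height attributes
# from an attribute mapping into an inline CSS style string.
_UNITLESS = re.compile(r"^\d+(\.\d+)?$")


def dims_to_style(attrib):
    """Return a copy of attrib with width/height moved into 'style'."""
    result = dict(attrib)
    declarations = []
    existing = result.get("style", "").strip()
    if existing:
        # Make sure the existing declarations end with a semicolon.
        declarations.append(existing if existing.endswith(";") else existing + ";")
    for name in ("width", "height"):
        value = result.pop(name, None)
        if value is None or not value.strip():
            continue
        value = value.strip()
        # Assumption: bare numbers are pixels; anything else keeps its unit.
        if _UNITLESS.match(value):
            value += "px"
        declarations.append(f"{name}: {value};")
    if declarations:
        result["style"] = " ".join(declarations)
    return result


print(dims_to_style({"src": "team.jpg", "width": "300"}))
# prints {'src': 'team.jpg', 'style': 'width: 300px;'}
```

Keeping the conversion free of any element API means it can be checked with plain dictionaries before it is wired into an extension script.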
This example shows how an alterx extension script can apply a consistent set of HTML fixes (meta tags, image attributes, heading levels, analytics) across a whole site in one pass. The same pattern can be adapted to other HTML processing tasks such as template injection, CSS/JS bundling, or content migration.
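
As one illustration of adapting the pattern, here is a rough sketch of a template-injection script that appends a site footer. It keeps the init/process hook shape used above, but several things are assumptions: the footer id site-footer is made up for this example, returning True or False to signal a change is a guess at how alterx interprets the result, and the code uses the standard library's xml.etree.ElementTree with stand-in objects so that it runs without alterx or lxml installed.

```python
import xml.etree.ElementTree as ET
from types import SimpleNamespace


def init(app):
    # Same hook shape as the optimizer script; SITE_NAME is reused from it.
    app.defs.setdefault("SITE_NAME", "My Website")


def process(doc, stat, app):
    """Append a site footer to <body> unless one is already present."""
    root = doc.getroot()
    body = root.find("body")
    if body is None:
        return False
    # Idempotency check: skip pages that already carry the footer.
    if body.find(".//footer[@id='site-footer']") is not None:
        return False
    footer = ET.SubElement(body, "footer", {"id": "site-footer"})
    footer.text = f"Copyright {app.defs['SITE_NAME']}"
    return True


# Stand-in objects so the sketch runs on its own (not the real alterx app).
app = SimpleNamespace(defs={})
init(app)
doc = ET.ElementTree(ET.fromstring("<html><body><p>Hi</p></body></html>"))
print(process(doc, None, app))  # prints True: footer added
print(process(doc, None, app))  # prints False: footer already present
```

Checking for the injected element before adding it keeps the script safe to run repeatedly over the same tree.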