# TeXML specification

TeXML is an XML syntax for TeX. A processor translates TeXML source into TeX.

The Document Type Definition (DTD) for TeXML can be found in a TeXML distribution package.

## Root element: TeXML

<?xml version="1.0" encoding="..."?>
<TeXML>
</TeXML>


The root element of a TeXML document is the element TeXML.

## Encoding commands: cmd

TeXML:
<cmd name="documentclass">
<opt>12pt</opt>
<parm>letter</parm>
</cmd>
TeX:
\documentclass[12pt]{letter}

The TeXML cmd element encodes TeX commands.

• To add options to a command, add opt children to the cmd element. The processor places opt children within square brackets, in the style of LaTeX options.
• To add parameters to a command, add parm children to the cmd element. The processor places parm children within TeX groups, that is, curly braces.

The TeXML cmd can have several parm or opt elements.
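
As a rough illustration of the rule above, the following Python sketch renders a command from an ordered list of its opt and parm children. The function name `render_cmd` and the tuple representation of children are assumptions made for this example and are not part of the TeXML processor's interface. Escaping of the argument text is left out.

```python
def render_cmd(name, children=()):
    """Render a TeX command from (kind, text) pairs, kind being "opt" or "parm"."""
    # Hypothetical helper: opt children go in square brackets,
    # parm children go in curly braces, in document order.
    parts = ["\\" + name]
    for kind, text in children:
        if kind == "opt":
            parts.append("[" + text + "]")
        elif kind == "parm":
            parts.append("{" + text + "}")
        else:
            raise ValueError("unexpected child kind: " + kind)
    return "".join(parts)


print(render_cmd("documentclass", [("opt", "12pt"), ("parm", "letter")]))
# prints \documentclass[12pt]{letter}
```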

## Encoding environments: env

TeXML:
<env name="document">
...
</env>
TeX:
\begin{document}
...
\end{document}

The element env is a convenience for expressing LaTeX environments.

## Encoding groups: group

TeXML:
<group><cmd name="it"/>italics</group>
TeX:
{\it italics}

The group element is a convenience for encoding groups. The processor will supply an opening brace at the beginning, and a closing brace at the end of the group.

## Encoding math groups: math and dmath

TeXML:
<math>a+b</math>
<dmath><cmd name="sqrt"><parm>2</parm></cmd></dmath>
TeX:
$a+b$
$$\sqrt{2}$$

Elements math and dmath are conveniences for encoding math groups. The processor inserts the appropriate math shift symbol at the beginning and end of the group and also switches mode to math inside the group.

## Encoding control symbols: ctrl

TeXML:
line1<ctrl ch="\"/>line2
TeX:
line1\\line2

The ch attribute of the ctrl element encodes a control symbol: the processor outputs a backslash followed by the value of ch.

## Encoding special symbols: spec

TeXML:
<spec cat="vert"/>l<spec cat="vert"/>
TeX:
|l|

The attribute cat of the element spec creates the corresponding symbol verbatim, without escaping.

cat attribute values
description cat value output
escape character esc \
begin group bg {
end group eg }
math shift mshift $
alignment tab align &
parameter parm #
superscript sup ^
subscript sub _
tilde tilde ~
comment comment %
vertical line vert |
less than lt <
greater than gt >

## PDF literals: pdf

TeXML:
<pdf>τεχ</pdf>
TeX:
\003\304\003\265\003\307

Content of the element pdf is converted to UTF16BE encoding and represented using escaped octal codes. The result is a PDF unicode string.

## Advanced topics

### Characters

Characters are processed as follows:

• If a character has a special meaning for TeX, then the character is translated as shown in the table below.
• If the character belongs to an output encoding, then the character is output as-is.
• If the character exists in a LaTeX unicode mapping table, then a corresponding substitution for the character is used.
• Otherwise the character is output as \unicodechar{NNNNN} where NNNNN is the decimal code for the character.

To leave specials as is, without escaping, use the TeXML attribute escape:

<TeXML escape="0">...</TeXML>

Mapping of the special symbols
symbol text mode math mode
\ \textbackslash{} \backslash{}
{ \{ \{
} \} \}
$ \textdollar{} \$
& \& \&
# \# \#
^ \^{} \^{}
_ \_ \_
~ \textasciitilde{} \~{}
% \% \%
| \textbar{} |
< \textless{} <
> \textgreater{} >
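
The character-processing steps of this section can be sketched in Python roughly as follows. This is a minimal illustration, not the processor's actual code: the names `SPECIALS` and `translate` are invented for the example, the `$` row is assumed to map to `\textdollar{}` in text mode and `\$` in math mode, and the lookup in the LaTeX Unicode mapping table is omitted.

```python
# Special symbols and their (text mode, math mode) replacements,
# taken from the mapping table. The "$" row is an assumption.
SPECIALS = {
    "\\": (r"\textbackslash{}", r"\backslash{}"),
    "{": (r"\{", r"\{"),
    "}": (r"\}", r"\}"),
    "$": (r"\textdollar{}", r"\$"),
    "&": (r"\&", r"\&"),
    "#": (r"\#", r"\#"),
    "^": (r"\^{}", r"\^{}"),
    "_": (r"\_", r"\_"),
    "~": (r"\textasciitilde{}", r"\~{}"),
    "%": (r"\%", r"\%"),
    "|": (r"\textbar{}", "|"),
    "<": (r"\textless{}", "<"),
    ">": (r"\textgreater{}", ">"),
}


def translate(text, mode="text", encoding="ascii", escape=True):
    """Translate characters for TeX output (hypothetical helper)."""
    column = 0 if mode == "text" else 1
    out = []
    for ch in text:
        # Step 1: TeX specials, unless escaping is switched off.
        if escape and ch in SPECIALS:
            out.append(SPECIALS[ch][column])
            continue
        # Step 2: characters of the output encoding are written as-is.
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # Step 3 (Unicode mapping table) is not modelled here.
            # Step 4: fall back to the decimal character code.
            out.append(r"\unicodechar{%d}" % ord(ch))
    return "".join(out)
```

For example, `translate("a|b")` gives `a\textbar{}b`, while `translate("a|b", mode="math")` leaves the string unchanged.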

The LaTeX mapping table for Unicode characters is automatically generated from the file unicode.xml. This file is an appendix to the W3C MathML specification.

If the replacement for a Unicode character is valid only in math mode and the current mode is text, the replacement is wrapped in the command “\ensuremath”. Likewise, if the replacement is valid only in text mode and the current mode is math, the wrapper “\ensuretext” is used.

LaTeX does not provide the command “\ensuretext”, so you must define it yourself. One approach is:

\def\ensuretext{\textrm}
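
The wrapping rule can be pictured with a small Python sketch. The function `wrap_replacement` and its `valid_in` argument are hypothetical, and the exact form of the wrapper (the replacement placed in braces after the command) is an assumption about the output.

```python
def wrap_replacement(replacement, valid_in, current_mode):
    """Wrap a replacement that is valid in only one mode.

    valid_in is "text", "math" or "both"; current_mode is "text" or "math".
    Hypothetical helper; the brace form of the wrapper is assumed.
    """
    if valid_in == "math" and current_mode == "text":
        return r"\ensuremath{" + replacement + "}"
    if valid_in == "text" and current_mode == "math":
        return r"\ensuretext{" + replacement + "}"
    # The replacement is valid in the current mode: no wrapper needed.
    return replacement
```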

### Empty lines

Empty lines have a special meaning for TeX: they cause automatic generation of the TeX command \par. To avoid this, the processor outputs a line containing the single character % (a TeX comment) instead of an empty line.

To leave empty lines as is, use the TeXML attribute emptylines:

<TeXML emptylines="1">...</TeXML>
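
A minimal Python sketch of this behaviour is shown below. The function name and the boolean `emptylines` parameter are illustrative, and treating whitespace-only lines as empty is an assumption.

```python
def protect_empty_lines(tex, emptylines=False):
    """Replace empty lines with a lone % so TeX does not insert \\par."""
    if emptylines:
        # Corresponds to emptylines="1": leave the text untouched.
        return tex
    lines = tex.splitlines()
    # Assumption: a line holding only whitespace counts as empty.
    out = ["%" if not line.strip() else line for line in lines]
    result = "\n".join(out)
    # Keep a trailing newline if the input had one.
    if tex.endswith("\n"):
        result += "\n"
    return result
```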

### Ligatures

The TeXML processor breaks up the well-known ligatures “--”, “---”, “``”, “''”, “!`” and “?`”. These ligatures are converted into “-{}-”, “-{}-{}-”, “`{}`”, “'{}'”, “!{}`” and “?{}`” respectively.

To leave ligatures as is, use the TeXML attribute ligatures:

<TeXML ligatures="1">...</TeXML>
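
One way to picture the ligature handling is the Python sketch below, which inserts an empty group between the characters of each ligature. It assumes the ligatures in question are the standard TeX ones: two or three hyphens, two backticks, two apostrophes, and an exclamation or question mark followed by a backtick. The regular expression and the function name are choices made for this example, not the processor's implementation.

```python
import re

# Match the first character of a ligature pair via a lookahead,
# so that "---" is handled as two overlapping pairs.
_LIGATURE_START = re.compile(r"-(?=-)|`(?=`)|'(?=')|[!?](?=`)")


def break_ligatures(text, ligatures=False):
    """Insert {} inside ligatures unless ligatures are to be kept."""
    if ligatures:
        # Corresponds to ligatures="1": leave the text untouched.
        return text
    return _LIGATURE_START.sub(lambda m: m.group(0) + "{}", text)
```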

### Modes

There are two modes: text and math. Modes only affect the translation of characters.

The default mode is text. In order to change mode, use the mode attribute of the element TeXML. The possible values for this attribute are math and text. If the element TeXML is used without attribute mode, then the mode is not changed.

<TeXML mode="math">
... math mode here ...
<TeXML mode="text">... text mode here ...</TeXML>
</TeXML>

Elements math and dmath also change mode to math.
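
The way modes nest can be sketched in Python with the standard `xml.etree.ElementTree` module. The function `text_with_modes` is hypothetical: it walks a document without a namespace and reports which mode applies to each piece of text. It does not translate anything.

```python
import xml.etree.ElementTree as ET


def text_with_modes(elem, mode="text"):
    """Return (text, mode) pairs for the non-blank text under elem."""
    if elem.tag == "TeXML":
        # Without a mode attribute, the inherited mode is kept.
        mode = elem.get("mode", mode)
    elif elem.tag in ("math", "dmath"):
        mode = "math"
    pairs = []
    if elem.text and elem.text.strip():
        pairs.append((elem.text.strip(), mode))
    for child in elem:
        pairs.extend(text_with_modes(child, mode))
        # Text after a child belongs to the parent, so it uses the parent's mode.
        if child.tail and child.tail.strip():
            pairs.append((child.tail.strip(), mode))
    return pairs


doc = ET.fromstring('<TeXML mode="math">a<TeXML mode="text">b</TeXML>c</TeXML>')
print(text_with_modes(doc))
# prints [('a', 'math'), ('b', 'text'), ('c', 'math')]
```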

### Whitespace processing

The TeXML processor performs advanced whitespace processing. The program

• removes what can be regarded as insignificant whitespace, and
• introduces its own whitespace which would look reasonable from a human point of view.

If you find that something goes wrong you can switch off whitespace elimination using the ws attribute of the TeXML tag.

<TeXML ws="1">
... whitespace is verbatim here ...
</TeXML>

If the TeXML elements ctrl or spec have any content (including whitespace) then the TeXML processor reports an error.

The program deletes any whitespace that is located directly in the TeXML element cmd.

Insignificant whitespace is whitespace around any opening or closing tag, for example, whitespace around “... <TeXML> ...” and “... </TeXML> ...”. The XML reader converts insignificant whitespace into a weak space.

Another source of weak spaces is TeX commands. When the processor converts “<cmd name="it"/>” into “\it ”, the space after “\it” is a weak space.

The TeX writer processes weak spaces in the following manner:

• repeated weak spaces are interpreted as one weak space,
• a weak space at the beginning of a line is ignored,
• a weak space at the end of a line is ignored,
• otherwise the usual space symbol (or new line, see below) is written.
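
The rules above can be sketched in Python as follows. The token representation (ordinary strings mixed with a `WEAK` marker) and the function name are assumptions made for this example. Automatic line breaking is not modelled, so a surviving weak space is always written as a single space.

```python
WEAK = object()  # marker for a weak space in the token stream


def write_tokens(tokens):
    """Write strings and WEAK markers, resolving weak spaces."""
    out = []
    pending = False  # True if one or more weak spaces are waiting
    for tok in tokens:
        if tok is WEAK:
            # Repeated weak spaces collapse into one pending space.
            pending = True
            continue
        for ch in tok:
            if ch == "\n":
                # A weak space at the end of a line is dropped.
                pending = False
            elif pending:
                # A weak space at the start of a line (or of the output) is dropped.
                if out and out[-1] != "\n":
                    out.append(" ")
                pending = False
            out.append(ch)
    # A weak space still pending at the end of the output is dropped.
    return "".join(out)


print(repr(write_tokens([WEAK, "a", WEAK, WEAK, "b", WEAK, "\n", WEAK, "c", WEAK])))
# prints 'a b\nc'
```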

### Tuning layout

The default output is usually good, but some tuning can make it better. This section describes how whitespace is handled and gives some hints for making the resulting documents look as good as handcrafted ones.

#### Empty group after a command

If a command has no parameters and no options, the TeXML processor adds an empty group “{}” after the command name: “\smth{}”. Without the empty group, TeX would ignore the whitespace that follows the command. Sometimes that is exactly what you need; in that case, set the attribute “gr” (short for “group”) to “0”.

TeXML:
<cmd name="it"/> once, <cmd name="it" gr="0"/> twice
TeX:
\it{} once, \it twice

#### Automatic line breaks

It's difficult to work with documents that are one long line as a result of transformation, so the TeXML processor performs automatic line breaking.

• TeX commands for beginning and ending an environment are placed on separate lines.
• If a weak space appears far enough from the beginning of line then a new line is started.

By default, “far enough” is 62 characters. You can set another value with the command line parameter “-w” or “--width”. This setting is not strict: a line can be much longer than the specified width if it contains no spaces.
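
The second rule can be approximated with the Python sketch below. It is a simplification under stated assumptions: the input is a list of chunks separated by weak spaces, and a weak space becomes a new line once the current line has reached `width` characters. Whether the real processor compares with "at least" or "more than" the width is not stated here, so the comparison is a guess.

```python
def break_lines(chunks, width=62):
    """Join chunks with spaces, starting a new line after `width` characters."""
    lines = []
    current = ""
    for chunk in chunks:
        if not current:
            current = chunk
        elif len(current) >= width:
            # The weak space is far enough from the line start: break here.
            lines.append(current)
            current = chunk
        else:
            # Otherwise the weak space is written as an ordinary space.
            current += " " + chunk
    if current:
        lines.append(current)
    return "\n".join(lines)
```

Because breaks happen only at weak spaces, a chunk without spaces is never split, which is why a line can exceed the width.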

#### Whitespace around commands

Attributes nl1 and nl2 can be used to force a new line before (nl1) or after (nl2) a TeX command.

#### Whitespace around environments

The TeXML processor automatically creates new lines around the beginning and the end of an environment. You can change this behaviour using four attributes: nl1 (before the beginning), nl2 (after the beginning), nl3 (before the end) and nl4 (after the end).

#### Forced whitespace

You can affect whitespace output by using special categories of the element spec: nl, nl?, space and nil.

• nl stands for a new line.
• nl? is a conditional version of nl. A new line is created unless one has already been created.
• space stands for a space. You can use it to output several consecutive spaces or to create a space at the beginning or end of a line.
• nil stands for nothing. The only purpose of it is a side effect: whitespace around it is collapsed.
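
A small Python sketch of the four categories is given below. The function `apply_spec` is hypothetical and works on the output written so far. Treating empty output as already being at the start of a line is an assumption, and the whitespace-collapsing side effect of nil is not modelled.

```python
def apply_spec(output, cat):
    """Return the output after applying a whitespace category of spec."""
    if cat == "nl":
        return output + "\n"
    if cat == "nl?":
        # Assumption: empty output already counts as a fresh line.
        if output == "" or output.endswith("\n"):
            return output
        return output + "\n"
    if cat == "space":
        return output + " "
    if cat == "nil":
        # Writes nothing; its effect on surrounding whitespace is omitted here.
        return output
    raise ValueError("not a whitespace category: " + cat)
```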

### TeXML namespace

The TeXML namespace is http://getfo.sourceforge.net/texml/ns1.

<TeXML xmlns="http://getfo.sourceforge.net/texml/ns1">
...
</TeXML>

### ConTeXt support

In the ConTeXt mode, the element env creates ConTeXt environments.

TeXML:
<env name="document">
...
</env>
TeX:
\begindocument
...
\enddocument

To activate ConTeXt mode, give the command line option -c or --context to the TeXML processor.