This is a work-in-progress and will be periodically revised to reflect changes in SymPy. Last updated: 2014-6-21
One of the annoyances of entering mathematics on the computer is the
rigidity of the format the computer generally expects. In SymPy,
won’t accept any of the following, though a human would:
5 sin x
sin x^3 + y
ln sin x
SymPy’s implicit parsing aims to fix this. Implicit multiplication takes
care of statements like
2x, symbol splitting allows for
sin x, and function exponentiation allows
But first, we have a problem: some of these statements are ambiguous. For
sin x^3 + y be interpreted as
sin(x^3) + y,
sin(x)^3 + y? The last one, to a human, wouldn’t make much sense,
but the other two are perfectly valid. For SymPy, we’ve arbitrarily chosen
the first interpretation.
That’s not all. Let’s look at the last case more:
xyz. To a human, this
x*y*z, so naturally to handle this SymPy should simply split
the name into three. But we can’t split every token; consider these:
Splitting these names would not match user expectations at all.
So taking these considerations into account, how should we implement implicit parsing? We have a few options:
- Transform the input string (regular expression). This is what Sage does.
- Transform the token stream. This is what SymPy does.
- Use a custom parser.
(AST transforms aren’t an option because this syntax isn’t valid Python syntax.)
Let’s look at the SymPy parser. If you just want to read about implicit parsing, jump to the section about token transformations.
The SymPy Parser
Note: all data here is based on SymPy 0.7.3
Overall, SymPy follows these steps. You can follow along in
- Tokenize the input (which is not necessarily valid Python) using a modified Python 2 tokenizer.
- Run the tokens through a series of transformations.
- Untokenize the tokens, generating valid Python code.
- Evaluate the string.
Yes, we use
eval(). It’s not safe. (SymPy Live and SymPy Gamma deal with
this by ignoring the problem: they run on Google App Engine, which is
The SymPy Tokenizer
The tokenizer handles some syntax not in vanilla Python. In particular, we’d like to be able to parse expressions like these:
x!!( double factorial)
Note that the last example is valid Python—it’s equivalent to
(0.). But to make it easier to recognize and transform, we modified
the tokenizer to handle it as a special case:
# Standard library >>> tokenize.tokenize(StringIO('0.').readline) 1,0-1,2: NUMBER '0.' 1,2-1,3: OP '[' 1,3-1,6: NUMBER '123' 1,6-1,7: OP ']' 2,0-2,0: ENDMARKER '' # SymPy >>> sympy.parsing.sympy_tokenize.tokenize(StringIO('0.').readline) 1,0-1,7: NUMBER '0.' 2,0-2,0: ENDMARKER ''
!! are simply operators now:
>>> sympy.parsing.sympy_tokenize.tokenize(StringIO('x!!').readline) 1,0-1,1: NAME 'x' 1,1-1,3: OP '!!' 2,0-2,0: ENDMARKER '' >>> sympy.parsing.sympy_tokenize.tokenize(StringIO('x!').readline) 1,0-1,1: NAME 'x' 1,1-1,2: OP '!' 2,0-2,0: ENDMARKER ''
Unfortunately, since this is based on an older tokenizer, it doesn’t support Python 3 features, such as bytestrings, and it’ll incorrectly parse longs in Python 3, not to mention it doesn’t support Unicode.
Say you’re making a SymPy calculator/grapher/what-have-you. The user inputs this expression:
sin(x) + 3
and your app spits out this error message:
NameError: name 'x' is not defined
Not a very good calculator. SymPy handles this by transforming the token
stream; in this case, undefined variables are wrapped in
Symbol() calls to
turn them into SymPy symbols. By default, parsing uses these
- Undefined variables become
- Complex numbers become SymPy complex numbers (
3 + 4jbecomes
3 + 4*I)
- Repeating decimals become
- Floating-point constants become
- Integer constants become
- Factorial notation becomes calls to
SymPy also provides other useful transformations:
^, becomes exponentiation.
- Floats become rational numbers.
- Implicit parsing.
Note that SymPy’s transformations always produce valid Python code, since we’re not modifying the parser.
The API for this is a bit hidden. Here’s an example:
>>> from sympy.parsing.sympy_parser import parse_expr >>> parse_expr("1/2") 1/2 >>> type(_) <class 'sympy.core.numbers.Half'> >>> from sympy.parsing.sympy_parser import standard_transformations,\ ... implicit_multiplication_application, convert_xor >>> transformations = (standard_transformations + ... (implicit_multiplication_application,)) >>> parse_expr("2x", transformations=transformations) 2*x >>> transformations = (standard_transformations + (convert_xor,)) >>> parse_expr("x^3", transformations=transformations) x**3
How do the implicit parsing transformations work? They’re split into four transformations: symbol splitting, multiplication, application, and exponentiation, which must be run in that order, though not all need to be run.
Symbol splitting is quite simple: for each name in the token stream, the
transformation checks if it 1) is part of a
Symbol, so that it doesn’t
split function names 2) is not a Greek letter, and 3) does not contain an
underscore. If so, it makes one
Symbol for each letter in the original
Implicit multiplication scans the tokens two at a time, looking for one of these conditions:
- Closing parenthesis followed by opening parenthesis (
(x + 2)(x + 3))
- Closing parenthesis followed by variable name (
(x + 2) sin(x))
- Variable name followed by opening parenthesis (
pi (x + 3))
- Variable name followed by variable name (
pi E EulerGamma)
Application (and multiplication, though it’s not necessary) depends on an
intermediate step that groups function calls into one token—
represented with seven tokens normally
['sin', '(', 'Symbol', '(', "'x'", ')', ')']), is first grouped into one
AppliedFunction in the source). Then it follows these
- Keep track of how many parentheses we have to insert and how many tokens to skip.
- For each token,
- If it’s a function name not followed by an opening parenthesis, increment the counter and insert an opening parenthesis.
- Else, if the parenthesis counter is nonzero,
- If the next token is a multiplication or exponentiation, increment the skip counter.
- Else, if the skip counter is nonzero, decrement it.
- Else, insert a closing parenthesis after this token and decrement the counter.
sin x^2 gets parsed as
sin(x)^2. There’s some
more logic to make life easier for the function exponentiation
transformation as well.
To implement function exponentiation, think about how the token stream would
look at this point. All functions are applied, so for
sin**2 x, we would
have something like this:
['sin', '**', 'Integer', '(', '2', ')', '(', 'Symbol', '(', "'x'", ')', ')']
If you have implicit multiplication enabled, it’ll actually look like this (note the extraneous multiplication):
['sin', '**', 'Integer', '(', '2', ')', '*', '(', 'Symbol', '(', "'x'", ')', ')']
The transformation has to figure out what constitutes the exponent and what
constitutes the function call. The rule SymPy uses is, essentially, the
exponent is everything from and including the exponentiation operator to the
first closing parenthesis it sees (this rules out anything but simple
integer and symbolic exponents;
sin**(x+3)(x+5) won’t be parsed
correctly). It then parses the following tokens, discarding the extraneous
multiplication if it exists, to find the closing parenthesis of the function
call (this correctly handles nested parentheses), and moves the tokens for
the exponent to the end. So what the parser ends up seeing is
['sin', '(', 'Symbol', '(', "'x'", ')', ')', '**', 'Integer', '(', '2', ')']
which is equivalent to
One final note: SymPy uses a an evaluation trick for the final result. SymPy
Symbol.__call__ so that this works:
>>> spam = sympify('f(x)') f(x) >>> x = Symbol('x') >>> eggs = Function('f')(x) f(x) >>> spam == eggs True
However, it’s a point of contention whether this should be done at all.
So there you have it. SymPy’s parser.