SymEval

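An elaborate answer extraction and correctness judgement pipeline based on regular expressions and SymPy symbolic calculation, able to correctly process most mathematical objects (such as matrices/vectors, intervals, and symbols, in addition to numbers) as well as some special texts such as boolean expressions, dates, and times.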

Installation

For common users/developers, please just run the following command to install the package:

pip install "git+https://github.com/tongyx361/symeval.git"

Quick Start

from fastcore.test import test_eq  # assertion helper used in the examples below
from symeval import *

evaluator = (
    EvaluatorMathBatch()
)  # By default, the evaluator only extracts answers from boxed expressions.

symeval provides elaborate answer extraction and correctness judgement pipelines based on regular expressions and SymPy symbolic calculation, which are able to correctly process

  • most mathematical objects, such as matrices (vectors), intervals, and symbols, in addition to numbers,
  • as well as some special texts, such as boolean expressions, dates, and times.

EvaluatorMath implements an elaborate evaluation pipeline for mathematical reasoning tasks.

SymPy symbolic calculation carries a risk of extremely long evaluation times.

To address this, we implement EvaluatorMathBatch, which evaluates answers in batches with a per-evaluation timeout while remaining efficient.
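As a rough illustration of the timeout idea only, the sketch below bounds how long the caller waits for a single comparison. It is not symeval's implementation: run_with_timeout and slow_eq are hypothetical names, and a daemon thread is used here for brevity even though it cannot interrupt CPU-bound SymPy work (a process-based approach would be needed to actually terminate it).

```python
import threading
import time


def run_with_timeout(fn, args=(), timeout=5.0, default=None):
    """Run fn(*args) in a daemon thread; return `default` if it exceeds `timeout` seconds."""
    result = {}

    def worker():
        result["value"] = fn(*args)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # Timed out: the thread is abandoned, not killed.
        return default
    return result.get("value", default)


def slow_eq(a, b):
    # Hypothetical stand-in for an expensive symbolic simplification.
    time.sleep(1.0)
    return a == b


fast = run_with_timeout(lambda a, b: a == b, ("1/2", "1/2"), timeout=0.5)
slow = run_with_timeout(slow_eq, ("1/2", "1/2"), timeout=0.05, default=False)
print(fast, slow)  # True False
```

Treating a timed-out comparison as False is one possible policy; it keeps a single pathological expression from stalling the whole batch.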

test_eq(
    evaluator.batch_eq(ref_answers=["1/2", "1/2"], pred_answers=["0.5", "2/4"]),
    [True] * 2,
)
Judging: 100%|██████████| 2/2 [00:00<00:00, 19463.13it/s]

Here we provide a quick start guide. For more details, please refer to the API reference.



EvaluatorMathBatch

 EvaluatorMathBatch (ans_extract_mode:str='boxed',
                     include_percentage:bool=True, rel_tol:float=1e-09,
                     abs_tol:float=1e-08, percent_rel_tol:float=0.001,
                     ascii_only:bool=True, timeout:int=5, n_procs:int=2,
                     use_tqdm:bool=True)

Batch evaluator for math problems, capable of extracting the answer segment from complex responses and processing various mathematical objects (e.g. fractions, symbolic expressions, matrices, vectors) and special text (e.g. boolean values).

Parameter           Type   Default  Details
ans_extract_mode    str    boxed
include_percentage  bool   True     Whether to include percentage comparisons.
rel_tol             float  1e-09    The relative tolerance for numerical comparisons.
abs_tol             float  1e-08    The absolute tolerance for numerical comparisons. Necessary for precision issues.
percent_rel_tol     float  0.001    The relative tolerance for percentage comparisons.
ascii_only          bool   True     Whether to allow only ASCII characters.
timeout             int    5        The timeout for each evaluation.
n_procs             int    2
use_tqdm            bool   True
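To make the tolerance parameters more concrete, here is a minimal sketch of how such tolerances are commonly applied with the standard library's math.isclose. Whether symeval uses math.isclose internally, and the exact rule behind include_percentage, are assumptions; numbers_close and percent_close are hypothetical helpers.

```python
import math


def numbers_close(ref, pred, rel_tol=1e-09, abs_tol=1e-08):
    """Numeric comparison with the same default tolerances as listed above."""
    return math.isclose(ref, pred, rel_tol=rel_tol, abs_tol=abs_tol)


def percent_close(ref, pred, percent_rel_tol=0.001):
    """Assumed rule: pred also matches if it equals ref / 100 or ref * 100."""
    return any(
        math.isclose(pred, candidate, rel_tol=percent_rel_tol)
        for candidate in (ref / 100, ref * 100)
    )


print(numbers_close(0.1 + 0.2, 0.3))  # True, although 0.1 + 0.2 == 0.3 is False
print(numbers_close(1001.5, 1001))    # False
print(percent_close(3.04, 0.0304))    # True
```

The absolute tolerance matters mostly near zero, where a purely relative tolerance would reject values that differ only by floating-point noise.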

Accurately Extracting Answer Strings

EvaluatorMath can:

  1. extract short answers from long responses with reasonable accuracy,
  2. and normalize them into mathematical expressions.
evaluator = EvaluatorMathBatch(ans_extract_mode="speculate")
# MATH-style boxed answer
evaluator.extract_ans("Therefore, $1+1=\\boxed{2}$.")
'2'
# Answer around "answer"
evaluator.extract_ans(
    "Both $1$ and $11$ divide $11,$ so $\\boxed{11}=2$, and since $1,$ $2,$ $4,$ $5,$ $10,$ and $20$ divide $20,$ then $\\boxed{20}=6$. The inner expression, $\\boxed{11}\\times\\boxed{20}=2\\times6=12$. Finally, $\\boxed{12}=6$ because $1,$ $2,$ $3,$ $4,$ $6,$ and $12$ divide $12.$\n\nTherefore, $6$ is our answer. Please note that we have not boxed the correct answer as we normally do, as that would be especially confusing for this problem."
)
'False'
# Use the last number by default
evaluator.extract_ans(
    'First, we need to count the total number of letters in the word "CIRCLE". There are 6 letters.\n\nNext, we need to count the number of distinct letters. There are 6 distinct letters in the word "CIRCLE": C, I, R, L, E, and G.\n\nNow, let\'s consider the arrangements of the distinct letters. The number of ways to arrange n distinct items is n factorial (n!). So, we have 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720 ways to arrange the distinct letters.\n\nHowever, the word "CIRCLE" has one letter that repeats (the letter \'C\' repeats twice). We have over-counted the number of distinct arrangements by including arrangements that are just rotations of each other (for example, "CIRCLE" and "LCIRCE" are considered different arrangements here, but they are the same word when read).\n\nTo correct for this, we divide the total number of arrangements by the number of ways to arrange the repeated letters. The number of ways to arrange 2 identical items is 2! = 2 × 1 = 2. So, we divide the total number of arrangements by 2 to get the correct number of distinct arrangements.\n\nTherefore, the number of ways to arrange the letters of the word "CIRCLE" is 720 ÷ 2 = 360.'
)
# More cases ...
'False'
# Normalize fraction
evaluator.extract_ans("The answer is 1/2")
'Theansweris1/2'
# Normalize pmatrix
evaluator.extract_ans(
    "The answer is \\begin{pmatrix} 3 \\\\ \\frac{\\pi}{2} \\end{pmatrix}"
)
# More cases ...
'Theansweris\\begin{array}3\\\\frac{\\pi}{2}\\end{array}'

More test cases:

test_eq(evaluator.norm_ans_str("864 \\mbox{ inches}^2"), "864")
test_eq(evaluator.norm_ans_str("\\frac{270}7\\text{ degrees}"), "\\frac{270}7")
test_eq(evaluator.norm_ans_str(".0000672"), "0.0000672")
test_eq(
    evaluator.extract_ans(
        "The sum of the interior angles in any $n$-sided polygon is $180(n-2)$ degrees, so the angle measures in a polygon with 7 sides sum to $180(7-2) = 900$ degrees, which means that the desired polygon has more than 7 sides.  Meanwhile, the angle measures in a polygon with 8 sides sum to $180(8-2) = 1080$ degrees.  So, it's possible that the polygon has $\\boxed{8}$ sides, and that the last angle measures $10^\\circ$.\n\nTo see that this is the only possibility, note that the angle measures in a polygon with 9 sides sum to $180(9-2) = 1260$ degrees.  Therefore, if the polygon has more than 8 sides, then the last interior angle must measure at least $1260^\\circ - 1070^\\circ = 190^\\circ$.  But this is impossible because each interior angle of a convex polygon has measure less than $180^\\circ$."
    ),
    "8",
)
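For readers curious how a boxed answer can be pulled out of a long response, the following is a minimal sketch based on brace matching. It is an illustration under stated assumptions, not symeval's extraction code; extract_last_boxed is a hypothetical helper, and taking the last boxed expression is a choice made here for simplicity.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 0  # nesting depth of braces opened inside the box
    for i in range(content_start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return text[content_start:i]
            depth -= 1
    return None  # unbalanced braces


print(extract_last_boxed("Therefore, $1+1=\\boxed{2}$."))       # 2
print(extract_last_boxed("So $x=\\boxed{\\frac{1}{2}}$."))      # \frac{1}{2}
```

Counting braces rather than using a single regular expression keeps nested LaTeX such as \frac{1}{2} intact inside the box.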

Correctly Processing Various Mathematical Objects / Special Text

EvaluatorMath, based on regular expressions and SymPy symbolic calculation, is able to correctly process

  • most mathematical objects, such as matrices (vectors), intervals, and symbols, in addition to numbers,
  • as well as some special texts, such as boolean expressions, dates, and times.
evaluator.eq("x+y", "y+x") == True  # Expression
True
evaluator.eq("\\frac{1}{2}", "0.5") == True  # LaTeX
True
evaluator.eq(
    "\\begin{array}1\\\\2\\end{array}",
    "1,2",
)  # Matrix (Vector)
True
evaluator.eq("{1,2}", "{2,1}", compare_sets=True)  # Set
True
evaluator.eq("no", "false")  # Bool
# More mathematical objects and special texts ...
True

More test cases:

test_eq(evaluator.eq("251,7\\\\ \\noindent", "0"), False)
test_eq(evaluator.eq("3.54*10^{-7}", "3.54e-07"), True)
test_eq(evaluator.eq(r"\frac{1}{2}", "0.5"), True)
test_eq(evaluator.eq("1", "100"), False)
test_eq(evaluator.eq("100", "1"), False)
test_eq(evaluator.eq("3.04", "0.0304", False), True)
test_eq(evaluator.eq(["0.0304", 0.0304], "3.04"), True)
test_eq(evaluator.eq("x<-1", "x>3"), False)
test_eq(
    evaluator.eq("(-\\infty,0)\\cup(0,\\infty)", "(-\\infty,0)\\cup(0,\\infty)"),
    True,
)
test_eq(evaluator.eq("1+2,2+1", "2+1,1+2"), True)
test_eq(evaluator.eq("5", "5"), True)
test_eq(evaluator.eq("0.1 + 0.2", "0.3"), True)  # `0.1 + 0.2 == 0.3` is `False`
test_eq(evaluator.eq("x + y", "y + x"), True)
test_eq(evaluator.eq("C", "C"), True)
test_eq(evaluator.eq("1,234", "1234"), True)
test_eq(evaluator.eq("12,34", "(12,34)"), True)

test_eq(evaluator.eq("\\$ 5", "5"), True)
test_eq(evaluator.eq("3 * \\sqrt{13}", "3\\sqrt{13}"), True)
test_eq(evaluator.eq("\\pi/2", "\\frac{\\pi}{2}"), True)
test_eq(evaluator.eq("(3,\\pi/2)", "(3,\\frac{\\pi}{2})"), True)
test_eq(evaluator.eq("23000", "\\$23{,}000"), True)
test_eq(evaluator.eq(r"\left(1,2\right)", r"\left(2,1\right)", compare_sets=True), True)
test_eq(evaluator.eq("White", "white"), True)
test_eq(evaluator.eq("[0,3)", "[0,1]"), False)
test_eq(evaluator.eq("[0,1]", "[0,3)"), False)
test_eq(evaluator.eq("1001.5", "1001"), False)
test_eq(evaluator.eq("\\frac{2003}{2}", "1001"), False)
test_eq(evaluator.eq("-2,1", "1,-2", compare_sets=True), True)
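The difference between ordered and order-insensitive comparison (the compare_sets flag in the tests above) can be sketched without SymPy for the simple case of comma-separated rational numbers. This is a simplified stand-in, not the library's logic: parse_items and eq_items are hypothetical helpers, and exact rationals from the standard library replace symbolic evaluation.

```python
from fractions import Fraction


def parse_items(answer):
    """Split "a,b,..." into exact rationals; accepts forms like "1/2", "0.5", "-2"."""
    stripped = answer.strip().strip("(){}")  # drop one layer of outer delimiters
    return [Fraction(item.strip()) for item in stripped.split(",")]


def eq_items(ref, pred, compare_sets=False):
    """Compare element-wise, or ignore order when compare_sets is True."""
    ref_items, pred_items = parse_items(ref), parse_items(pred)
    if compare_sets:
        return sorted(ref_items) == sorted(pred_items)
    return ref_items == pred_items


print(eq_items("1/2", "0.5"))                         # True
print(eq_items("-2,1", "1,-2"))                       # False
print(eq_items("-2,1", "1,-2", compare_sets=True))    # True
```

Note that this sketch would misread a thousands separator such as "1,234" as two items, which is one of the cases a fuller pipeline has to handle separately.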

Normalized Majority Voting

maj_answers_list, norm_answers_list = evaluator.batch_get_maj_answers(
    [["", "", "1", "2", "2", "3", "3", "3"]]
)
print(f"{maj_answers_list = } <- {norm_answers_list = }")
Judging: 100%|██████████| 7/7 [00:00<00:00, 55188.21it/s]
maj_answers_list = [['', '', '1', '1', '2', '2', '2', '3']] <- norm_answers_list = [['', '', '1', '2', '2', '3', '3', '3']]
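Judging from the output above, each position appears to hold the majority answer among the samples seen so far, with empty answers ignored and ties going to the answer seen first. The sketch below reproduces that behaviour under this reading; it is an assumption drawn from the printed result, running_majority is a hypothetical helper, and exact string matching stands in for the library's equivalence-based grouping of answers.

```python
from collections import Counter


def running_majority(answers):
    """Majority answer after each prefix, ignoring empty strings."""
    counts = Counter()
    majorities = []
    for ans in answers:
        if ans != "":
            counts[ans] += 1
        # most_common orders equal counts by first occurrence.
        top = counts.most_common(1)
        majorities.append(top[0][0] if top else "")
    return majorities


print(running_majority(["", "", "1", "2", "2", "3", "3", "3"]))
# ['', '', '1', '1', '2', '2', '2', '3']
```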

Parsing LaTeX

Interval

from symeval import latex2sympy_interval
latex2sympy_interval("(-11,-10)\\cup\\{-\\sqrt{110}\\}")

\(\displaystyle \left(-11, -10\right)\)

latex2sympy_interval("(-\\infty, 0) \\cup (0, \\infty)")

\(\displaystyle \left(-\infty, 0\right) \cup \left(0, \infty\right)\)

latex2sympy_interval("(a+b,b]")

\(\displaystyle \left(a + b, b\right]\)

Matrix / Vector

from symeval import EvaluatorMathBatch

evaluator = EvaluatorMathBatch()
evaluator.latex2matrix(r"\sqrt{400\cos^2(9\pi/44)},\frac{\pi}{4}")

\(\displaystyle \left[\begin{matrix}\sqrt{400 \cos^{2}{\left(\frac{9 \pi}{44} \right)}} & \frac{\pi}{4}\end{matrix}\right]\)

evaluator.latex2matrix(
    r"\begin{pmatrix} \frac{1}{2} & 0 & -\frac{\sqrt{3}}{2} \\ 0 & 1 & 0 \\ \frac{\sqrt{3}}{2} & 0 & \frac{1}{2} \end{pmatrix}"
)

\(\displaystyle \left[\begin{matrix}\frac{1}{2} & 0 & - \frac{\sqrt{3}}{2}\\0 & 1 & 0\\\frac{\sqrt{3}}{2} & 0 & \frac{1}{2}\end{matrix}\right]\)

test_eq(
    evaluator.latex2matrix("\\begin{pmatrix}-18\\\\-49\\\\96\\end{pmatrix}"),
    Matrix([[-18, -49, 96]]),
)
test_eq(
    evaluator.latex2matrix("\\begin{pmatrix} 2 & 3 \\\\ 0 & -2 \\end{pmatrix}"),
    Matrix([[2, 3], [0, -2]]),
)
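To show the general shape of pmatrix parsing, here is a minimal sketch that splits a pmatrix into rows and cells of LaTeX strings. It is not the library's latex2matrix: pmatrix_to_rows is a hypothetical helper, it leaves each cell unparsed, and it keeps the row structure as written, whereas the first test above shows the library returning a single-row matrix for a column vector.

```python
import re


def pmatrix_to_rows(latex):
    """Split a LaTeX pmatrix into a list of rows, each a list of cell strings."""
    # Drop the \begin{pmatrix} and \end{pmatrix} wrappers.
    body = re.sub(r"\\(begin|end)\{pmatrix\}", "", latex)
    # Rows are separated by a double backslash, cells by an ampersand.
    rows = [row for row in body.split("\\\\") if row.strip()]
    return [[cell.strip() for cell in row.split("&")] for row in rows]


print(pmatrix_to_rows("\\begin{pmatrix} 2 & 3 \\\\ 0 & -2 \\end{pmatrix}"))
# [['2', '3'], ['0', '-2']]
print(pmatrix_to_rows("\\begin{pmatrix}-18\\\\-49\\\\96\\end{pmatrix}"))
# [['-18'], ['-49'], ['96']]
```

Each cell string could then be passed to a LaTeX expression parser before building a matrix object.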

Normalization

test_eq(evaluator.norm_math_str("251,7\\\\ \\noindent"), "251,7")
test_eq(fix_a_slash_b("(3/4)\\sqrt{3}"), "(\\frac{3}{4})\\sqrt{3}")
test_eq(evaluator.norm_pm("x\\pmy"), "x-y,x+y")
test_eq(evaluator.norm_pm("a\\mpb"), "a-b,a+b")
test_eq(evaluator.norm_pm("1\\pm\\sqrt{19}"), "1-\\sqrt{19},1+\\sqrt{19}")
test_eq(evaluator.norm_pm(r"\{1\pm\sqrt{5},-2\}"), "1-\\sqrt{5},1+\\sqrt{5},-2")
test_eq(
    evaluator.norm_pm("\\(\\frac{1\\pm\\sqrt{17}}{4}\\)"),
    "\\frac{1-\\sqrt{17}}{4},\\frac{1+\\sqrt{17}}{4}",
)
test_eq(
    evaluator.norm_pm(r"\frac{1\pm\sqrt{1-\frac{2}{\sqrt{3}}}}{1}"),
    "\\frac{1-\\sqrt{1-\\frac{2}{\\sqrt{3}}}}{1},\\frac{1+\\sqrt{1-\\frac{2}{\\sqrt{3}}}}{1}",
)
test_eq(norm_deg(r"20^\circ"), r"20")
test_eq(norm_deg(r"\sin 20^\circ"), r"\sin {20*\frac{\pi}{180}}")
test_eq(evaluator.norm_basic_fn(r"sinx"), r"\sin^{1}x")
test_eq(evaluator.norm_basic_fn(r"\sin^2x"), r"\sin^{2}x")

Processing Sets

test_eq(evaluator.extract_set("{2,1}"), ["1", "2"])
test_eq(is_set("{2,1}"), True)
test_eq(is_set("orange"), False)
test_eq(is_set("x<-1orx>3"), True)
test_eq(is_set("(3/4)sqrt(3)"), False)

Manipulating Strings

test_eq(evaluator.remove_first_paren_pair("{white}", "{"), "white")

Contribution Guidelines

Setup

For prospective contributors, we recommend installing the package with the dev extras and setting up the pre-commit hooks by running:

git clone https://github.com/tongyx361/symeval.git
cd symeval
pip install ".[dev]"
pre-commit install
conda install quarto # For nbdev

File Structure

symeval
├── utils # Repository utilities
├── symeval # Package code for common utilities
└── nbs # Notebooks and other files to run tests and generate documentation with https://nbdev.fast.ai

Checklist Before Commit

Run prepare-commit.sh to clean the notebooks, export scripts for pipeline notebooks, generate documentation, run tests, and render the README if needed:

bash utils/prepare-commit.sh