Execution-Based Evaluation for Open-Domain Code Generation

1Language Technologies Institute, Carnegie Mellon University
2Inspired Cognition

ODEX is an open-domain 📖, multilingual 🌍, execution-based 🛠
natural language to code generation 💻 data benchmark


To extend the scope of coding queries to more realistic settings, we propose ODEX, the first Open-Domain EXecution-based natural language (NL) to code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. Our NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries. Moreover, ODEX supports intents in four natural languages: English, español (Spanish), 日本語 (Japanese), and Русский (Russian).
ODEX unveils intriguing behavioral differences between top-performing Code LMs: Codex performs better on open-domain queries, yet CodeGen strikes a better balance between open- and closed-domain queries. ODEX corroborates the merits of execution-based evaluation over metrics without execution, but also shows that the two kinds of metrics complement each other. Powerful models such as CodeGen-6B achieve only an 11.96 top-1 pass rate, suggesting plenty of headroom for improvement.
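
As a rough illustration of what a top-1 pass rate measures, the sketch below computes pass@k with the unbiased estimator commonly used for execution-based code benchmarks. Whether the numbers reported here were computed in exactly this way is an assumption; the function name `pass_at_k` and its parameters are hypothetical and not part of ODEX.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of code samples generated for the problem
    c: number of those samples that pass every test case
    k: number of attempts the metric allows (k=1 gives the top-1 pass rate)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples of which 3 pass, pass@1 is the plain fraction 3/10.
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```

A benchmark-level score would then be the mean of this value over all problems, usually reported as a percentage.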

Coverage of Open-domain Libraries

ODEX (left) features a wide coverage of 79 open-domain libraries. Other existing benchmarks (right) either focus on built-in grammars or are limited to a small set of libraries in the data science domain.

Results Summary

Plenty of improvement headroom even with state-of-the-art Code LMs.

Open Domain vs. Closed Domain

Even the best model (Codex, code-davinci-002) shows significant gaps in multiple languages between open- and closed-domain coding problems.


Examples in all four natural languages from the ODEX dataset are shown below. ODEX is also available on the Hugging Face 🤗 Dataset Hub.


English

Replace all occurrences of a string `\n` by string `<br>` in a pandas data frame `df`
# Library import
import pandas as pd
# Function-wrapped code solution
def function(df):
	return df.replace({'\n': '<br>'}, regex=True)
# Test case
df = pd.DataFrame(['klm\npqr', 'wxy\njkl'], columns = ['val'])
expected = pd.DataFrame(['klm<br>pqr', 'wxy<br>jkl'], columns = ['val'])
assert pd.DataFrame.equals(function(df), expected)

español (Spanish)

¿Cómo escoger cuatro valores aleatorios `L1`, `L2`, `L3`, `L4` de una cadena `S` sin repetir? (How to pick four random values `L1`, `L2`, `L3`, `L4` from a string `S` without repeating?)
# Library import
import random
# Function-wrapped code solution
def function(S):
	L1, L2, L3, L4 = random.sample(S, 4)
	return L1, L2, L3, L4
# Test case
L1, L2, L3, L4 = function([1,2,3,4,5,6,7,8])
s = set([L1,L2,L3,L4])
assert len(s) == 4
assert s < set([i+1 for i in range(8)])

日本語 (Japanese)

(Read the return value when submitting from the browser object `br`)
# Library import
import mechanize
import urllib.request
from unittest.mock import Mock
# Function-wrapped code solution
def function(br):
	return br.submit().read()
# Test case
br = mechanize.Browser()
x = urllib.request.urlopen('https://www.wikipedia.org')
br.submit = Mock(return_value = x)
assert b'Wikipedia' in function(br)

Русский (Russian)

Проверить есть ли числа в строке `s` (Check if there are numbers in string `s`)
# Library import
# Function-wrapped code solution
def function(s):
	return any(map(str.isdigit, s))
# Test case
assert function('124dhe5') == True
assert function('absbf') == False
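
Each example above has three parts: a library import, a function-wrapped code solution, and a test case. A minimal sketch of how such parts could be joined and executed to decide pass or fail is shown here. This is not the official ODEX evaluation harness; `run_example` and its parameters are hypothetical names, and a real harness would also need sandboxing and timeouts before running model-generated code.

```python
def run_example(library_import: str, solution: str, test_case: str) -> bool:
    """Return True if the solution passes its test case when executed."""
    # Join the three parts into one program, in the order shown in the examples.
    program = "\n".join([library_import, solution, test_case])
    namespace = {}  # fresh globals so separate runs do not share state
    try:
        exec(program, namespace)
    except Exception:
        # A failed assertion or any runtime error counts as a failed test.
        return False
    return True


# The Russian example: check whether string `s` contains digits.
test_case = (
    "assert function('124dhe5') == True\n"
    "assert function('absbf') == False"
)
reference = "def function(s):\n\treturn any(map(str.isdigit, s))"
# A plausible but wrong candidate: True only if every character is a digit.
wrong = "def function(s):\n\treturn s.isdigit()"

print(run_example("", reference, test_case))  # True
print(run_example("", wrong, test_case))      # False
```

The wrong candidate looks close to the reference on the surface, yet fails on execution, which is the kind of difference that execution-based evaluation is meant to catch.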


    url = {https://arxiv.org/abs/2212.10481},
    author = {Zhiruo Wang and Shuyan Zhou and Daniel Fried and Graham Neubig},
    title = {Execution-Based Evaluation for Open-Domain Code Generation},
    publisher = {arXiv},
    year = {2022}