Software Engineering and Software Systems Researcher

Jon Bell


About Jon

Jon is a PhD candidate at Columbia University researching Software Engineering and Software Systems with Prof. Gail Kaiser. His research makes it easier for developers to create reliable software. Jon’s recent work on accelerating software testing was recognized with an ACM SIGSOFT Distinguished Paper Award (ICSE ’14 – Unit Test Virtualization with VMVM) and has been the basis for an industrial collaboration with Electric Cloud. He publishes at venues such as ICSE, FSE, ISSTA, OOPSLA, OSDI and EuroSys. Jon actively participates in the artifact evaluation program committees of ISSTA and OOPSLA, and has served for several years as the Student Volunteer chair for OOPSLA. His most recent publications are Efficient Dependency Detection for Safe Java Test Acceleration (FSE ’15), Pebbles: Fine-Grained Data Management Abstractions for Modern Operating Systems (OSDI ’14) and Phosphor: Illuminating Dynamic Data Flow in Off-the-Shelf JVMs (OOPSLA ’14).

A very long time ago, he co-founded RentPost and interned at Codestreet, Sikorsky Aircraft and MediaMerx.

His other interests include photography and cycling.

Jon is currently on the academic job market. You can download his Research Statement, Teaching Statement, Curriculum Vitae, and Publications List directly from his site.


Research Overview

My research interests are in software engineering and software systems, focusing on approaches and tools that make it easier for developers to create reliable software. As software becomes more pervasive, I believe it is increasingly important that everyday developers, not just specialists, have this kind of support.

In my thesis work, I studied millions of lines of code in thousands of repositories (both open source and proprietary), finding that the core cause of slow and unreliable software builds is that the test suites in these builds are themselves slow and flaky. By making software testing faster and more reliable, I make it easier to run tests more often, helping developers catch bugs sooner. By combining this empirical software engineering analysis with systems design, I ensure that my work is directly applicable to developers in the real world. In the future, my vision is that developers will be actively supported by new tools and approaches to efficiently reproduce failures in deployed software, debug why these failures occur, and create new test cases to ensure that these bugs do not recur.

My most recent research contributions lead directly to faster and more reliable builds (recognized with a Distinguished Paper Award at ICSE [VMVM], and also published in IEEE Software [VMVMVM] and at FSE [ElectricTest]). I have also investigated new approaches for dynamic data flow analysis (at OOPSLA [Phosphor] and OSDI [Pebbles]), for lightweight record and replay techniques for Java (at ICSE [Chronicler]), and for web application architectures (at EuroSys [Synapse]). I have also made contributions to SE education (at GAS, SSE and CSEE&T).

I am on the academic job market. You can download my Research Statement, Teaching Statement, Curriculum Vitae, and Publications List directly from this site.

You can read more about specific research projects below:

Slow builds remain a plague for software developers. The frequency with which code can be built (compiled, tested and packaged) directly impacts the productivity of developers: longer build times mean a longer wait before determining if a change to the application being built was successful. We have discovered that in some languages, such as Java, the majority of build time is spent running tests, and that dependencies between individual tests are complicated to discover, making many existing test acceleration techniques unsound to deploy in practice. Without knowledge of which tests depend on others, we cannot safely parallelize test execution, nor can we perform incremental testing (i.e., execute only a subset of an application’s tests for each build). Previous techniques for detecting these dependencies did not scale to large test suites: given a test suite that normally ran in two hours, even the best case for the previous tool would have taken over 422 CPU days to find dependencies between all test methods (and would not soundly find all dependencies), while an exhaustive search for all dependencies on the same project would have taken over 10^300 years. We present a novel approach to detecting all dependencies between test cases in large projects that enables safe exploitation of parallelism and test selection with a modest analysis cost. Read more about our tool, ElectricTest, in our ESEC/FSE 2015 paper.
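To make the problem concrete, here is a minimal sketch (hypothetical JUnit 4 tests of my own, not taken from ElectricTest) of the kind of hidden dependency at stake: the second test only passes because the first has already written shared static state. This write-then-read data dependency is exactly what must be detected before tests can be safely parallelized or selectively skipped.

```java
// Hypothetical JUnit 4 example of a hidden test dependency.
// Requires JUnit 4 on the classpath.
import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;

import org.junit.FixMethodOrder;
import org.junit.Test;
import org.junit.runners.MethodSorters;

@FixMethodOrder(MethodSorters.NAME_ASCENDING) // forces testA before testB
public class CacheTest {
    // Shared static state that silently links the two tests together.
    static final Map<String, String> CACHE = new HashMap<>();

    @Test
    public void testA_populatesCache() {
        CACHE.put("config", "loaded");
        assertEquals("loaded", CACHE.get("config"));
    }

    @Test
    public void testB_readsCache() {
        // Passes when run after testA, but fails if run alone, selected
        // individually, or scheduled on a different parallel worker:
        // a write-then-read dependency on CACHE makes testB depend on testA.
        assertEquals("loaded", CACHE.get("config"));
    }
}
```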

Testing large software packages can become very time intensive. To address this problem, researchers have investigated techniques such as Test Suite Minimization, which reduces the number of tests in a suite by removing tests that appear redundant, at the risk of reduced fault-finding ability, since it can be difficult to identify which tests are truly redundant. We take a completely different approach to the same problem of long-running test suites: we instead reduce the time needed to execute each test, an approach that we call Unit Test Virtualization. With Unit Test Virtualization, we reduce the overhead of isolating each unit test with a lightweight virtualization container. We describe the empirical analysis that grounds our approach and provide an implementation of Unit Test Virtualization targeting Java applications. We evaluated our implementation, VMVM, using 20 real-world Java applications and found that it reduces test suite execution time by up to 97% (on average, 62%) when compared to traditional unit test execution. We also compared VMVM to a well-known Test Suite Minimization technique, finding the reduction provided by VMVM to be four times greater, while still executing every test with no loss of fault-finding ability. For more information about VMVM, please see our paper and GitHub project.
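As a rough illustration of the idea (and only that: VMVM’s actual implementation rewrites bytecode to re-initialize classes lazily, rather than using reflection like this sketch), one could reset the static state a test polluted instead of paying for a fresh JVM per test class:

```java
// Illustrative sketch of in-process static-state isolation, the idea
// behind Unit Test Virtualization. Not VMVM's implementation.
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class StaticResetter {
    /**
     * Nulls out every non-final static reference field of a class, so the
     * next test sees it freshly re-initialized. A real system must also
     * arrange for static initializers to re-run; VMVM does this lazily
     * via bytecode rewriting.
     */
    public static void resetStatics(Class<?> clazz) throws IllegalAccessException {
        for (Field f : clazz.getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) && !Modifier.isFinal(mods)
                    && !f.getType().isPrimitive()) {
                f.setAccessible(true);
                f.set(null, null); // drop the stale state between tests
            }
        }
    }
}
```

The point of the design is that tearing down and re-creating only the polluted static state costs milliseconds, whereas forking a new JVM per test class (the conventional way to get the same isolation) costs seconds.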

Support for fine-grained data management has all but disappeared from modern operating systems such as Android and iOS. Instead, we must rely on each individual application to manage our data properly – e.g., to delete our emails, documents, and photos in full upon request; to not collect more data than required for its function; and to back up our data to reliable backends. Yet, research studies and media articles constantly remind us of the poor data management practices of our applications. We have developed Pebbles, a fine-grained data management system that enables management at a powerful new level of abstraction: application-level data objects, such as emails, documents, notes, notebooks, and bank accounts. The key contribution is Pebbles’ ability to discover such high-level objects in arbitrary applications without requiring any input from or modifications to these applications. Intuitively, it seems impossible for an OS-level service to understand object structures in unmodified applications; however, we observe that the high-level storage abstractions embedded in modern OSes – relational databases and object-relational mappers – bear significant structural information that makes object recognition possible and accurate. Our OSDI 2014 paper describes Pebbles in more detail.
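To see why the schema alone carries so much structure, consider this minimal sketch (my own illustration, not Pebbles code; "notes.db" is a hypothetical app database, and the SQLite JDBC driver is assumed to be on the classpath): standard JDBC metadata already exposes the foreign-key edges along which rows compose into application-level objects.

```java
// Illustrative sketch: recovering object structure from relational schema
// metadata, the observation underlying Pebbles. Assumes the sqlite-jdbc
// driver and a hypothetical app database "notes.db".
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ObjectGraphSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:notes.db")) {
            DatabaseMetaData md = c.getMetaData();
            try (ResultSet tables = md.getTables(null, null, "%", new String[]{"TABLE"})) {
                while (tables.next()) {
                    String table = tables.getString("TABLE_NAME");
                    // Each imported key is an edge in the object graph: rows
                    // of `table` belong to rows of a parent table, e.g. an
                    // attachments row belongs to a notes row.
                    try (ResultSet fks = md.getImportedKeys(null, null, table)) {
                        while (fks.next()) {
                            System.out.printf("%s.%s -> %s.%s%n",
                                    table, fks.getString("FKCOLUMN_NAME"),
                                    fks.getString("PKTABLE_NAME"),
                                    fks.getString("PKCOLUMN_NAME"));
                        }
                    }
                }
            }
        }
    }
}
```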

Dynamic taint analysis is a well-known information flow analysis technique with many possible applications. Taint tracking enables analysis of application data flow by assigning labels to data and propagating those labels as the data flows through the program. Taint tracking systems traditionally compromise among performance, precision, soundness, and portability. Performance can be critical, as these systems are often intended to be deployed to production environments, and hence must have low overhead. To be deployed in security-conscious settings, taint tracking must also be sound and precise. Dynamic taint tracking must be portable in order to be easily deployed and adopted for real-world purposes, without requiring recompilation of the operating system or language interpreter, and without requiring access to application source code.

We present Phosphor, a dynamic taint tracking system for the Java Virtual Machine (JVM) that simultaneously achieves our goals of performance, soundness, precision, and portability. Moreover, to our knowledge, it is the first portable general-purpose taint tracking system for the JVM. We evaluated Phosphor’s performance on two commonly used JVM languages (Java and Scala), on two successive revisions of two commonly used JVMs (Oracle’s HotSpot and OpenJDK’s IcedTea), and on Android’s Dalvik Virtual Machine, finding its runtime overhead to be as low as 3% (53% on average; 220% at worst) on the DaCapo macro benchmark suite. Our OOPSLA 2014 paper describes our approach to achieving portable taint tracking in the JVM, which is released under an MIT license via GitHub.
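The core mechanism is easy to state in plain Java, as in this self-contained sketch (illustrative only: Phosphor provides these semantics transparently for unmodified code by instrumenting bytecode, not through a wrapper class like this). Every value carries a shadow set of labels, and every operation merges the labels of its operands.

```java
// Illustrative sketch of taint propagation: a value plus its shadow labels.
// Phosphor achieves the same effect via bytecode instrumentation.
import java.util.HashSet;
import java.util.Set;

public final class TaintedInt {
    final int value;
    final Set<String> labels; // the taint marks attached to this value

    TaintedInt(int value, Set<String> labels) {
        this.value = value;
        this.labels = labels;
    }

    /** A taint source: attach an initial label to a value. */
    static TaintedInt source(int value, String label) {
        Set<String> l = new HashSet<>();
        l.add(label);
        return new TaintedInt(value, l);
    }

    /**
     * Data flow: the result of an operation carries the union of the
     * operands' labels, so taint propagates through computation.
     */
    TaintedInt add(TaintedInt other) {
        Set<String> merged = new HashSet<>(labels);
        merged.addAll(other.labels);
        return new TaintedInt(value + other.value, merged);
    }

    public static void main(String[] args) {
        TaintedInt secret = source(42, "user-pin");
        TaintedInt derived = secret.add(source(7, "constant"));
        // A sink can now check whether sensitive data reached it:
        System.out.println(derived.value + " labels=" + derived.labels);
    }
}
```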

As software continues to grow in complexity, bugs remaining after deployment have become increasingly challenging to resolve. Resolving such bugs begins with reproduction, which is time consuming and difficult when the “root cause” involves nondeterministic factors such as timing and thread interleaving, which are especially insidious on multicore hardware. Because conventional bug reports are inadequate for reproducing tricky bugs, deterministic record-replay systems have been developed to capture bugs in the field and replay them in the lab, but most previous approaches require special hardware or add very high overhead. More concerning, these approaches typically transmit too much information back to developers, including sensitive information, in bug reports. We seek to create systems that efficiently reproduce field failures while maintaining user privacy. For information on early efforts in this direction, please see Chronicler.
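As a small sketch of the record-replay idea (hand-written here for illustration; Chronicler itself instruments bytecode to interpose on such sources of nondeterminism automatically), a nondeterministic API call can be logged during a recorded run and fed back verbatim during replay:

```java
// Illustrative sketch of lightweight record-replay at an API boundary.
// Not Chronicler's implementation, which works via bytecode instrumentation.
import java.util.ArrayDeque;
import java.util.Deque;

public class ReplayClock {
    public enum Mode { RECORD, REPLAY }

    private final Mode mode;
    private final Deque<Long> log; // recorded return values, in call order

    public ReplayClock(Mode mode, Deque<Long> log) {
        this.mode = mode;
        this.log = log;
    }

    /**
     * Wraps the nondeterministic call; call sites use this instead of
     * calling System.currentTimeMillis() directly.
     */
    public long currentTimeMillis() {
        if (mode == Mode.REPLAY) {
            return log.removeFirst(); // feed back the recorded value
        }
        long now = System.currentTimeMillis();
        log.addLast(now);             // record for later replay
        return now;
    }

    public static void main(String[] args) {
        Deque<Long> log = new ArrayDeque<>();
        long recorded = new ReplayClock(Mode.RECORD, log).currentTimeMillis();
        long replayed = new ReplayClock(Mode.REPLAY, log).currentTimeMillis();
        System.out.println("deterministic replay: " + (recorded == replayed));
    }
}
```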
