I am a final year Ph.D. student in the Programming Language Group at Penn, advised by Prof. Mayur Naik. I received my M.S. in computer science from Vanderbilt University in 2014, before which I obtained my B.E. (with Honors) from Nankai Unversity in 2011. I spent the summer of 2019 as a research scientist intern at DeepMind working with Pushmeet Kohli in the Deep Learning team.
My research interests span programming languages, software engineering, security, and artificial intelligence.
My thesis research focuses on developing deep learning and reinforcement learning techniques for challenging problems in program reasoning that span verification, synthesis, and testing.
I am also interested in using numerical, statistical, and probabilistic methods to analyze and synthesize programs.
I enjoy building usable artifacts that validate my research ideas. All tools and datasets I have developed are open-source and available here.
My research to date has focused on improving software quality by advancing the state-of-the-art in static analysis, verification, synthesis, and testing using AI-based techniques ranging from symbolic reasoning, constraint solving, statistical, and probabilistic models, to recent approaches for deep learning and reinforcement learning. At the same time, these program reasoning challenges serve as a rich resource of inspirations for me to advance AI techniques, e.g. scalable inference, logical rule learnging, representation learning, and reinforcement learning with extremely sparse rewards.
Real-world programs are large and complex, which involves many third party libraries that may lack specification or even source code, and new features or patches are being introduced regularly. Plus well-known limitations such as undecidability, static analysis tools are forced to make approximations, which in turn result in large numbers of false alarms. My research combines statistical and logical methods to address these complexities and uncertainties.
Automatically learning API specifications from "big code". [USENIX Security'16]
This work uses statistical methods to infer API specifications from complex software systems like Linux kenerl, OpenSSL library, PHP and Python. The resulting open-source prototype APISan has found 76 previously unknown API misuse bugs in these systems. APISan does not need access to API's source code or binary; instead, it observes how and in which context APIs are used.
Improving analysis accuracy by learning from user feedback.
A sound static analysis produces false alarms that are unavoidable by the analysis designer. The analysis user, who usually implements the code, has much more information than the analysis designer. Incorporating feedback from the users opens up a new dimension to improve the analysis. However, user can occasionally make mistakes. The insight is to reduce rule-based static analysis into probabilistic models, such as Markov logic network and Bayesian network, so that even noisy feedback can improve analysis accuracy significantly.
Inference and learning are two key components of machine learning applications. After reducing static analysis into probabilistic models, user feedback is incorporated by performing Maximum A Posteriori probability estimate, or MAP inference. Off-the-shelf inference engines cannot scale to static analysis applications. My research improves inference by a systematic constraint solving process. Furthermore, inference helps to learn rule weights but not rules that usually designed by experts. To learn logical rules from data, my research proposes a scalable logic program synthesis framework, which not only improves static analysis but also applies to other domains like big-data analytics and software defined networks.
Scalable inference via systematic constraint solving.
MAP inference of Markov logic network is essentially a MaxSAT problem, which can be solved in two phases: first, ground logical rules into a set of weighted constraints; then solve weighted constraints by a MaxSAT solver. However, naively doing so will be extremely inefficient. Both phases can be optimized. From the perspective of application, constraints do not have to be solved entirely, which allows us to have a combined top-down (exploits laziness) and bottom-up (exploits eagerness) grounding. From the perspective of the underlying solver, which is invoked many times instead of only once, so that incremental solving is feasible and efficient.
Synthesizing rich declarative logic programs.
Declarative logic programs, in particular Datalog programs, have witnessed promising applications in various domains including static analysis due to its rich expressivity, and also for the exact same reason, its synthesis is a fairly challenging task. This work proposes a systematic template augmentation technique, which enables synthesis of arbitrary rules but without significantly enlarging the search space. Furthermore, we synergistically combines a bi-directional search over version space with active learning so that efficient search is achieved and the number of required examples is minimized.
Scalable synthesis via numerical relaxation.
Inspired by the fact that numerical relaxation technique is widely used and remarkably successful in optimization problems, this work proposes a numerical relaxation technique for logic programs, which is fundamentally different from traditional search based approaches. This work demonstrates that gradient-based and stochastic approaches can be used to learn logic programs very efficiently.
Deep learning sparks a remarkable revolution in many challenging fields of science and engineering. Deep learning combined with reinforcement learning significantly outperforms human experts on video and board games by directly learning from raw pixels and self-play. However, can this technique help to solve programming challenges? My thesis work aims to answer this very question, which involves addressing two fundamental AI challenges: representation learning of structured data with rich semantics, and the interplay between neural networks and symbolic reasoning.
- Program verification by deep reinforcement learning. [NeurIPS'18]
This work concerns attacking the fundamental challenge of automated software verification, which is finding a strong enough loop invariant for a given loop. We develop an end-to-end learning framework, Code2Inv, which takes a piece of code as input, then automatically learns and improves a neural policy by interacting with a checker, and finally produces an invariant so that the verification can be done successfully. To achive so, we prose a novel representation learning technique for programs so that the semantics (rather than syntax) can be captured by neural networks, and two reward mechansims so that reinforcement learning could make progress from extremely sparse reward.
- Meta-learning on synthesis. [ICLR'19]
This work investigates the capability of deep learning and reinforcement learning in a much broader domain, namely syntax guided program synthesis (SyGuS), where each synthesis task has its own domain specific language, making manually designing heuristics for each task prohibitive. This also makes learning across different tasks extremely difficult as generalizing across different languages (i.e. learn from one language and then test on a different language) is near the limit of machine learning and has been rarely explored so far. Our attempt in this meta-learning direction and demonstrates quite promising results — the learned neural representation as well as policy can be transferred to (i.e. help to accelerate) synthesis tasks that use similar but different grammars.
- Learning-aided reasoning.
Though end-to-end learning has significant potential as shown by my recent research, it is impractical to replace classic approaches with end-to-end learning for various reasons including interpretability, training difficulty, and perhaps more importantly data inefficiency. Instead, deep learning should be integrated with classic approaches so that there is no need to learn everything from scratch. I envision that the future program reasoning techniques and the underlying constraint solving techniques (e.g. SAT, SMT) will be equipped with a learning component, which can be automatically and continuously improved. One of my ongoing work is introducing learning-based techniques into the state-of-the-art software verification and testing tool chain such as SeaHorn and KLEE.
Continuously Reasoning about Programs via Differential Bayesian Inference.
Kihong Heo, Mukund Raghothaman, Xujie Si, Mayur Naik
PLDI 2019 [paper, slides, artifact, video abstract ]
ACM SIGPLAN Distinguished Paper Award
Difflog: Beyond Deductive Methods in Program Analysis
Mukund Raghothaman, Sulekha Kulkarni, Richard Zhang, Xujie Si, Kihong Heo, Woosuk Lee, Mayur Naik
Machine Learning for Programming Workshop 2018 [paper]
Combining the Logical and the Probabilistic in Program Analysis
Xin Zhang, Xujie Si, Mayur Naik
ACM SIGPLAN Workshop on Machine Learning and Programming Languages 2017
APISan: Sanitizing API Usages through Semantic Cross-checking
Insu Yun, Changwoo Min, Xujie Si, Yeongjin Jang, Taesoo Kim, Mayur Naik
USENIX Security 2016 [paper, slides, code]
CSAW Best Applied Research Paper Finalist
ICFP'19, CAV'19, CAV'20, Artifact Evaluation Committee
PLDI'18, '19, Student Volunteer Co-Chair
POPL'16, Student Volunteer