The prospects of software intelligence for small and medium-sized enterprises

Data is generated by every aspect of our day-to-day business life. Regardless of whether the data is created explicitly through our direct actions or implicitly by the software tools we use, some of it is put to use within companies. Many companies use business intelligence (BI) to calculate metrics and trends from data that is updated daily, which yields new information to support the decision-making process. BI collects and analyzes the data, often presenting it in the form of dashboards or other types of reports. In a similar fashion, software engineering data can be used to gain insights into development activities and to support them. The term software intelligence (SI) thus describes a set of methods and techniques for collecting raw data from the software development cycle and converting it into useful information in support of software development decision-making, quality management and resource management.

SI encompasses not only the derivation of simple metrics for decision-making (file sizes, degree of complexity), but also more sophisticated techniques that require prior consolidation and processing of the data. In light of the constantly growing volumes of historical data, machine learning can help to acquire information and improve the software development life cycle. Many researchers are developing new techniques in order to apply machine learning to software development. ML4SE (machine learning for software engineering) is one of the trends in SI that shows new ways of supporting developers in their daily activities by using machine learning models.

What are typical SI applications?

The goal of software intelligence is to guide and support developers in their daily activities. Among other things, SI supports the analysis, monitoring and optimization of code quality. Dashboards are frequently used to display the results of SI methods. Some tools also integrate SI techniques directly into the workflow: they display information in the corresponding application (for example as part of continuous integration runs) or send notifications about new findings to the inbox or to communication tools (e.g. Slack). The following is a list of activities along the software development life cycle, with a description of the SI techniques intended to support them. Some of these techniques are already well established, while others are still the subject of active research and have only been adapted to industry environments to a limited degree. We’d like to provide an outlook on these techniques and describe new developments:

Requirements engineering

Tracing
Collecting the requirements is the first step in planning a software system. The requirements describe the desired functionality (features) and the behavior (acceptance criteria) of the system. With solid traceability, the status and work steps can be tracked across multiple systems (version control system, issue tracking system, communication channels), making it easier to discover undesired behavior and technical debt. That’s why the traceability of requirements in all phases of software development is a firmly established task that can drive down maintenance costs and simplify resource management.
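One common way to obtain such traces is to look for issue keys in commit messages. The following is a minimal sketch of that idea, not a complete tracing tool. The key format (for example "SHOP-42"), the function names and the sample history are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Assumption: issue keys look like "SHOP-42"; adjust to the tracker in use.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def link_commits_to_issues(commits):
    """Map each issue key to the commit hashes whose message mentions it.

    `commits` is an iterable of (commit_hash, message) pairs.
    """
    links = defaultdict(list)
    for commit_hash, message in commits:
        # Deduplicate keys per commit while keeping first-seen order.
        for key in dict.fromkeys(ISSUE_KEY.findall(message)):
            links[key].append(commit_hash)
    return dict(links)


def untraced_commits(commits):
    """Return the hashes of commits that reference no issue at all."""
    return [h for h, message in commits if not ISSUE_KEY.search(message)]


if __name__ == "__main__":
    history = [
        ("a1f3", "SHOP-42: add voucher validation"),
        ("b7c9", "Fix typo in README"),
        ("c210", "SHOP-42, SHOP-57: refactor checkout"),
    ]
    print(link_commits_to_issues(history))
    print(untraced_commits(history))
```

Commits without any issue reference are listed separately because they are the places where the trace between requirement and code is broken.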
Modeling and verification
The modeling of business application scenarios can help developers understand the impact of the desired functionality and provide a guideline for the implementation. Requirements modeling is a complex activity, however. Acting as a guide, SI can help detect anomalies in such workflows. SI thus helps developers understand the tasks and identify the requirements documents that may need further attention and modification.
Prioritization
Once requirements have been collected, SI can help prioritize them by drawing on earlier requirements descriptions and prior development activities. An overview of all requirements, together with their automatic classification, saves effort and makes the meta information related to the requirements easier to visualize.
Resource estimation
The ability to make reliable predictions about the development effort and the risk of a delay helps software companies improve customer satisfaction while reducing costs and optimizing delivery speed. Today most software companies operate in environments where development teams are expected to respond quickly to change requests and to deliver business value continuously. Machine learning has recently gained popularity in the areas of resource estimation and risk forecasting. For this purpose, information from prior resource estimates is used to evaluate current development tasks.
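The idea of reusing prior estimates can be sketched with a simple nearest-neighbour lookup: a new task receives the average effort of the most similar past tasks. This is a deliberately simple stand-in for a trained machine learning model. The word-overlap similarity, the function names and the sample data are assumptions for illustration.

```python
def tokenize(text):
    """Split a task description into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_effort(new_task, history, k=3):
    """Average the effort of the k past tasks most similar to `new_task`.

    `history` is a list of (description, actual_effort_in_hours) pairs.
    """
    if not history:
        raise ValueError("history must not be empty")
    new_tokens = tokenize(new_task)
    ranked = sorted(
        history,
        key=lambda item: jaccard(new_tokens, tokenize(item[0])),
        reverse=True,
    )
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


if __name__ == "__main__":
    past_tasks = [
        ("add login form", 8),
        ("add login page validation", 12),
        ("migrate database schema", 40),
    ]
    print(estimate_effort("add login validation", past_tasks, k=2))
```

In practice the similarity measure would use richer features (components touched, team, task type), but the principle of learning from completed tasks stays the same.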

Software design

Modeling
Software design modeling can help developers understand errors in the software design and the module dependencies. SI methods analyze these models, an approach that can reveal code duplicates, the excessive use of inheritance and the improper use of design patterns. These techniques can help to detect error-prone areas of the code and guide restructuring measures for improving software quality and maintainability.
Pattern forecasting
Determining the best design for a software component is not a trivial development task. ML4SE research proposes detecting and learning design patterns in existing code bases. Using the current code structure or the requirements documentation, researchers have developed models that are trained to predict the design pattern best suited to the application. This field of research is still quite new, however, and not yet established in industry.

Implementation

Code generation
Code generation approaches can be used to create individual statements or entire classes based on the current development context. These methods are aimed at faster and more efficient development. Multiple studies have shown that autocomplete functions are among the most important features of integrated development environments. While traditional approaches use alphabetical or usage-based recommendation systems for code completion, newer approaches use machine learning to map new code to already-known code structures and to generate code based on code written under similar circumstances. Although such models are highly promising, current approaches are often not usable without modification.
Logging and monitoring
Logging is an important task in software development since software logs can be used for error detection (e.g. debugging or monitoring in production environments), for uncovering performance issues (e.g. call tracing) or for determining the cause of unexpected behavior during development (e.g. logging the call hierarchy). Logging can also be a challenge, however. In practice, developers often spend considerable amounts of time maintaining logging statements and their infrastructure. Without a thorough plan, logging can also produce either too much information or information of no value. SI techniques can help developers decide where to place logging statements, which variables are worth logging and which log levels should be used for specific scenarios. Researchers are also examining methods for the automated analysis and aggregation of log messages so that monitoring resources can be applied to more important problems than the manual analysis of logs.
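A very simple form of log aggregation replaces the variable parts of each message with placeholders and counts the resulting templates. The sketch below illustrates that idea with two hand-written masking rules. The rules, names and sample messages are assumptions, and research approaches to log parsing are considerably more elaborate.

```python
import re
from collections import Counter

# Assumption: numbers and hex identifiers are the only variable parts.
# Order matters: hex values are masked before plain numbers.
_MASKS = [
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<NUM>"),
]


def to_template(message):
    """Replace variable parts of a log message with placeholders."""
    for pattern, placeholder in _MASKS:
        message = pattern.sub(placeholder, message)
    return message


def aggregate(messages):
    """Count how often each message template occurs."""
    return Counter(to_template(m) for m in messages)


if __name__ == "__main__":
    logs = [
        "Request 17 took 120 ms",
        "Request 18 took 95 ms",
        "Cache miss for key 0x1f",
    ]
    for template, count in aggregate(logs).most_common():
        print(count, template)
```

Grouping by template turns thousands of individual lines into a short list of recurring events, which is easier to monitor than the raw log.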
Static code analysis
Established static code analysis tools highlight technical flaws or security vulnerabilities in the code based on defined rules. These tools can increasingly be combined with machine learning models that analyze such metrics more dynamically and use historical software data to create and apply new rules. This helps with the management of technical debt and with resource management for software development and maintenance activities.
Error prognosis
Using information from the version control history, error prognosis (also known as defect prediction) assigns an error risk to modified files or newly added commits. This can help to detect errors soon after they are introduced. In this way SI can help improve code quality and reduce the resources needed for maintenance activities. Error prognosis can be easily integrated into familiar workflows such as continuous integration and monitoring processes in order to give developers direct support and feedback and to focus effort on the more error-prone areas. Nonetheless, there are few studies that apply and evaluate these methods in industry projects.
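To illustrate how a risk value can be derived from the version control history, the sketch below scores each file by the share of its past commits that were bug fixes and rates a new commit by the riskiest file it touches. This is a heuristic stand-in for a trained prediction model. The keyword list, the function names and the sample data are assumptions.

```python
from collections import Counter

# Assumption: a commit counts as a bug fix if its message contains one of
# these words. This is crude (it also matches "prefix"), but easy to follow.
BUGFIX_WORDS = ("fix", "bug", "defect")


def file_risk(commits):
    """Return {file: share of its past commits that were bug fixes}.

    `commits` is an iterable of (message, list_of_changed_files) pairs.
    """
    total, fixes = Counter(), Counter()
    for message, files in commits:
        is_fix = any(word in message.lower() for word in BUGFIX_WORDS)
        for path in files:
            total[path] += 1
            if is_fix:
                fixes[path] += 1
    return {path: fixes[path] / total[path] for path in total}


def commit_risk(changed_files, risk):
    """Risk of a new commit: the highest risk among the files it touches.

    Files without history are treated as risk 0.0.
    """
    return max((risk.get(path, 0.0) for path in changed_files), default=0.0)


if __name__ == "__main__":
    history = [
        ("Fix null pointer in cart", ["cart.py"]),
        ("Add cart totals", ["cart.py", "ui.py"]),
        ("Bugfix: rounding", ["cart.py"]),
        ("Restyle buttons", ["ui.py"]),
    ]
    risk = file_risk(history)
    print(risk)
    print(commit_risk(["ui.py", "cart.py"], risk))
```

A check like this could run in a continuous integration job and flag commits above a chosen threshold for closer review.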

Software testing

Test gap analysis
Software testing is an important task for ensuring quality and avoiding regressions. Deciding which parts of the software should be tested, and how, is not a trivial job, however. Using information about code changes and test coverage, a technique referred to as test gap analysis can be applied to software repositories to find changed code that has not been tested. The analysis makes it possible to prioritize test resources and detect error-prone software modules in the early stages of the development cycle.
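At its core, test gap analysis is a set difference between changed lines and covered lines. A minimal sketch, assuming both inputs are already available as mappings from file path to line numbers (in practice they would come from the version control diff and a coverage report); the function name and data are illustrative:

```python
def find_test_gaps(changed_lines, covered_lines):
    """Return {file: sorted changed lines that no test executed}.

    Both arguments map a file path to a set of line numbers.
    """
    gaps = {}
    for path, lines in changed_lines.items():
        untested = lines - covered_lines.get(path, set())
        if untested:
            gaps[path] = sorted(untested)
    return gaps


if __name__ == "__main__":
    changed = {"cart.py": {10, 11, 12}, "ui.py": {5}}
    covered = {"cart.py": {10, 12, 30}}
    print(find_test_gaps(changed, covered))
```

Files that were changed but do not appear in the coverage data at all are reported in full, since none of their changed lines were executed by a test.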
Regression test selection
Regression testing is an important but costly activity that is performed each time a program is modified to ensure that the changes do not introduce new problems or flaws. Test suites can be very comprehensive, and executing them after each change can be extremely time-consuming. An important research question is therefore how to select a relevant subset of test cases that minimizes test time and test effort without compromising the validity of the test process.
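One straightforward selection strategy runs only those tests that exercise at least one changed file. The sketch below assumes a per-test coverage map is available from an earlier full run. The names and data are illustrative, and the selection is only as reliable as that map is current.

```python
def select_tests(changed_files, coverage_map):
    """Return the tests that exercise at least one changed file.

    `coverage_map` maps a test name to the set of files it executes.
    """
    changed = set(changed_files)
    return sorted(
        test for test, files in coverage_map.items() if files & changed
    )


if __name__ == "__main__":
    coverage = {
        "test_cart": {"cart.py", "money.py"},
        "test_ui": {"ui.py"},
        "test_export": {"export.py", "money.py"},
    }
    print(select_tests(["money.py"], coverage))
```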

Integration and workflow

Integration support
It can be difficult to maintain an overview of pending pull requests and desired changes in active software projects. SI can help to present completed pull requests and merge them automatically, provided that defined criteria are fulfilled. These applications can also track the internal development of features and automatically perform commit message adjustments, squashing and rebasing.
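What "defined criteria" might look like can be sketched as a small rule check. The criteria chosen here (approvals, CI status, conflicts, blocking labels), the `PullRequest` fields and the label name are assumptions for illustration, not the rules of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # Hypothetical, simplified view of a pull request.
    number: int
    approvals: int
    ci_passed: bool
    has_conflicts: bool
    labels: frozenset = frozenset()


def merge_blockers(pr, required_approvals=1, blocking_labels=("do-not-merge",)):
    """Return the reasons why `pr` cannot be merged automatically.

    An empty list means all criteria are fulfilled.
    """
    blockers = []
    if pr.approvals < required_approvals:
        blockers.append("missing approvals")
    if not pr.ci_passed:
        blockers.append("CI failed or pending")
    if pr.has_conflicts:
        blockers.append("merge conflicts")
    if pr.labels & set(blocking_labels):
        blockers.append("blocking label")
    return blockers


if __name__ == "__main__":
    ready = PullRequest(1, approvals=2, ci_passed=True, has_conflicts=False)
    blocked = PullRequest(2, 0, False, False, frozenset({"do-not-merge"}))
    print(merge_blockers(ready))
    print(merge_blockers(blocked))
```

Returning the list of blockers rather than a plain yes/no makes it possible to tell the author what is still missing.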
Pre-build and test provisioning
Not all bugs are discovered during development. SI can support quality assurance by making newly developed features available before they are integrated into the production environment. This reduces the effort involved in setting up an environment for manual testing. Furthermore, the system under test can be monitored in a production-like setting. Making features available on a staging system before the changes are integrated and deployed in production is a proven practice. Problematic changes can be detected earlier, while they are still easy to undo, thus saving manual effort.

Team management and process quality

Flow and workload monitoring
Open bug reports and pull requests can get out of control over time. Some bug reports may no longer even be relevant. Automatically analyzing open pull requests and development tasks makes it possible to maintain an overview, keep progress steady and prioritize the workload. Dashboard solutions can point to out-of-date bug reports and to tasks for which no progress has been reported or which are stuck in development.
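Finding items that are stuck can be as simple as comparing the last update time with a threshold. A minimal sketch, assuming each open item is available as an identifier together with its last-updated timestamp; the 30-day default, the names and the sample data are illustrative:

```python
from datetime import datetime, timedelta


def stale_items(items, now, max_idle_days=30):
    """Return the ids of items idle for longer than `max_idle_days`.

    `items` is an iterable of (item_id, last_updated) pairs, where
    `last_updated` is a datetime. The longest-idle item comes first.
    """
    limit = timedelta(days=max_idle_days)
    idle = [
        (now - updated, item_id)
        for item_id, updated in items
        if now - updated > limit
    ]
    return [item_id for _, item_id in sorted(idle, reverse=True)]


if __name__ == "__main__":
    open_items = [
        ("BUG-1", datetime(2024, 5, 25)),
        ("BUG-2", datetime(2024, 1, 10)),
        ("PR-7", datetime(2024, 4, 1)),
    ]
    print(stale_items(open_items, now=datetime(2024, 6, 1)))
```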
Work allocation
By analyzing prior development activities, code ownership and the implicit experience of developers in specific areas of the software, SI can be used to select suitable people for various tasks and find experts for specific issues. Once the development activity is complete, developers can also be recommended for manual reviews based on their familiarity with the code structures and with similar functional requirements. This not only helps with quality assurance, but also fosters knowledge exchange.
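A basic reviewer recommendation can be derived from the change history alone: whoever has changed the affected files most often is likely to know them best. The sketch below counts past changes per developer and excludes the author of the change under review. The names and data are illustrative, and real tools typically also weigh recency and current workload.

```python
from collections import Counter


def recommend_reviewers(changed_files, history, author, top_n=2):
    """Suggest reviewers by how often they changed the affected files.

    `history` is an iterable of (developer, file) pairs, one per past change.
    The author of the change under review is never suggested.
    """
    changed = set(changed_files)
    scores = Counter(
        developer
        for developer, path in history
        if path in changed and developer != author
    )
    return [developer for developer, _ in scores.most_common(top_n)]


if __name__ == "__main__":
    past_changes = [
        ("ana", "cart.py"), ("ana", "cart.py"),
        ("ben", "cart.py"), ("ben", "ui.py"),
        ("cy", "cart.py"), ("cy", "cart.py"), ("cy", "cart.py"),
    ]
    print(recommend_reviewers(["cart.py"], past_changes, author="cy"))
```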

Which data is suitable for SI?

For all of these application scenarios, suitable data is required that not only contains information about the problem, but also allows conclusions to be drawn from newly-acquired data. Data has to be continuously collected and analyzed from all possible software development activities. Depending on the application scenario and the tools used for data monitoring in the software development cycle, the quality of the SI techniques that are employed can vary. Having a comprehensive and detailed plan for collecting and processing the data is an important building block in utilizing software intelligence. Some of the methods introduced above require historical development data in order to make concrete and accurate predictions regarding the status of the software development activity. We categorize data from the software development life cycle into three areas: process-related, product-related and code-related data.

Process data
Process-related data refers to data collected during management of the project life cycle. Obtained by analyzing the process, this data might include metrics related to workload, team size and velocity, or resource utilization per activity. Process-related data is also collected implicitly through:

issue tracking systems and tools for requirements documentation (Jira, Bugzilla, GitHub or Confluence) when planning the software development activities
tools for communication (Slack, e-mail, comment sections) and reviews (Gerrit, GitHub) when development activities are completed
automated processes such as continuous integration runs (Jenkins, CircleCI, Travis CI, TeamCity), which contain information about test results (coverage, errors), lint results and build results

Code data
At a lower abstraction level, code-related data pertains to changes to the code, tests and software architecture. Static code analysis tools, such as SonarQube or Infer, analyze the code directly from the software repository to acquire new information about it. Architectural data can be obtained from modeling artifacts, such as UML diagrams. Software artifacts can be analyzed not only to obtain code metrics, but also to retrieve the commit history (including commits, branches, author experience and code ownership) from the version control system (e.g. Git, SVN, Mercurial). Data pertaining to code changes and patches, dependencies and code evolution can be collected as well.
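As an illustration of extracting change data from a version control system, the sketch below sums added and deleted lines per file from the text output of `git log --numstat`. It assumes the log was produced with a custom header line per commit that starts with "@" (for example via `--pretty=format:"@%H%x09%an"`); that header convention, the function name and the sample text are assumptions made for this example.

```python
from collections import defaultdict


def churn_per_file(log_text):
    """Sum added plus deleted lines per file from `git log --numstat` text.

    Commit header lines are assumed to start with "@" and are skipped.
    Each numstat line has the form "<added>\t<deleted>\t<path>"; binary
    files report "-" for both counts and contribute 0.
    """
    churn = defaultdict(int)
    for line in log_text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue
        added, deleted, path = line.split("\t", 2)
        churn[path] += 0 if added == "-" else int(added)
        churn[path] += 0 if deleted == "-" else int(deleted)
    return dict(churn)


if __name__ == "__main__":
    sample = (
        "@abc123\tAna\n"
        "3\t1\tcart.py\n"
        "-\t-\tlogo.png\n"
        "\n"
        "@def456\tBen\n"
        "2\t2\tcart.py\n"
    )
    print(churn_per_file(sample))
```

Files with high churn are frequent candidates for closer inspection, and the same parsed data can feed ownership or error prognosis analyses.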
Product data
Product-related data is generated more on the operational side of the software development process. For open source tools, usage statistics that describe the community and the most recent activities can be found on code hosting platforms (e.g. GitHub, GitLab, Bitbucket). Monitoring the production instances of the software is an important task. Data from logging and monitoring activities (e.g. with the ELK stack), such as server load and outage information, can be retrieved via cloud or incident reporting tools. If deployment is automated in a continuous pipeline (CircleCI, Travis CI), data about the deployment process and possible errors can be retrieved as well.

Summary

With the steadily growing volumes of data in software engineering processes (collected either explicitly or implicitly), analyzing this data can offer more and more decision support. SI and ML4SE are gaining increasing attention from research and industry. The use of SI techniques can help save time and resources when developing and provisioning software systems.

Exploiting the full potential of SI also requires giving thought to the available data and how it can be collected and organized. Data can be scattered around and difficult to access. While raw data from individual sources can already be utilized for simple application scenarios, a combination of various sources is often necessary to generate meaningful data sets and train machine learning models. Tracking activities during the entire software development process is especially important.

The Center for Code Excellence (CCE) is planning a detailed series on individual SI topics and on further applications in various areas of software engineering, from error prognosis and health to team management.


Center for Code Excellence
