The growth of data intensive university research
Infrastructure reliability and usability are key to making data-intensive research work, says Aalto University's Mikko Hakala
Doctor of Technology Mikko Hakala, from Aalto University’s computing sciences services unit, looks at the growth of data-intensive research and at how to meet and exceed the requirements of today’s and tomorrow’s research needs.
Aalto University was founded in 2010 when the Helsinki University of Technology, the Helsinki School of Economics, and the University of Art and Design were merged. We have six schools with almost 20,000 students and 4,500 members of staff.
In the sciences we have hundreds of active projects, spanning data-intensive fields such as physics, computer science, neuroscience, economics, and signal processing. The need for research data is growing with the increase in computing and modeling, as well as with the growing number of users. Quite simply, research is becoming more data-intensive as the digital economy and the Internet of Things advance.
The amount of data needed for research has grown tenfold within the last four years, so storage capacity and computing power are a major concern for us. Our goal at the university’s computing sciences services unit is to provide high-class IT services so that researchers and users can access and utilise the data and resources they need.
We also collaborate with other organisations and universities, both in Finland and abroad. To put the research data in some context, the majority of Aalto University’s stored data is in active research usage. We want to ensure that our researchers can access the data whenever and wherever they need it – something our storage and compute infrastructure has, in the past, struggled with.
I know the problems all too well; my own experience is that of a researcher interested in data-driven computational materials science. Specifically, I look at computational research on the structure and electrochemistry of catalyst materials, using data for predictive modeling. My point is, I know the challenges firsthand for both our researchers and the University.
The solution to the ever-growing demand for data and compute can be complex. Buying enough storage to meet current demand is rarely the best option: as new researchers, projects, and data sets come to us, we would quickly fill up our storage infrastructure.
Similarly with compute: buying to meet current demand for HPC services means the technology would soon need further investment as our university continues to grow. Investing in new storage and compute requires planning for growth, both in the numbers of staff, students, and researchers and in the volumes of data.
I don’t know of many universities that have very large IT teams, so I believe it’s important not just to look at the technology you need to meet research and data demands, but also at the company or companies delivering the solutions to you – and to ask how they will help with the design, build, integration, and delivery of what you need.
Choosing the right solutions
Performance and reliability are both very important to us, as we provide a critical service – problems with the system would halt practically all research work.
It was through competitive tendering that we chose to work with HPE for its servers, and DDN for its storage systems. Public cloud services weren’t an option for us to store the data being used for research. With HPE and DDN, we ended up with a solution that has four times more capacity and a fivefold increase in performance compared to our previous system.
In addition, it’s built to last too – the lifecycle is five years, but we can easily add more storage capacity if needed.
A key requirement of a university HPC solution is that it is reliable and efficient. We specifically needed a storage environment that is fully redundant, across all hardware and software components, to ensure operational reliability.
On top of that, we want visibility into the data we’re holding in storage. Being able to recognise which data is in active use and which is not allows us to improve the cost-efficiency of our storage. Moving inactive data to cheaper discs for long-term storage makes a lot of sense, so the ability to see which data can be transferred is really important.
But in the end, it’s about having an infrastructure that is reliable and easy to use for our researchers.
In almost all fields of scientific research, the amount of data is growing, and although we’ve only recently invested in new systems, the utilisation rate of our supercomputer is already very high. Having the capacity and power for production environments, as well as the ability to scale up our infrastructure, is vital for our university.