The value of scalable storage in supporting HE research

By Mike Shuey, Research Infrastructure Architect at Purdue University

I’ve worked at Purdue University for more than 15 years and have spent the last 12 designing the high-end research computing systems that make scientific research possible here. I also consult on computational research demands and lead the team that implements and operates these systems for the University.

Through computer simulation, theories can be explored and experiments can be conducted that were unthinkable just a few years ago. High performance computing (HPC) is the engine that realizes such simulations – climate change predictions, molecular simulations for drug design, and exploration of new theories of particle physics, to name just a few. 

Three of Purdue University’s HPC systems are currently listed in the internationally known Top500 list of the most powerful supercomputers. Our infrastructure also includes the US’s largest academic distributed computing grid, called BoilerGrid, and the largest collection of science and medical online hubs. Our researchers use these HPC machines to store and analyse massive amounts of data, which means Purdue’s research infrastructure needs to be maintained and upgraded regularly. Machines get old and storage fills up; it’s a challenge that has to be addressed constantly. We need to provide uninterrupted access to centralised compute, network and storage resources that can support upwards of 1,000 researchers working on several hundred concurrent research projects.

With such a vast array of research efforts to support, we’ve developed one of the nation’s largest campus-wide cyber infrastructures for research. It now also includes a robust data repository, called the Data Depot – a fast, reliable, high-capacity data storage service for researchers across all fields of study. The Depot facilitates file sharing across research groups while enabling transfers of large amounts of data between local systems as well as to and from national labs, meaning we can collaborate with researchers based outside of the University on impressive national research projects.

For our latest storage upgrade, we were looking for one large, site-wide file system that could be accessed from multiple HPC systems. Our diverse set of research stakeholders all need fast, unrestricted access to the Data Depot, and their demand has caused our data volumes to increase sharply in recent years. This only intensified the need for high-performance, scalable storage.

Determining how much storage was actually needed was the first challenge. We initially looked at the needs of several of the top research areas at the university: computational nanotechnologies, aeronautical and astronautical engineering, mechanical engineering, genomics and structural biology, as well as several large projects in the life sciences. Second, the system’s design had to lend itself to drastically different use cases. It had to allow both massively parallel data access and millions of independent file requests without imposing undue performance penalties on any research group. The way different researchers retrieve their data was also a consideration: we didn’t want to create any performance bottlenecks, and we needed to sustain the highest levels of performance for all researchers, regardless of the size or demands of their research project. Finally, the system had to be extremely reliable. Beyond safeguards against traditional data loss, it needed to withstand a host of facilities failures and possible “accidental disruption” scenarios.

With those needs in mind, we set out on a thorough analysis and detailed procurement process to find a solution that met our criteria: performance, scalability and price/performance, as well as the ability to handle the diverse workloads I mentioned above. The solution we eventually chose came from DDN Storage. In total, the new system provides 6.4 petabytes of raw storage capacity as block storage for the university’s GPFS parallel file system. As well as being extremely cost-effective, the DDN solution included their new SFX data acceleration technology, which caches recently accessed data on a large pool of solid-state storage built into the same array as our primary data storage.

The improvements have been impressive: a 900% improvement in read capability at a low cost, compared with the same system without SFX. What that means for us and our research community is that projects can access millions of small files held on dedicated solid-state modules while continuing to stream very large data files at the same time. Simple data queries that used to take two minutes now take two seconds!

Currently, about 90% of data-intensive research on campus uses the Data Depot, with growth occurring across traditional and new research areas. Interestingly, we’ve seen more data growth than anticipated coming out of liberal arts and sociology departments, as well as from a group conducting ethnographic research across multiple continents. 

Data underpins almost all research today, so having a scalable storage platform that can meet the varying demands of many diverse communities is vital to any modern university. 

Images courtesy of Purdue University.
