Azure for Executives

Next-Level High Performance Computing on Azure with Tim Carroll and Evan Burness

Episode Summary

In this episode, we talk about high-performance computing (HPC). HPC is important in many industries and lets us work with data at scale, operate on large media files, work with large AI and ML workloads, and run big batches of worker jobs.

Episode Notes

Tim Carroll, Director for HPC & AI for Research at Microsoft, and Evan Burness, Principal Program Manager for HPC & Big Compute in Microsoft Azure, tell us what high-performance computing is and why it’s unique, how it is being used in the industry today, the benefits of the new HPC VM over what people have used in the past on Azure, and for people who are not using HPC today, how they can get it up and running!

Episode Links:

Episode Transcript
Microsoft Azure HPC
Azure HPC by industry
More performance and choice with new Azure HBv3 virtual machines for HPC

Guests:

Evan Burness is the Principal Program Manager for HPC & Big Compute in Microsoft Azure.

Follow him on LinkedIn.

Tim Carroll is the Director for HPC & AI for Research. He is also an HPC Global Black Belt at Microsoft.

Follow him on LinkedIn.

Hosts:

Paul Maher is General Manager of the Marketplace Onboarding, Enablement, and Growth team at Microsoft. Follow him on LinkedIn and Twitter.

David Starr is a Principal Azure Solutions Architect in the Marketplace Onboarding, Enablement, and Growth team at Microsoft.

Follow him on LinkedIn and Twitter.

Episode Transcription

DAVID: Welcome to the Azure for Executives podcast, the show for technology leaders. This podcast covers trends and technologies in industries and how Microsoft Azure is enabling them. Here, you'll hear from thought leaders in various industries and technologies on topics important to you. You'll also learn how to partner with Microsoft to enable your organization and your customers with Microsoft Azure.

DAVID: Hello, listeners. Today we're talking about high-performance computing or HPC, and this is important in a lot of different industries. HPC lets us work with data at scale, operate on large media files, work with large AI and ML models, things like that, or even run big batches of worker jobs. We have two folks from Microsoft HPC teams here to talk with us today about the latest in high-performance computing on Azure. So with that, I'll introduce our guests. Firstly, we have Evan Burness who is the Principal Program Manager for HPC & Big Compute in Microsoft Azure. Evan, welcome to the show.

EVAN: Thank you and good morning to you and all the listeners.

DAVID: And we also have Tim Carroll who is Director for HPC & AI for Research. He is also an HPC Global Black Belt at Microsoft. Tim, welcome to you.

TIM: Thanks for having me.

DAVID: You bet. I'll start off by saying something [laughs] kind of trite I guess, and that is that I recently built myself this hugely powerful new desktop with crazy specs. But I know that HPC is a completely different ball of wax. So, can you tell us a little bit about high-performance computing and why it's such a unique beast? What is HPC?

EVAN: Great question. And that's a great question actually to start off with. Normally when we meet with folks who are trying to understand what this space is, this is one of the first questions they ask. High-performance computing fundamentally is when you have a computational problem you're trying to solve that conventional or commodity computing such as a desktop, as you just mentioned, or a laptop or a smartphone is either incapable of solving at all or is capable of solving it but it would take so long to do so that it is pragmatically unusable. So think about massive, massive simulations that require hundreds of terabytes of data to reside in main memory. Your desktop probably has 16 gigabytes, 64 gigabytes. So as a general matter, you're talking about problems that might be orders of magnitude greater than a standard computer can solve on its own. And so what you have to bring together is a series of computers working in tandem with one another orchestrated with various software tools, sometimes with things like specialized networking or specialized storage to keep the data flowing in the right ways to solve these really, really advanced problems that eventually commodity computing might get to, but they might not get to for 20 years at a time. And so for organizations and researchers that have a pressing need to solve these problems right now, high-performance computing is one of the few tools in existence that enables them to get into this sort of time machine and solve problems today that otherwise, they would not be able to solve for many years to come.

TIM: Yeah. And David, I'll give you just a quick example in terms of how people can understand that this impacts everybody every single day. There are really two pieces of what we do. And if you think of high-performance computing problems as the weather problem, there's one thing to be able to write the software that forecasts the weather. And so if I can write a piece of software that forecasts tomorrow's weather with 100% accuracy, that's incredibly valuable. But if it takes me a week to do the computation so that I get my 100% accurate forecast a week after it happened, then that value goes down dramatically. So what we do in our segment of Azure is that we ensure that our customers are going to have the ability when they have come up with that weather forecast piece of software, metaphorically speaking, that we give them the infrastructure that they need to be able to deliver that answer in the time that matters so that they can create value out of it, whether it's on the public sector and health side, or whether it's in financial services or manufacturing or anything else.

PAUL: Fantastic. Well, thank you for that, both Evan and Tim. So let's explore a little bit. So we've done the preamble of what HPC is and I think we're starting to get a feel for the scale and the complexities and the opportunity for all listeners out there. Evan and Tim, super excited to have you on the show. You're coming from an extensive background within HPC. So perhaps I'll ask you Evan to share a little bit about yourself. And then maybe we could double-click a little bit in terms of when we think about HPC as it pertains to industry, maybe you could talk a little bit about how HPC is being used and feel free to share some real-world examples as well just so we can share with our listeners the use cases. So, Evan, over to you.

EVAN: Sure. So my background is principally coming from the National Center for Supercomputing Applications at the University of Illinois. NCSA is one of the five original supercomputing centers created by the U.S. National Science Foundation in the 1980s with a specific charter to go after some of the biggest basic and applied research problems in existence. And then because NCSA got really good at doing all things high-performance computing and the value of it for solving advanced problems became readily apparent to commercial industry, NCSA also became one of the leading sites in the country for working with private industry to take the lessons learned from the basic and applied research spaces and push it into the private sector. And so I helped run a program there called the Private Sector Program for the better part of a decade. And my role there was really designing and operating large-scale supercomputers tailored to the needs of private industry to essentially accelerate the knowledge transfer we were getting on the basic and applied research side and inject into the learnings and the innovation agendas of a lot of the commercial companies we were working with but doing so from a supercomputing lens.

Tim really hit the nail on the head of why my role at NCSA would have existed, why that kind of program would have existed and why a lot of that maps to what we're doing now in Azure, just in this new era of cloud which is that high-performance computing it sounds like a niche thing, but it really is everywhere. I mean, every industry vertical in existence does enormous amounts of high-performance computing these days. This is not an extreme niche capability. Every industry at this juncture, the world now is fundamentally dominated by problems that you need to throw massive amounts of data at or massive amounts of compute at. And so the practice of high-performance computing basically applies to every category of customer we in Azure are going to be engaged with. And that can be anything as wide-ranging as very large-scale partnerships like those we have with OpenAI who are right at the bleeding edge of artificial general intelligence, which is a field that has no visible upper bound. The amount of compute and data that is needed to solve some of those highly advanced problems around making AI that can think like a human being to use cases that have existed for a very long time are still very, very far away from being able to be resolved in the way that people want to.

So, as an example, think about something very common like your car that you get into or at least many of us got into before the last year or so. Car companies when they want to test and validate that their product is safe, they need to do computational simulations of cars in action. And the act of simulating that car with all the physics involved, all the material science, how different things such as pressures from wind or the earth underneath it are exerting forces on the car, those are things that they're extremely difficult to solve. We can only simulate still probably a small fraction of the time that the car will be in existence whereas car companies want to simulate how their car is going to operate for the full lifetime of the car. So HPC it's been around in all these different verticals for a long time. It is growing rapidly within those spaces. But the exciting thing about the space is that it grows and evolves over time. A lot of those lessons learned and practices are now being adopted in other areas such as fields that want to turbocharge their industry with AI.

PAUL: Yeah, that makes a lot of sense, Evan. And just adding to what you've said then, so HPC has been around for a while. So for those of you who are familiar, of course, you've made probably large investments on-premises. And I think most recently, we've seen trends, of course, in cloud, so cloud is obviously providing the ability to -- when you talk about supercomputing, cloud is really helping enable HPC at those supercomputing levels being able to really sort of scale in that pay as you use model. So I think we've seen some great migrations or greenfield projects starting out on cloud. So I think cloud is being super transformational in terms of folks who are coming from HPC investments and being able to take that forward but also, I think making HPC much more accessible to the masses, which is great.

Adding to a couple of things you were saying, Evan, as well, whether you know it or not, there's lots of stuff happening around driving. As you said, when you think about HPC, it's being able to leverage multiple resources; it could be driving complex calculations. And we're seeing that across a diverse set of industries and that could be anything from things like financial modeling and financial services and doing financial projections to as you mentioned, in the manufacturing space lots of good stuff going on there, for example, things like fluid dynamics and even healthcare. Healthcare is super interesting obviously given everything that's going on around things like medical research.

And so one final question for you, Evan, and Tim feel free to chime in if you have some additional comments as well. We mentioned the word supercomputing and both yourself and Tim are coming from obviously a great HPC background. Give me some numbers to wow the audience. So when you talk about supercomputing and large-scale compute, what kind of magnitude are we talking about in terms of the number of cores and what's happening? Just so we set the scene of going from, as David said earlier on, moving from your personal computer to HPC supercomputing scale.

TIM: Paul, I can perhaps provide some context hopefully for the folks who do understand the HPC side of it and get wowed by numbers, of course, but even folks who don't necessarily but they understand that this is going to be an increasingly relevant part of their business. There's a software program called WRF, which is a weather modeling code. It's an open-source weather modeling code. And there's a community of about 40,000 people around the world that contribute to this open-source weather model. What's significant about it is that it has traditionally been -- because it was able to be universally run, people could run it on anywhere from a few hundred cores up to several thousand cores. And it was only some of the very largest supercomputers, multi-million dollar supercomputers, around the world that people even had enough cores, 10,000, 15,000, 20,000 cores to be able to run that model at scale. And remember what I had talked about before that the value of running weather models is the faster you can run the model, the more accurate it is, and the more timely it is, both of which you want out of weather.

So this is over a year ago, one of the folks on our team ran that at over 80,000 cores and that number has only gone up since then. But at the time that we ran it, we were unaware of anybody else globally on an on-prem system who was able to run that code at scale and be highly performant at 80,000 cores. But that wasn't the significant piece of this. The significant piece was not hey, we were able to do something that you wouldn't be able to do on an on-prem system or that somebody had not done. It was more…if you remember that 40,000-person user base that I mentioned, that model because of the way that we ran it could have been replicated by any of that user base of WRF users worldwide who had access to Azure and access to the public files that were the Hurricane Maria data set that we used in order to simulate it.

So the real value of cloud is yes, that we have the ability to throw core counts and numbers at it that people just didn't think were possible before. But what's more important about it is that we give anybody the ability to do that. And what's fundamentally different about the space that we're in now versus say 15 years ago or certainly 20 years ago is that before it was a relatively small community of people who could get access to the number of cores to run at a massive scale in order to be meaningful. And what Azure has really done is they've opened that up. Any researcher with a good idea and with access has the ability to run at scales that were just not possible last year, the year before, the year before. And that's only going to get easier and more performant as we go forward.

DAVID: That was a great trip through the reality of HPC. And to go along with that, you guys are here to announce a great, new exciting HPC VM that's available on Azure, the HBv3-series of VMs. So tell us about some of the benefits of this new HPC VM over what people have been using on Azure so far.

EVAN: Sure. So HBv3-series is best understood in the context of our two or three-year journey we've been on at Azure to radically accelerate the velocity at which we bring the latest value-adding technology to our HPC customers and the level of performance and value for that performance that Azure can bring. That sounds almost like something that'd be obvious to go do, but it's important to understand that high-performance computing has been a space that specialized infrastructure run onsite by expert teams inside organizations has been something that the cloud has had difficulty matching for a long time. And one thing we've really tried to focus on over the last few years at Azure is flipping that narrative and actually having products and services and value propositions that are at least as good as what customers are doing on-premises. And HBv3 is kind of this milestone for us where now all of a sudden we're actually ahead of where the on-premises HPC market is.

HBv3-series is our fastest product launch ever. It went into general availability on March 15th and was announced by Microsoft Executive Vice President, Jason Zander, at AMD's launch event for its EPYC 7003 Series data center processor, otherwise known as Milan. Milan is the same CPU technology that will power the world's first exascale supercomputer later this year. An exaFLOP is a billion billion calculations per second. So if you want to be a processor in that computer, you really got to have your A-game. And we have it in Azure today about eight, nine months before it will debut in that world's largest supercomputer. And this is an important milestone for us in terms of being available day-and-date when these technologies launch because as we mentioned, our customers have problems where they have an insatiable appetite for performance, scalability, and the value they get for that. So it's great to launch a product for HPC but if you launch it too late, a lot of that product is diminished. So the time to which you bring it to market is super critical for our partners.

HBv3 from a performance standpoint does some really, really cool things for key workloads that a lot of our customers have been very interested in. For relatively small jobs such as those where we have customers who have software-license-bound applications where the cost of the software actually outweighs the cost of the hardware and therefore they actually want to restrict the number of processor cores the application runs on, HBv3 is up to three times faster than our last…let's call it small job optimized HPC VM. A 3x uplift inside of a fixed number of cores is a huge amount of performance uplift especially when most of our customers’ TCO is going to be dominated by those software licensing costs.

And then the other thing I would point customers to is really big, scalable jobs, jobs that are running like those that Tim mentioned 20,000 cores, 30,000 cores, 80,000 cores. The Milan processor, otherwise known as EPYC 7003 Series, does some things that take even what Tim mentioned before that we did on the HBv2-series, the predecessor, at 80,000 cores, and extends it a ton further. The new processor does some really, really good things for scalable jobs for large-scale HPC workloads, especially those that are tightly coupled. We see performance improvements as high as 90% over the previous-gen HBv2-series. Now, to put that in context, Azure was, even before HBv3, scaling HPC workloads to more than 12 times higher than what has been demonstrated by any other public cloud provider. And the reason why that's relevant is because, as Tim mentioned, we're not aiming to be, let's say, good among the cloud providers. To be a good partner to Azure's extensive network of partners, to be the right hybrid cloud provider for our customers, we have resources that mimic and extend the capabilities that they have on-premises so that they can get the same things or even better in the public cloud. So our design point, our reference point is really what these customers are already doing on-premises and what they'd like to be able to do on-premises but maybe can't either because of budgetary issues, constraints around their data center capacity, things of that nature. And so HBv3-series is when you can launch something that delivers performance and value leadership for small jobs to the largest jobs that customers have to run. That full spectrum coverage delivered time to market ahead of anyone else is a really big deal. So we're really excited about the launch. The customer response has been awesome, and we're really excited to start rolling it out to more regions and more customers.

PAUL: Fantastic news, Evan. So now that we have the new HBv3 VM series available, let's start with if people are using HPC on Azure today, how should they think about this new series of VMs versus perhaps what they're using today? What guidance would you give them?

TIM: What is most important, Paul, whether you are an established HPC person and you’ve been doing this for the last 15 years or whether your organization is now just starting to identify and build workflows that take advantage of this kind of infrastructure, the important part of it, what cloud has done is it gives you the ability to take your workflow and define the right infrastructure for it as opposed to what we did traditionally where we had on-prem based systems where we took our workflow and we tried to figure out how to optimize our workflow for the static infrastructure that was sitting in the data center. And that's a fundamentally different process. And so the best way to understand how to take advantage of the new SKU is really to understand your own workflow or even better, let us help you understand your workflow characteristics and how it is that we can take that and match it to the right infrastructure within Azure. Because part of this may be that HBv3 is the right platform to go to; the right new platform to go to is that platform that didn't exist before. It might also be that one of our existing SKUs from a price-performance perspective is a better way or a better place to run it.

It's funny, but in this space, because things move so fast, only in cloud and only in Azure could you think of a part or a SKU that is less than a year old as being the older SKU or the older part. Because traditionally, in on-prem-based systems, older meant four years or five years whenever that was spec’d and deployed. So the pace of change is so rapid within Azure that it's really more about understanding your workflow and how to take advantage of the many choices that we offer so that you can find the exact right fit for the workflow you've got, which is then going to provide you better business results coming out the backside of it.

PAUL: That's great, Tim. And so carrying on then the conversation here for folks that perhaps aren’t on HPC today or are not using HPC on Azure, how do they think about getting up and running and making those decisions to move to the cloud and use Azure on the cloud?

EVAN: Along that same trajectory, whether you are experienced and you have a full portfolio of workflows that you've been running on your on-prem system and we have to think about what we need to do in order to migrate and orchestrate those workflows, or whether you're in a position to be able to start net-new and figure out how to build these workflows in what the industry vernacular is cloud-native, Microsoft has not just invested in the infrastructure, the chips, and memory, and interconnect. Also, an extensive and thoughtful strategy has been put in place in order to build out the software environment that gives users the ability to either migrate and orchestrate existing workflows from their on-prem which is predominantly what CycleCloud is. And then we also have tools like Azure Batch which often are used for people who are building net-new, where they want to get Platform as a service where they're even abstracting out which infrastructure they go get. So in terms of helping someone start from scratch, the answer is actually the same as it is for somebody who's existing: let's figure out the problem for what you're trying to solve. And there's a very high likelihood that we have a tool that we have built that's going to enable you to speed the time from hey, here's the problem I would like to go solve until we've got a prototype up and running in Azure so that you can see what the real implications of that are and then begin to tweak and modify. And that's what we and my team do is help customers through that process.

DAVID: So I'm curious, how do folks on this journey of getting started, get in touch with you guys so that they can get that consultative service and understand what their next steps can be?

TIM: Well, within Microsoft, if a customer has a Microsoft account team assigned to them, I would certainly reach out to that Microsoft account team. We spend the vast majority of our time working with the account teams who are working with the customers in order to do that. And if somebody doesn't have an existing relationship with Microsoft, you can go to microsoft.com, and you'll find the Azure-based HPC solutions and AI solutions that we've got. And there are any number of methods to look for more help and to get contacted back by someone directly.

DAVID: That's perfect. And we're going to link some of those resources up in the show notes, so people will be able to find things really easily. And I just want to thank you both for being on the show. We're getting to a point where we need to wind it down in terms of timing. But I want to make sure that people are able to get in touch with you guys for anything they might want to follow up on. We'll put some links in the show notes and include your social on LinkedIn for both Tim and Evan. And with that, I just want to thank you both for being on the show. Really appreciate your time. And it was great to hear about this exciting, new VM opportunity people have.

TIM: Thanks a bunch, David. I'll let Evan close it out, but I'll say that HPC people have always been a slightly different breed. We're a bit of a niche in this space, but everybody who does this for a living is passionate about it. That's why we do it. And certainly, within Microsoft, we really think of ourselves as HPC and AI folks who happen to be doing these solutions in Azure as opposed to cloud folks trying to figure out how to do it. And so we love the opportunity not only to talk about this in this kind of venue, but we really welcome any customers that want to extend the conversation.

EVAN: I just want to echo Tim and say thank you for having us on. And this is obviously a space that Tim and I are passionate about. But we're passionate about it because the impact of high-performance computing to so many people is so great. And it's a pleasure to be working for an organization that recognizes that and really prioritizes that as something that Azure has to be a global leader in if we are to be as good of a partner and a provider to our customers and partners as we can be. So thanks for having us on. And, as you said, customers have many ways of reaching out to us, and we're happy to engage with them.

DAVID: Thank you for joining us for this episode of the Azure for Executives podcast. We love hearing from you, and if you have suggestions for topics, questions about issues discussed on the show, or other feedback, contact the show host, David Starr or Paul Maher through the social media links included in the show notes for each episode. We look forward to hearing from you.