Okay, Kolen, please take it away.

Hello... oh, wait a second. Sorry. Can you hear me? Yeah, yeah. Okay, I will just do it like this. Can you all see the slides? Yeah.

Okay, so I will start with the big picture. Basically, this is an introduction to the SO:UK Data Centre, and I will start with the big picture.

First of all, what are HPC and HTC? Let's define this terminology first. According to the so-called European Grid Infrastructure (EGI), High Throughput Computing (HTC) is a computing paradigm that focuses on the efficient execution of a large number of loosely coupled tasks. Given the minimal parallel communication requirements, the tasks can be executed on clusters, et cetera. So it is loosely coupled, minimally parallel. On the other hand, High Performance Computing (HPC) focuses on the efficient execution of compute-intensive, tightly coupled tasks; given the high parallel communication requirements, it usually involves low-latency interconnects that can share data very rapidly.

I made a table on the right-hand side that shows some of the differences between an HTC kind of computer and an HPC kind of computer. HTC is loosely coupled, HPC is tightly coupled. The interconnect is low-bandwidth, say 10 gigabit Ethernet, versus high-bandwidth, low-latency interconnects like InfiniBand. In terms of computational capability, HTC is a subset of HPC, in the sense that any workflow you can execute on HTC should be computable on HPC, but not vice versa. In terms of cost, HTC has a lower cost per node and hence a higher throughput per system budget: because each node is cheaper, you can have more nodes. HPC, on the other hand, is more expensive; you need to spend more money on interconnects, et cetera. For example, at one point in the SO:UK Data Centre we tried to estimate how much it would cost to build just a high-performance storage system for the capacity we need, and we would have to spend basically our entire budget on the storage system alone. So it is much more expensive that way.

In terms of parallelism, MPI is technically possible on HTC, you can run MPI jobs there, but it will not scale very well beyond a few nodes; 10 nodes, maybe 100 nodes, is probably fine. But on HPC you can scale up to 10,000 nodes without much problem.
The biggest supercomputers on Earth are HPC, basically. In terms of homogeneity, HTC is very forgiving of heterogeneous nodes, which is what we are going to have in the SO:UK Data Centre, whereas HPC is highly homogeneous. MPI support in HTC is often an afterthought; in HPC it is often first class, or maybe the only way you can launch a multi-node job. HTC is also sometimes known as grid computing; technically, the grid is a subset of HTC, and it is usually used by people like high energy physicists, for example people from CERN. Any questions so far?

So, examples of HPC and HTC in CMB data analysis. To oversimplify, to think about what kind of compute system we need to deploy our scientific application, we can think in terms of how much memory the application needs: if the amount of memory is very large, beyond a single node, then you probably need HPC. At one end of the spectrum, which is not on the diagram on the right-hand side, there is something relatively new called Cosmoglobe, which some of you already know. It is a full Bayesian analysis of CMB data, which requires you to load all the TOD in memory, including all the different frequencies. I estimated that if you want to do this kind of analysis with SO data, you would need basically the whole NERSC machine, which might barely fit all the data, or might not; you might actually need something even bigger. For other typical kinds of CMB analysis, like maximum-likelihood or destriping mapmaking, you need to load at least the per-frequency data into memory; this kind of analysis was done for Planck and will be done for SO LAT mapmaking, so these are more like HPC workloads. Naive mapmaking, sometimes called filter-and-bin or biased mapmaking, suits the HTC kind of workflow very well because of the MapReduce paradigm: each mapmaking job takes a small chunk of TOD and makes a single map out of it without involving any other data. So it is well suited to an HTC kind of workflow.
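To make the MapReduce point concrete, here is a minimal sketch of what a filter-and-bin pipeline looks like in that paradigm. The script names (make_naive_map, coadd_maps) and the observation list are hypothetical placeholders, not tools provided by the data centre.

```bash
#!/bin/bash
# Map step: one naive map per observation chunk; each task only reads its own TOD,
# so the tasks are completely independent (loosely coupled).
while read -r obs; do
    ./make_naive_map --tod "tod/${obs}.g3" --out "maps/${obs}.fits"   # hypothetical mapmaker
done < obs_list.txt

# Reduce step: co-add the per-observation maps (and their weights) into a final map.
./coadd_maps maps/*.fits --out coadded.fits                           # hypothetical co-adder
```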
The thing I want to mention on this slide is that the HTC kind of workflow has already been demonstrated in CMB, by SPT-3G. They actually released a paper, which I cite on the last slide, demonstrating that they can use the Open Science Grid (OSG) in the US, an HTC kind of resource, to complete the SPT-3G analysis. So this is something already demonstrated to work well.

The next slide I just call the NERSC problem. Most of you probably received an email early this year about the allocation of NERSC resources to the CMB community. The way I summarize the situation is this: typically, if you want to use NERSC resources, you need to make a request for some kind of allocation. But this has been done for us by someone else for about 25 years: routinely, every year, they dedicate 1% of the resources to the whole CMB community, and then, very generously, any CMB researcher who wants to use the system is allocated 1% of that whole CMB pool of resources.

This model has been working well until very recently, basically because of the so-called end of Moore's law. The chart on the right-hand side is a bit complicated, but the solid blue line is how you would expect supercomputers to grow in time, and the dotted lines tell you that Moore's law is actually slowing down. The other lines tell you how the CMB data sets are growing, and that growth has not stopped. It means the amount of data we have is increasingly outgrowing the amount of compute capability we have in supercomputers. That is the sort of problem we are facing.

For example, this year we know that about 0.1% of the whole of NERSC is allocated to the SO analysis. If you take the number of FLOPS of the whole NERSC machine and multiply it by that number, it is equivalent to a 0.1 PFLOPS machine: if you had a 0.1 PFLOPS machine running full time for a year, that is the amount of compute time you have. By my estimation, the SO:UK Data Centre will have a similar order of magnitude of PFLOPS, so it is an equally capable machine compared to that allocation. And the SO:UK Data Centre is allocated from within Blackett, which is about 10 times larger, meaning the burst throughput you can get, when you have a lot of jobs you want to finish very quickly, is about 10 times as much; at NERSC it is theoretically 1,000 times as much, here it is about 10 times as much.
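As a rough worked version of that arithmetic (the talk does not quote NERSC's total capability; the 100 PFLOPS figure below is only an assumed round number for illustration):

\[
0.1\% \times P_{\text{NERSC}} \approx 0.001 \times 100\,\text{PFLOPS} \approx 0.1\,\text{PFLOPS},
\]

i.e. the SO share is equivalent to a dedicated 0.1 PFLOPS machine running flat out for the whole year.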
The next slide I call HTC for SATs, and I actually mentioned this already: naive mapmaking is well suited to an HTC kind of workflow because of the MapReduce paradigm, and this has been demonstrated on the Open Science Grid, which is very similar to Blackett. The SO:UK Data Centre is funded to perform the SO SATs analysis, it is located within Blackett, an HTC kind of facility, and it is funded for eight years. So in this perspective it is a very stable, long-term commitment to the science readiness of the SAT analysis.

Now I am going to switch to the next section, where I am basically soft-launching the SO:UK Data Centre. What I mean is that this is the first time we are presenting how you can use our resources here.

So first, what is the SO:UK Data Centre? Physically and infrastructurally, it is located within Blackett, and it amounts to about 10% of the Blackett resources in terms of number of CPUs, and we are going to have access to most of the available resources there. In terms of interacting with the computational resources, we are unique in the sense that typical Blackett users do not use the resources like that: they submit their jobs via something called DiracUI, which we are not going to deal with. Instead, we log into certain login nodes, also called submit nodes, and use HTCondor directly. The SO:UK Data Centre documentation is written specifically for this, because no other Blackett users have been using the resources in this way.

And a little introduction to HTCondor itself: in a certain sense, you can say it is an inferior job manager compared to the ones you are already used to for launching and handling massively parallel applications, but we are not doing that for the SATs. It can be viewed as SLURM-like in many aspects: it is a job manager, you submit jobs, and it will try to allocate resources for them to run. However, other design choices in Blackett also contribute to other differences from what you might have been experiencing at NERSC, for example, and we are going to see some of them today.

So now we are going to have a live demo, which is a little bit scary, because live demos can sometimes go wrong.
So let me try. The first thing I will show you, let me just click this, is the link to the SO:UK Data Centre documentation. Some of it is very straightforward. This is the onboarding page: if you have other colleagues who want to create new user accounts, you can send them to this page first, and they can follow the example to become a user of our data centre. Then there is a quick-start page that highlights the minimal set of pages you can visit in order to quickly try to run a certain kind of job.

And on this page I try to outline the typical lifecycle of a workflow or pipeline here, and it really is that short. You need some way of defining your job configuration: at NERSC you have something called a batch script, which has some special SLURM commands at the top of your script; here we have something called a ClassAd. You write that job configuration, you submit it to the scheduler, and it will try to allocate the resources. Within that job you then need to bootstrap a certain software environment; you need to load the environment that has TOAST, for example, in order to run a mapmaker using TOAST. Then, optionally, depending on whether your application uses MPI, you can try to launch MPI applications. As I have briefly mentioned, MPI is not a first-class thing in HTCondor, so I provide a special wrapper that launches a parallel job in something called the "parallel universe" in their terminology; with that wrapper you can launch an MPI job, which we are going to see very soon. And lastly, you want to do some I/O: once you have your software and you launch an MPI application, maybe you read in some data, and then it is going to write some data. Specifically, when you are writing data here at Blackett, because it is part of the grid, there is something called the grid storage system, and we are also going to see how that works.
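A minimal sketch of what that lifecycle looks like from the submit node; the file name job.sub and the output file names below are hypothetical placeholders, and the actual ClassAd contents are what the documentation and the demo below show.

```bash
# On the submit node: hand the job configuration (ClassAd / submit description file) to the scheduler
condor_submit job.sub    # prints the cluster ID of the submitted job
condor_q                 # check whether the job is idle, running, or done

# Once the job starts, its stdout/stderr/log land in whatever files the ClassAd names,
# so you can follow them as they are written, for example:
tail -f mpi-0.out mpi.log
```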
So now it is time for the live demo; hopefully it works. I am going to switch my screen. Okay, so are you seeing my terminal? Yeah. Okay, so let's try to log into the login node first.

So right now we are at this location, the so-called vm77; this is now within Blackett. One thing I did not mention is that the documentation actually lives in a GitHub repository, and the code you see in the documentation is in files within that repository, so you can actually run that code. So the thing I am going to show you, let's see, say, the user documentation: on the website you can just click through them, but since I am within the directory I am going to navigate on my own. So now I am in the MPI application example, and here are the files.

First of all, you need something like a job configuration file, and this is it. In it you define what kind of universe you are in, the so-called parallel universe, because you want more than one node; this is how HTCondor classifies different kinds of jobs, they call them different universes. Then the executable that is going to start is this thing that I put on the submit node, or the login node, vm77: this is the wrapper script I mentioned that tries to coordinate and launch the MPI processes. The arguments of this executable have two parts, two scripts really: the first script sets up the environment, and the second script actually runs the MPI application itself, which we are going to see soon. Then we request a machine count, which is the number of nodes you get, and per node we want 16 CPUs; by the way, these are logical cores, so this corresponds to 16 logical cores here. And there are other things, for example telling it to transfer these files to the worker nodes beforehand, before the job starts. So this is the outline of how you are going to start your job.
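A minimal sketch of what such a parallel-universe submit description file can look like. This is not a copy of the file in the demo: the script names, tarball name, and output file names are hypothetical, and the example in the SO:UK Data Centre documentation is the one to follow.

```
# mpi.sub (hypothetical name): a 2 x 16-core MPI job in HTCondor's parallel universe
universe              = parallel
executable            = mpi_wrapper.sh            # the wrapper script on the submit node (name assumed)
arguments             = env.sh run.sh             # 1) bootstrap the environment, 2) run the MPI application
machine_count         = 2                         # the "machines" requested (see the Q&A below on what this really means)
request_cpus          = 16                        # logical cores per machine
should_transfer_files = YES
transfer_input_files  = env.sh, run.sh, my-env.tar.gz
output                = mpi-$(Node).out
error                 = mpi-$(Node).err
log                   = mpi.log
queue
```

You would then hand this to the scheduler with condor_submit and watch it with condor_q, as sketched earlier.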
Within that job configuration file, we said there are two different scripts. One of them is called env.sh, which sets up the environment, so let's see what it is; actually, we can do something like this, so this is env.sh on the right-hand side. You basically see that there was a file, the one that the job configuration file tells it to transfer to the worker nodes, and over here this script that prepares your environment first uncompresses that tarball to a certain location, which is this line over here, and then it activates your environment, so to speak. After this line you have access to some executables; for the environment I am providing here, you have Python and you have mpirun, so when you run "which python" and "which mpirun" you see the feedback there, like where that Python is on that path.

The next part is that once this environment is set up and your job actually starts, what it does is use this bash function that I provide to set up your so-called OpenMPI hosts first, and then, with one single line, you start the MPI application like this: mpirun, then you specify the hosts via this environment variable prepared by the bash function I provided, and then Python. In this case it is very simple; it is just doing a sort of hello world to test your job.
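A minimal sketch of what those two scripts can look like. The tarball name, the unpack location, and the helper-function and variable names (setup_openmpi_hosts, OPENMPI_HOSTS) are hypothetical; the real wrapper shipped by the data centre defines its own names, and it also sets the threading environment variable for you.

```bash
#!/bin/bash
# env.sh (sketch): bootstrap the software environment on each worker node
set -e
mkdir -p "$_CONDOR_SCRATCH_DIR/env"                    # per-job scratch directory provided by HTCondor
tar -xzf my-env.tar.gz -C "$_CONDOR_SCRATCH_DIR/env"   # uncompress the transferred tarball
source "$_CONDOR_SCRATCH_DIR/env/bin/activate"         # "activate" the unpacked environment
which python
which mpirun                                           # both should now resolve inside the unpacked environment
```

```bash
#!/bin/bash
# run.sh (sketch): launch the MPI application across the allocated machines
setup_openmpi_hosts                                    # hypothetical helper that fills $OPENMPI_HOSTS
export OMP_NUM_THREADS=16                              # one thread per requested logical core in each MPI process
mpirun --host "$OPENMPI_HOSTS" python hello_mpi.py     # hello_mpi.py: a trivial mpi4py "hello world"
```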
So over here I can probably try to demonstrate it. I have just submitted a job; if you follow through my example over there, you will have a more detailed explanation of what this is doing, but basically the job has been submitted and it is currently still in the queue, so nothing is happening. When I run this tail command, it aggressively prints things out as soon as they start to come to life. So right now the job has just started, and you can see in mpi-0.out that the first physical node is running the env.sh script over there and starting to unarchive your environment, and the first process of the second physical node is doing the same thing. You wait a minute, maybe 50 seconds, and then it tells you that, okay, Python is now available at this path.

So right now the job has started; let's go back a little bit to see what happened. The Python is available over here, mpirun is available over there, and then this mpi.log file tells you that these processes have actually been launched over there already. So these different files are telling you different things. mpi-0.out is really the standard output from your MPI application; now it is telling you "hello world, I am from this processor on this node", et cetera, and then it ends. mpi.log, which is what HTCondor tells you, says that these nodes exited over there and it is terminated, and over here that the job terminated of its own accord, so there is no failure. So that is the end, and now I can just stop this process. So this is a successful live demo of the MPI application. Any questions so far?

- I would have one.
- Yes.
- Oh, sorry, Andrew was first.
- Yeah. I mean, I lost the forest for the trees a bit. I am suddenly confused about communication within a node versus between nodes. Is MPI just agnostic to that, the way it is running? Or is MPI only being used between nodes? I am not sure if my questions are even making a lot of sense.
- Okay, so I will try to explain it. Let me go back to the slide.
- Can we know whether we are communicating only quickly within a single node, or slowly between nodes?
- Yeah, I need to explain a little more in the terminal. So let me go back to this job configuration file. Are you seeing the file?
- Yeah.
- Yeah, you are seeing the file. So this is the job configuration file. I did not really explain it over here, but when we say machine count here, it is a little bit confusing, because we are not actually requesting a number of machines. Machine count here is more like MPI processes: when I request machine count equal to 2, it means I am specifically launching two MPI processes, and when I say request CPUs equal to 16, it means that per MPI, or HTCondor, process you have 16 logical CPUs there. So one thing I was trying to explain is that those two so-called machines can actually land on the same physical node.
This is part of the confusion, and it can sometimes lead to some very confusing errors when you are trying to debug a program, so it is something to bear in mind. Now, once we understand that: if you use my wrapper script to launch MPI processes, the number of MPI processes you have is exactly two. If you read through the documentation in that script, there is also another mode in which you can launch MPI processes, which I am not talking about here. So we only have two MPI processes, and when you see the "hello world from process one out of two", that is the first MPI process. There is no thread-level parallelism so far in what I have demonstrated, which means you are free to do hybrid parallelism: you can use MPI, sometimes called MPI plus X, where X is often OpenMP if you are not doing something fancy. So MPI plus OpenMP: now you have two MPI processes, and within each process you can have multithreading. For example, TOAST mapmaking by default is already using OpenMP, and there is an environment variable, also set by my wrapper script, that tells TOAST how many threads it should use within an MPI process. So in short, to answer the question, there is no multithreading communication in what I have shown in that demonstration; it is all different MPI processes doing a hello world and nothing else.

- I guess I was kind of asking the opposite question. Should you therefore assume that all MPI communication is slow, because it might be on different nodes, and because this is only HTC and not HPC, that it can be very slow?

- You can say so, yes. And that is kind of my recommended way of setting things up: if you are requesting two so-called machines that happen to land on one physical node, then what I would do instead is just ask for one single machine with a larger number of CPUs, such that it is guaranteed to land within a physical node, so that communication is fast, because at most it is going across the socket but not over the network.
The network, by the way, has other complications, because I mentioned that it is 10 GbE, but which switch are the nodes landing on? Because this is grid-like infrastructure, it is not optimized in that way, so the network topology can be complicated, and there is no guarantee that both machines will land on the same switch. So I would say try to avoid this inter-node network communication as best as you can. But if you need the memory, then, as I was saying, MPI is usually needed because you need more memory than is available in one single node, and if you need more memory, then of course you need to parallelize across nodes carefully.

Okay, any other questions?

- Yeah, I would have one. Oh, sorry. Yes, yes. Just very practical: you told us that there is a tarball that contains, sort of, the programming environment, and I was not very sure what that means. Concretely, I would like to know: if you have the classical programming environment, let's say at NERSC, where you have whatever you need, maybe your conda environment, maybe modules and so on, and then you also have scripts that you want to launch from a certain directory, is that all comprised in that tarball, or are your scripts still in your directory? And then a second, related question: if you have large input files, would they comprise a third set of inputs that you specify to import from somewhere else? So what is the tarball doing, and where do you get large input files from? How do you tell your launcher?

- Yeah, so that was a very simple demo, and I did not touch these issues. I have actually written them down; that is what I am showing on the screen right now. There is a page on software deployment, and there is another page over here on reading and writing data. You can think of it this way: when you are preparing a node to do something, you need to read from somewhere to load the software, which is the software deployment page over here, and there are also input data and output data. This is more complicated than what you would be doing at NERSC, basically because at NERSC everything is in a file system already: it is mounted on your login node, it is also mounted on the compute nodes, and it is totally transparent.
The only thing you need to pass around is a file path, and you do not need to worry about whether that file will exist on the compute node. But on a grid system like this one, each so-called compute node or worker node is in a blank state: it does not have your home directory, for example, and it does not see the things you wrote on your login node, the vm77 there. So the software environment you prepare over there will not work, and that is why we need to think about how software can be deployed, which is the centre of this page. There are basically at least three different methods. You can basically ignore the last method, containers, which is not supported by us because of, you might say, a deficiency in HTCondor. So the two really de facto standard ways to deploy software on a computing grid like this are either a tarball or CVMFS, and CVMFS is not set up yet for our so-called virtual organization, which means that right now the tarball method is the only way we can do it. There is a software package that I provide which includes all the dependencies and sotodlib; it also includes SPT-3G, I think, and what else, I put the list there. But of course it does not include everything; for example, the so-called NaMaster is currently not there yet. So on the tarball-method page, at the end of that page, I also mention how you can tailor your own tarball, basically, because whatever you need might not be provided by me: you may need to add something, or you may want a newer version of TOAST, let's say. This is still sort of internal and not very clear right now, but it will improve over time.

And by the way, yeah, I should address that. So the next question is whether we have time for one more demo, which would answer your next question about doing I/O from the nodes. Do you want to have one more demo, or maybe I can just read through it?

- Yeah, I think probably don't do the other demo, Kolen, because we only have a few minutes until a kind of hard deadline.

- Yeah, so we will just read through it, which is also safer for me.
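For the "tailor your own tarball" point above, a minimal sketch of one common way to build such a relocatable environment, using conda-pack. This assumes a conda-based environment and is only an illustration: the environment name, package list, and tarball name are hypothetical, and the instructions on the tarball-method page take precedence.

```bash
# On a machine where you can build the environment (for example the login node):
conda create -n soconda -c conda-forge python=3.10 toast   # hypothetical environment and package list
conda install -n soconda -c conda-forge conda-pack
conda pack -n soconda -o my-env.tar.gz    # produce a relocatable tarball of the whole environment

# Inside the job, after untarring and activating (cf. the env.sh sketch earlier), run:
#   conda-unpack     # fixes up the hard-coded prefixes inside the unpacked environment
```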
The next demo is more complicated, so instead we will talk through how to read and write data there. By the way, this section of the documentation is very long, because in a certain sense it is more complicated. There are three big sections, if you look at the table of contents on the left-hand side. One is just transferring files via the so-called ClassAd, the job configuration file, using HTCondor. This is the simplest one but also the most limited in terms of capability, and it is something we are not recommending: the submit node is very small, we only have 200 GB of disk space, and if you want to use this, you basically need to assume everything is on your submit node, either to read from or to transfer to. So it is not a recommended method and I will skip it here, but the example is there and you can try it. The next method, the so-called grid storage system, is the de facto standard that the grid is committed to, so this is the main thing. And lastly, there is something we are not going to mention today, the Librarian. I think we all know that we are going to load the TOD from the Librarian, but that is more of a special case of loading data, so I am not going to talk about it right now; also, it has not been set up yet.

In terms of the grid storage system, the documentation goes through it in detail; you can see there are a lot of sub-pages within this, and many of them are about setting things up. Fortunately, if you are using vm77 it is easier. With the setup we have already created, the thing you need to do starts from here: you need your grid certificate, and you put it in a specific location, which is this path, basically. Once you have that certificate over there, the tool set you are going to use down there will authenticate your jobs via your user certificate. You also need to create something periodically: when you run this command, it creates something called the AC, the attribute certificate, which is temporary. Your user certificate is a permanent thing that is tied to you, and using your certificate this creates something temporary that you can use to submit a job. Once you run this command, the AC is going to be valid for one week, which means you need to run it once every week, say every Monday morning.
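A minimal sketch of what that weekly step typically looks like with standard grid tooling. The command, the VO name placeholder, and the certificate paths are assumptions based on common voms-proxy-init usage, not a quote from the SO:UK documentation, which should be followed as written.

```bash
# One-off: put your grid certificate in the standard location (paths assumed):
#   ~/.globus/usercert.pem
#   ~/.globus/userkey.pem      (readable only by you)

# Weekly: create a VOMS proxy / attribute certificate valid for 7 days
voms-proxy-init --voms <your-VO> --valid 168:00   # <your-VO>: the SO:UK virtual organization name
voms-proxy-info --all                             # check the proxy and when it expires
```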
And there is a sort of demo telling you that if you run this command, this is what you are going to see: it will say that a proxy has been created over here, and that the proxy is going to be valid until that time. Once you have that file, you can then start your job; this part, I think, you can ignore. So this is a job demo. Basically, once you have your AC, you have a job configuration file like this, which transfers your AC to the worker nodes, and then in your job script on the worker nodes there is one single command you need to run, exporting this environment variable. After this, your worker nodes will be able to read from and write to the so-called grid storage system. This script is basically a demonstration of that: you can use a special kind of list command, the so-called gfal-ls, to see what is available within your grid storage system; then, on this line, I try to make a directory first, and on the next line I remove that directory again; and then I create a file and use this command to copy that file to your grid storage system. So this is the example of how you can interact with the grid storage system.
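A minimal sketch of those grid-storage operations. The environment variable is assumed to be the standard X509_USER_PROXY, and the endpoint URL and paths are hypothetical placeholders; the job script in the documentation shows the real endpoint for our storage.

```bash
# Inside the job script on the worker node, after the proxy has been transferred along with the job:
export X509_USER_PROXY="$PWD/x509up"    # point the grid tools at the transferred proxy (variable name assumed)

ENDPOINT="root://some.grid.endpoint//souk/user/$USER"   # hypothetical storage URL

gfal-ls    "$ENDPOINT"                    # list what is in the grid storage
gfal-mkdir "$ENDPOINT/demo"               # make a directory...
gfal-rm -r "$ENDPOINT/demo"               # ...and remove it again
echo "hello grid" > hello.txt
gfal-copy hello.txt "$ENDPOINT/hello.txt" # copy a local file into the grid storage system
```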
For the bulk of the data we are going to have, about a hundred terabytes right now, this is going to be the only way to access it. In the long run, by the way, there will be another system that will make something like this easier, but in the interim, say for the next few months, we need to do it like this. Okay, so I have more documentation over there which you can read on your own, but I will go back to the presentation. So this was the live demo. Any other questions so far?

Okay, so the last big section is about things still to be explored, the "how" over there, the future aspects of things. This slide is basically about how to design a workflow for this system, and I try to say something about effectiveness, which you can think about in two different ways. One is the amount of compute time: how much of your fair share you have used up from your allocation on the system, like NERSC hours, something like that. The other is the turnaround time: once I submit a job, how quickly can I get the result? This is what I define to be the effectiveness of a workflow. A more capable system is more lenient towards sub-optimality: if you write a workflow that is not very optimal but you know it works, a more capable system is more forgiving of that. Meaning that if you want to adapt a workflow from NERSC to the SO:UK Data Centre, you might see bottlenecks that you have not been thinking about.

One example I give here is mapmaking. Let's say I have 5,000 hours of observation, and each mapmaking job ingests one single hour of observation and gives you one single map, so now I have 5,000 jobs. You can launch a very wide MPI application with 5,000 processes, and it will give you the answer in a very short amount of time, say three minutes. But this is not a very good use of NERSC resources, because now you have a couple of other problems: load balancing, i.e. do all the mapmaking tasks finish at the same time, and also turnaround time, because you are asking for a very wide job, so how soon will the scheduler be able to put your job through the queue? When you adapt that kind of workflow to Blackett, it will be even worse: if you try to start 5,000 processes in one job, I do not know how long it would take to start. But if you break it down into 5,000 separate jobs, then basically most of them will start instantly.
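A sketch of what that "many small independent jobs" pattern looks like in an HTCondor submit description file, using the queue ... from syntax to create one job per observation. The script name and the list file are hypothetical.

```
# naive_maps.sub (hypothetical): one independent mapmaking job per line of obs_list.txt
executable   = make_naive_map.sh
arguments    = $(obs)                 # each job gets one observation ID
request_cpus = 4
output       = logs/$(obs).out
error        = logs/$(obs).err
log          = naive_maps.log
queue obs from obs_list.txt
```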
So in this respect, if you tune your workflow specifically for Blackett, you can actually get a faster turnaround time, at least in some situations, because at NERSC, unless you use something called the debug partition, which is not supposed to be used for production, the typical turnaround time in the production queue is at least three days. So a map that finishes in three minutes will most probably come back much faster over here.

The recommended workflow, then, is to launch more, smaller jobs. Each job is an atomic job, and by atomic I mean that a single job does not communicate with any other job, kind of like the mapmaking I was talking about: you are only doing naive mapmaking, the processes are not communicating, each of them reads its own TOD and writes its own maps, so each of them is one atomic job. If you factor your workflow that way, it will actually help you get through the queue faster.

But then you have a complication, because now you need to launch tens of thousands of jobs. How do you handle that? That is where a workflow manager is going to help you. I cannot go into more detail here, but I would discourage, which is also what NERSC discourages, rolling your own workflow manager, basically because workflow managers are trying to solve exactly these issues. For example, is it job-manager agnostic, or does your workflow manager assume the presence of SLURM? How are job dependencies handled: if, after the mapmaking, you have other things that depend on the finished maps, how are those launched later on, or do you need to babysit your jobs and launch them manually? And how are failed jobs handled: do you need to check for corrupted maps from the individual outputs, et cetera?
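As one concrete illustration of how job dependencies can be handled on an HTCondor system, here is a minimal DAGMan sketch. The talk does not name a specific workflow manager and this is not a data-centre recommendation; the submit file names and observation IDs are hypothetical.

```
# pipeline.dag (hypothetical): per-observation mapmaking jobs followed by a co-add, with retries
JOB  map001  make_map.sub
VARS map001  obs="obs001"
JOB  map002  make_map.sub
VARS map002  obs="obs002"
JOB  coadd   coadd.sub
PARENT map001 map002 CHILD coadd
RETRY map001 2
RETRY map002 2
# In practice the JOB/VARS lines for thousands of observations would be generated by a small script,
# and the whole DAG is submitted with:  condor_submit_dag pipeline.dag
```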
One caveat I want to mention here: while filter-and-bin mapmaking is well suited to the MapReduce paradigm, we need to think about how we treat the data from each map step, map in the MapReduce sense, not necessarily the CMB maps. Meaning, for each function, once you have its product, do you write it to disk or not? If you write it to disk, then you need to deal with two extra problems. One is data explosion: you have a lot of intermediate data that you need to write down. The other is that you are congesting the interconnects, because once you start writing a file, you actually have to go over the network and write it somewhere else, so the network condition, the links of the network, become more apparent. So yeah, that is the thing I wanted to mention.

And the last slide, sorry, yeah, this is the last slide: documentation. I already mentioned that the documentation site is available here; I will also copy and paste it in the chat very soon. And how can you use the documentation? I documented a few ways. Of course, you can search it: everything is indexed and searchable over there, which is very useful. You can also download it in different formats, like PDF, EPUB, man page, or single-page HTML. How would those formats be useful? You can actually try to feed them into a large language model and start to chat with it: for example, if you give the single-page HTML to ChatGPT, you can ask it to summarize what is in the documentation and then start chatting with the document, "tell me how to run an MPI application according to the documentation", something like that, and whatever questions you have, you can try asking it. But of course, sometimes it might hallucinate the answer. Anyway, these are some of the ways you can use the documentation. You can also start to collaborate and discuss issues: there is a GitHub repository, you can raise issues, and we already have a lot of them here, and there are also GitHub discussions, et cetera. Or you can even contribute: on the onboarding page I mention only the UK certificate, so if you have a UK certificate you can use the URL over there, but if you know the other ones, you can try to add them to the document and submit a pull request.

Okay, so that is all for the introduction of our data centre. Thank you.

- Cool, that's brilliant. Thanks, Kolen. Right, we are close to five o'clock. Does anyone have any quick questions for Kolen? Okay, so we are close to five o'clock. Thank you.