Okay, Kolen, please take it away.

Hello... oh, wait a second. Sorry. Can you hear me? Yeah, yeah. Okay, I will just do it like this. Can you all see the slides? Yeah.

Okay, so I will start with the big picture. Basically, this is an introduction to the SO:UK Data Centre, and I will start with the big picture.

First of all, what are HPC and HTC? Let's define this terminology first. According to the so-called European Grid Infrastructure (EGI), High Throughput Computing (HTC) is a computing paradigm that focuses on the efficient execution of a large number of loosely coupled tasks. Given the minimal parallel communication requirements, the tasks can be executed on clusters, et cetera. So it is loosely coupled, minimally parallel. On the other hand, High Performance Computing (HPC) focuses on the efficient execution of compute-intensive, tightly coupled tasks; given the high parallel communication requirements, it usually involves low-latency interconnects that can share data very rapidly.

I made a table on the right-hand side that shows some of the differences between an HTC kind of computer and an HPC kind of computer. HTC is loosely coupled, HPC is tightly coupled. The interconnect is low-bandwidth, say 10 gigabit Ethernet, versus high-bandwidth, low-latency interconnects like InfiniBand. In terms of computational capability, HTC is a subset of HPC, in the sense that any workflow you can execute on HTC should be computable on HPC, but not vice versa. In terms of cost, HTC has a lower cost per node and hence a higher throughput per system budget: because each node is cheaper, you can have more nodes. HPC, on the other hand, is more expensive; you need to spend more money on interconnects, et cetera. For example, at one point in the SO:UK Data Centre we tried to estimate how much it would cost to build just a high-performance storage system for the capacity we need, and we would have to spend basically our entire budget on the storage system alone. So it is much more expensive that way.

In terms of parallelism, MPI is technically possible on HTC, you can run MPI jobs there, but it will not scale very well beyond a few nodes; 10 nodes, maybe 100 nodes, is probably fine. But on HPC you can scale up to 10,000 nodes without much problem.
The biggest supercomputers on Earth are HPC, basically. In terms of homogeneity, HTC is very forgiving of heterogeneous nodes, which is what we are going to have in the SO:UK Data Centre, whereas HPC is highly homogeneous. MPI support in HTC is often an afterthought; in HPC it is often first class, or maybe the only way you can launch a multi-node job. HTC is also sometimes known as grid computing; technically, the grid is a subset of HTC, and it is usually used by people like high energy physicists, for example people from CERN. Any questions so far?

So, examples of HPC and HTC in CMB data analysis. To oversimplify, to think about what kind of compute system we need to deploy our scientific application, we can think in terms of how much memory the application needs: if the amount of memory is very large, beyond a single node, then you probably need HPC. At one end of the spectrum, which is not on the diagram on the right-hand side, there is something relatively new called Cosmoglobe, which some of you already know. It is a full Bayesian analysis of CMB data, which requires you to load all the TOD in memory, including all the different frequencies. I estimated that if you want to do this kind of analysis with SO data, you would need basically the whole NERSC machine, which might barely fit all the data, or might not; you might actually need something even bigger. For other typical kinds of CMB analysis, like maximum-likelihood or destriping mapmaking, you need to load at least the per-frequency data into memory; this kind of analysis was done for Planck and will be done for SO LAT mapmaking, so these are more like HPC workloads. Naive mapmaking, sometimes called filter-and-bin or biased mapmaking, suits the HTC kind of workflow very well because of the MapReduce paradigm: each mapmaking job takes a small chunk of TOD and makes a single map out of it without involving any other data. So it is well suited to an HTC kind of workflow.
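To make the MapReduce point concrete, here is a minimal sketch of what a filter-and-bin pipeline looks like in that paradigm. The script names (make_naive_map, coadd_maps) and the observation list are hypothetical placeholders, not tools provided by the data centre.

```bash
#!/bin/bash
# Map step: one naive map per observation chunk; each task only reads its own TOD,
# so the tasks are completely independent (loosely coupled).
while read -r obs; do
    ./make_naive_map --tod "tod/${obs}.g3" --out "maps/${obs}.fits"   # hypothetical mapmaker
done < obs_list.txt

# Reduce step: co-add the per-observation maps (and their weights) into a final map.
./coadd_maps maps/*.fits --out coadded.fits                           # hypothetical co-adder
```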
The thing I want to mention on this slide is that the HTC kind of workflow has already been demonstrated in CMB, by SPT-3G. They actually released a paper, which I cite on the last slide, demonstrating that they can use the Open Science Grid (OSG) in the US, an HTC kind of resource, to complete the SPT-3G analysis. So this is something already demonstrated to work well.

The next slide I just call the NERSC problem. Most of you probably received an email early this year about the allocation of NERSC resources to the CMB community. The way I summarize the situation is this: typically, if you want to use NERSC resources, you need to make a request for some kind of allocation. But this has been done for us by someone else for about 25 years: routinely, every year, they dedicate 1% of the resources to the whole CMB community, and then, very generously, any CMB researcher who wants to use the system is allocated 1% of that whole CMB pool of resources.

This model has been working well until very recently, basically because of the so-called end of Moore's law. The chart on the right-hand side is a bit complicated, but the solid blue line is how you would expect supercomputers to grow in time, and the dotted lines tell you that Moore's law is actually slowing down. The other lines tell you how the CMB data sets are growing, and that growth has not stopped. It means the amount of data we have is increasingly outgrowing the amount of compute capability we have in supercomputers. That is the sort of problem we are facing.

For example, this year we know that about 0.1% of the whole of NERSC is allocated to the SO analysis. If you take the number of FLOPS of the whole NERSC machine and multiply it by that number, it is equivalent to a 0.1 PFLOPS machine: if you had a 0.1 PFLOPS machine running full time for a year, that is the amount of compute time you have. By my estimation, the SO:UK Data Centre will have a similar order of magnitude of PFLOPS, so it is an equally capable machine compared to that allocation. And the SO:UK Data Centre is allocated from within Blackett, which is about 10 times larger, meaning the burst throughput you can get, when you have a lot of jobs you want to finish very quickly, is about 10 times as much; at NERSC it is theoretically 1,000 times as much, here it is about 10 times as much.
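As a rough worked version of that arithmetic (the talk does not quote NERSC's total capability; the 100 PFLOPS figure below is only an assumed round number for illustration):

\[
0.1\% \times P_{\text{NERSC}} \approx 0.001 \times 100\,\text{PFLOPS} \approx 0.1\,\text{PFLOPS},
\]

i.e. the SO share is equivalent to a dedicated 0.1 PFLOPS machine running flat out for the whole year.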
The next slide I call HTC for SATs, and I actually mentioned this already: naive mapmaking is well suited to an HTC kind of workflow because of the MapReduce paradigm, and this has been demonstrated on the Open Science Grid, which is very similar to Blackett. The SO:UK Data Centre is funded to perform the SO SATs analysis, it is located within Blackett, an HTC kind of facility, and it is funded for eight years. So in this perspective it is a very stable, long-term commitment to the science readiness of the SAT analysis.

Now I am going to switch to the next section, where I am basically soft-launching the SO:UK Data Centre. What I mean is that this is the first time we are presenting how you can use our resources here.

So first, what is the SO:UK Data Centre? Physically and infrastructurally, it is located within Blackett, and it amounts to about 10% of the Blackett resources in terms of number of CPUs, and we are going to have access to most of the available resources there. In terms of interacting with the computational resources, we are unique in the sense that typical Blackett users do not use the resources like that: they submit their jobs via something called DiracUI, which we are not going to deal with. Instead, we log into certain login nodes, also called submit nodes, and use HTCondor directly. The SO:UK Data Centre documentation is written specifically for this, because no other Blackett users have been using the resources in this way.

And a little introduction to HTCondor itself: in a certain sense, you can say it is an inferior job manager compared to the ones you are already used to for launching and handling massively parallel applications, but we are not doing that for the SATs. It can be viewed as SLURM-like in many aspects: it is a job manager, you submit jobs, and it will try to allocate resources for them to run. However, other design choices in Blackett also contribute to other differences from what you might have been experiencing at NERSC, for example, and we are going to see some of them today.

So now we are going to have a live demo, which is a little bit scary, because live demos can sometimes go wrong.
So let me try. The first thing I will show you, let me just click this, is the link to the SO:UK Data Centre documentation. Some of it is very straightforward. This is the onboarding page: if you have other colleagues who want to create new user accounts, you can send them to this page first, and they can follow the example to become a user of our data centre. Then there is a quick-start page that highlights the minimal set of pages you can visit in order to quickly try to run a certain kind of job.

And on this page I try to outline the typical lifecycle of a workflow or pipeline here, and it really is that short. You need some way of defining your job configuration: at NERSC you have something called a batch script, which has some special SLURM commands at the top of your script; here we have something called a ClassAd. You write that job configuration, you submit it to the scheduler, and it will try to allocate the resources. Within that job you then need to bootstrap a certain software environment; you need to load the environment that has TOAST, for example, in order to run a mapmaker using TOAST. Then, optionally, depending on whether your application uses MPI, you can try to launch MPI applications. As I have briefly mentioned, MPI is not a first-class thing in HTCondor, so I provide a special wrapper that launches a parallel job in something called the "parallel universe" in their terminology; with that wrapper you can launch an MPI job, which we are going to see very soon. And lastly, you want to do some I/O: once you have your software and you launch an MPI application, maybe you read in some data, and then it is going to write some data. Specifically, when you are writing data here at Blackett, because it is part of the grid, there is something called the grid storage system, and we are also going to see how that works.
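A minimal sketch of what that lifecycle looks like from the submit node; the file name job.sub and the output file names below are hypothetical placeholders, and the actual ClassAd contents are what the documentation and the demo below show.

```bash
# On the submit node: hand the job configuration (ClassAd / submit description file) to the scheduler
condor_submit job.sub    # prints the cluster ID of the submitted job
condor_q                 # check whether the job is idle, running, or done

# Once the job starts, its stdout/stderr/log land in whatever files the ClassAd names,
# so you can follow them as they are written, for example:
tail -f mpi-0.out mpi.log
```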
So now it is time for the live demo; hopefully it works. I am going to switch my screen. Okay, so are you seeing my terminal? Yeah. Okay, so let's try to log into the login node first.

So right now we are at this location, the so-called vm77; this is now within Blackett. One thing I did not mention is that the documentation actually lives in a GitHub repository, and the code you see in the documentation is in files within that repository, so you can actually run that code. So the thing I am going to show you, let's see, say, the user documentation: on the website you can just click through them, but since I am within the directory I am going to navigate on my own. So now I am in the MPI application example, and here are the files.

First of all, you need something like a job configuration file, and this is it. In it you define what kind of universe you are in, the so-called parallel universe, because you want more than one node; this is how HTCondor classifies different kinds of jobs, they call them different universes. Then the executable that is going to start is this thing that I put on the submit node, or the login node, vm77: this is the wrapper script I mentioned that tries to coordinate and launch the MPI processes. The arguments of this executable have two parts, two scripts really: the first script sets up the environment, and the second script actually runs the MPI application itself, which we are going to see soon. Then we request a machine count, which is the number of nodes you get, and per node we want 16 CPUs; by the way, these are logical cores, so this corresponds to 16 logical cores here. And there are other things, for example telling it to transfer these files to the worker nodes beforehand, before the job starts. So this is the outline of how you are going to start your job.
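A minimal sketch of what such a parallel-universe submit description file can look like. This is not a copy of the file in the demo: the script names, tarball name, and output file names are hypothetical, and the example in the SO:UK Data Centre documentation is the one to follow.

```
# mpi.sub (hypothetical name): a 2 x 16-core MPI job in HTCondor's parallel universe
universe              = parallel
executable            = mpi_wrapper.sh            # the wrapper script on the submit node (name assumed)
arguments             = env.sh run.sh             # 1) bootstrap the environment, 2) run the MPI application
machine_count         = 2                         # the "machines" requested (see the Q&A below on what this really means)
request_cpus          = 16                        # logical cores per machine
should_transfer_files = YES
transfer_input_files  = env.sh, run.sh, my-env.tar.gz
output                = mpi-$(Node).out
error                 = mpi-$(Node).err
log                   = mpi.log
queue
```

You would then hand this to the scheduler with condor_submit and watch it with condor_q, as sketched earlier.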
Within that job configuration file, we said there are two different scripts. One of them is called env.sh, which sets up the environment, so let's see what it is; actually, we can do something like this, so this is env.sh on the right-hand side. You basically see that there was a file, the one that the job configuration file tells it to transfer to the worker nodes, and over here this script that prepares your environment first uncompresses that tarball to a certain location, which is this line over here, and then it activates your environment, so to speak. After this line you have access to some executables; for the environment I am providing here, you have Python and you have mpirun, so when you run "which python" and "which mpirun" you see the feedback there, like where that Python is on that path.

The next part is that once this environment is set up and your job actually starts, what it does is use this bash function that I provide to set up your so-called OpenMPI hosts first, and then, with one single line, you start the MPI application like this: mpirun, then you specify the hosts via this environment variable prepared by the bash function I provided, and then Python. In this case it is very simple; it is just doing a sort of hello world to test your job.
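A minimal sketch of what those two scripts can look like. The tarball name, the unpack location, and the helper-function and variable names (setup_openmpi_hosts, OPENMPI_HOSTS) are hypothetical; the real wrapper shipped by the data centre defines its own names, and it also sets the threading environment variable for you.

```bash
#!/bin/bash
# env.sh (sketch): bootstrap the software environment on each worker node
set -e
mkdir -p "$_CONDOR_SCRATCH_DIR/env"                    # per-job scratch directory provided by HTCondor
tar -xzf my-env.tar.gz -C "$_CONDOR_SCRATCH_DIR/env"   # uncompress the transferred tarball
source "$_CONDOR_SCRATCH_DIR/env/bin/activate"         # "activate" the unpacked environment
which python
which mpirun                                           # both should now resolve inside the unpacked environment
```

```bash
#!/bin/bash
# run.sh (sketch): launch the MPI application across the allocated machines
setup_openmpi_hosts                                    # hypothetical helper that fills $OPENMPI_HOSTS
export OMP_NUM_THREADS=16                              # one thread per requested logical core in each MPI process
mpirun --host "$OPENMPI_HOSTS" python hello_mpi.py     # hello_mpi.py: a trivial mpi4py "hello world"
```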
So over here I can probably try to demonstrate it. I have just submitted a job; if you follow through my example over there, you will have a more detailed explanation of what this is doing, but basically the job has been submitted and it is currently still in the queue, so nothing is happening. When I run this tail command, it aggressively prints things out as soon as they start to come to life. So right now the job has just started, and you can see in mpi-0.out that the first physical node is running the env.sh script over there and starting to unarchive your environment, and the first process of the second physical node is doing the same thing. You wait a minute, maybe 50 seconds, and then it tells you that, okay, Python is now available at this path.

So right now the job has started; let's go back a little bit to see what happened. The Python is available over here, mpirun is available over there, and then this mpi.log file tells you that these processes have actually been launched over there already. So these different files are telling you different things. mpi-0.out is really the standard output from your MPI application; now it is telling you "hello world, I am from this processor on this node", et cetera, and then it ends. mpi.log, which is what HTCondor tells you, says that these nodes exited over there and it is terminated, and over here that the job terminated of its own accord, so there is no failure. So that is the end, and now I can just stop this process. So this is a successful live demo of the MPI application. Any questions so far?

- I would have one.
- Yes.
- Oh, sorry, Andrew was first.
- Yeah. I mean, I lost the forest for the trees a bit. I am suddenly confused about communication within a node versus between nodes. Is MPI just agnostic to that, the way it is running? Or is MPI only being used between nodes? I am not sure if my questions are even making a lot of sense.
- Okay, so I will try to explain it. Let me go back to the slide.
- Can we know whether we are communicating only quickly within a single node, or slowly between nodes?
- Yeah, I need to explain a little more in the terminal. So let me go back to this job configuration file. Are you seeing the file?
- Yeah.
- Yeah, you are seeing the file. So this is the job configuration file. I did not really explain it over here, but when we say machine count here, it is a little bit confusing, because we are not actually requesting a number of machines. Machine count here is more like MPI processes: when I request machine count equal to 2, it means I am specifically launching two MPI processes, and when I say request CPUs equal to 16, it means that per MPI, or HTCondor, process you have 16 logical CPUs there. So one thing I was trying to explain is that those two so-called machines can actually land on the same physical node.
This is part of the confusion, and it can sometimes lead to some very confusing errors when you are trying to debug a program, so it is something to bear in mind. Now, once we understand that: if you use my wrapper script to launch MPI processes, the number of MPI processes you have is exactly two. If you read through the documentation in that script, there is also another mode in which you can launch MPI processes, which I am not talking about here. So we only have two MPI processes, and when you see the "hello world from process one out of two", that is the first MPI process. There is no thread-level parallelism so far in what I have demonstrated, which means you are free to do hybrid parallelism: you can use MPI, sometimes called MPI plus X, where X is often OpenMP if you are not doing something fancy. So MPI plus OpenMP: now you have two MPI processes, and within each process you can have multithreading. For example, TOAST mapmaking by default is already using OpenMP, and there is an environment variable, also set by my wrapper script, that tells TOAST how many threads it should use within an MPI process. So in short, to answer the question, there is no multithreading communication in what I have shown in that demonstration; it is all different MPI processes doing a hello world and nothing else.

- I guess I was kind of asking the opposite question. Should you therefore assume that all MPI communication is slow, because it might be on different nodes, and because this is only HTC and not HPC, that it can be very slow?

- You can say so, yes. And that is kind of my recommended way of setting things up: if you are requesting two so-called machines that happen to land on one physical node, then what I would do instead is just ask for one single machine with a larger number of CPUs, such that it is guaranteed to land within a physical node, so that communication is fast, because at most it is going across the socket but not over the network.
The network, by the way, has other complications, because I mentioned that it is 10 GbE, but which switch are the nodes landing on? Because this is grid-like infrastructure, it is not optimized in that way, so the network topology can be complicated, and there is no guarantee that both machines will land on the same switch. So I would say try to avoid this inter-node network communication as best as you can. But if you need the memory, then, as I was saying, MPI is usually needed because you need more memory than is available in one single node, and if you need more memory, then of course you need to parallelize across nodes carefully.

Okay, any other questions?

- Yeah, I would have one. Oh, sorry. Yes, yes. Just very practical: you told us that there is a tarball that contains, sort of, the programming environment, and I was not very sure what that means. Concretely, I would like to know: if you have the classical programming environment, let's say at NERSC, where you have whatever you need, maybe your conda environment, maybe modules and so on, and then you also have scripts that you want to launch from a certain directory, is that all comprised in that tarball, or are your scripts still in your directory? And then a second, related question: if you have large input files, would they comprise a third set of inputs that you specify to import from somewhere else? So what is the tarball doing, and where do you get large input files from? How do you tell your launcher?

- Yeah, so that was a very simple demo, and I did not touch these issues. I have actually written them down; that is what I am showing on the screen right now. There is a page on software deployment, and there is another page over here on reading and writing data. You can think of it this way: when you are preparing a node to do something, you need to read from somewhere to load the software, which is the software deployment page over here, and there are also input data and output data. This is more complicated than what you would be doing at NERSC, basically because at NERSC everything is in a file system already: it is mounted on your login node, it is also mounted on the compute nodes, and it is totally transparent.
The only thing you need to pass around is a file path, and you do not need to worry about whether that file will exist on the compute node. But on a grid system like this one, each so-called compute node or worker node is in a blank state: it does not have your home directory, for example, and it does not see the things you wrote on your login node, the vm77 there. So the software environment you prepare over there will not work, and that is why we need to think about how software can be deployed, which is the centre of this page. There are basically at least three different methods. You can basically ignore the last method, containers, which is not supported by us because of, you might say, a deficiency in HTCondor. So the two really de facto standard ways to deploy software on a computing grid like this are either a tarball or CVMFS, and CVMFS is not set up yet for our so-called virtual organization, which means that right now the tarball method is the only way we can do it. There is a software package that I provide which includes all the dependencies and sotodlib; it also includes SPT-3G, I think, and what else, I put the list there. But of course it does not include everything; for example, the so-called NaMaster is currently not there yet. So on the tarball-method page, at the end of that page, I also mention how you can tailor your own tarball, basically, because whatever you need might not be provided by me: you may need to add something, or you may want a newer version of TOAST, let's say. This is still sort of internal and not very clear right now, but it will improve over time.

And by the way, yeah, I should address that. So the next question is whether we have time for one more demo, which would answer your next question about doing I/O from the nodes. Do you want to have one more demo, or maybe I can just read through it?

- Yeah, I think probably don't do the other demo, Kolen, because we only have a few minutes until a kind of hard deadline.

- Yeah, so we will just read through it, which is also safer for me.
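For the "tailor your own tarball" point above, a minimal sketch of one common way to build such a relocatable environment, using conda-pack. This assumes a conda-based environment and is only an illustration: the environment name, package list, and tarball name are hypothetical, and the instructions on the tarball-method page take precedence.

```bash
# On a machine where you can build the environment (for example the login node):
conda create -n soconda -c conda-forge python=3.10 toast   # hypothetical environment and package list
conda install -n soconda -c conda-forge conda-pack
conda pack -n soconda -o my-env.tar.gz    # produce a relocatable tarball of the whole environment

# Inside the job, after untarring and activating (cf. the env.sh sketch earlier), run:
#   conda-unpack     # fixes up the hard-coded prefixes inside the unpacked environment
```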
The next demo is more complicated, so instead we will talk through how to read and write data there. By the way, this section of the documentation is very long, because in a certain sense it is more complicated. There are three big sections, if you look at the table of contents on the left-hand side. One is just transferring files via the so-called ClassAd, the job configuration file, using HTCondor. This is the simplest one but also the most limited in terms of capability, and it is something we are not recommending: the submit node is very small, we only have 200 GB of disk space, and if you want to use this, you basically need to assume everything is on your submit node, either to read from or to transfer to. So it is not a recommended method and I will skip it here, but the example is there and you can try it. The next method, the so-called grid storage system, is the de facto standard that the grid is committed to, so this is the main thing. And lastly, there is something we are not going to mention today, the Librarian. I think we all know that we are going to load the TOD from the Librarian, but that is more of a special case of loading data, so I am not going to talk about it right now; also, it has not been set up yet.

In terms of the grid storage system, the documentation goes through it in detail; you can see there are a lot of sub-pages within this, and many of them are about setting things up. Fortunately, if you are using vm77 it is easier. With the setup we have already created, the thing you need to do starts from here: you need your grid certificate, and you put it in a specific location, which is this path, basically. Once you have that certificate over there, the tool set you are going to use down there will authenticate your jobs via your user certificate. You also need to create something periodically: when you run this command, it creates something called the AC, the attribute certificate, which is temporary. Your user certificate is a permanent thing that is tied to you, and using your certificate this creates something temporary that you can use to submit a job. Once you run this command, the AC is going to be valid for one week, which means you need to run it once every week, say every Monday morning.
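A minimal sketch of what that weekly step typically looks like with standard grid tooling. The command, the VO name placeholder, and the certificate paths are assumptions based on common voms-proxy-init usage, not a quote from the SO:UK documentation, which should be followed as written.

```bash
# One-off: put your grid certificate in the standard location (paths assumed):
#   ~/.globus/usercert.pem
#   ~/.globus/userkey.pem      (readable only by you)

# Weekly: create a VOMS proxy / attribute certificate valid for 7 days
voms-proxy-init --voms <your-VO> --valid 168:00   # <your-VO>: the SO:UK virtual organization name
voms-proxy-info --all                             # check the proxy and when it expires
```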
And there is a sort of demo telling you that if you run this command, this is what you are going to see: it will say that a proxy has been created over here, and that the proxy is going to be valid until that time. Once you have that file, you can then start your job; this part, I think, you can ignore. So this is a job demo. Basically, once you have your AC, you have a job configuration file like this, which transfers your AC to the worker nodes, and then in your job script on the worker nodes there is one single command you need to run, exporting this environment variable. After this, your worker nodes will be able to read from and write to the so-called grid storage system. This script is basically a demonstration of that: you can use a special kind of list command, the so-called gfal-ls, to see what is available within your grid storage system; then, on this line, I try to make a directory first, and on the next line I remove that directory again; and then I create a file and use this command to copy that file to your grid storage system. So this is the example of how you can interact with the grid storage system.
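A minimal sketch of those grid-storage operations. The environment variable is assumed to be the standard X509_USER_PROXY, and the endpoint URL and paths are hypothetical placeholders; the job script in the documentation shows the real endpoint for our storage.

```bash
# Inside the job script on the worker node, after the proxy has been transferred along with the job:
export X509_USER_PROXY="$PWD/x509up"    # point the grid tools at the transferred proxy (variable name assumed)

ENDPOINT="root://some.grid.endpoint//souk/user/$USER"   # hypothetical storage URL

gfal-ls    "$ENDPOINT"                    # list what is in the grid storage
gfal-mkdir "$ENDPOINT/demo"               # make a directory...
gfal-rm -r "$ENDPOINT/demo"               # ...and remove it again
echo "hello grid" > hello.txt
gfal-copy hello.txt "$ENDPOINT/hello.txt" # copy a local file into the grid storage system
```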
For the bulk of the data we are going to have, about a hundred terabytes right now, this is going to be the only way to access it. In the long run, by the way, there will be another system that will make something like this easier, but in the interim, say for the next few months, we need to do it like this. Okay, so I have more documentation over there which you can read on your own, but I will go back to the presentation. So this was the live demo. Any other questions so far?

Okay, so the last big section is about things still to be explored, the "how" over there, the future aspects of things. This slide is basically about how to design a workflow for this system, and I try to say something about effectiveness, which you can think about in two different ways. One is the amount of compute time: how much of your fair share you have used up from your allocation on the system, like NERSC hours, something like that. The other is the turnaround time: once I submit a job, how quickly can I get the result? This is what I define to be the effectiveness of a workflow. A more capable system is more lenient towards sub-optimality: if you write a workflow that is not very optimal but you know it works, a more capable system is more forgiving of that. Meaning that if you want to adapt a workflow from NERSC to the SO:UK Data Centre, you might see bottlenecks that you have not been thinking about.

One example I give here is mapmaking. Let's say I have 5,000 hours of observation, and each mapmaking job ingests one single hour of observation and gives you one single map, so now I have 5,000 jobs. You can launch a very wide MPI application with 5,000 processes, and it will give you the answer in a very short amount of time, say three minutes. But this is not a very good use of NERSC resources, because now you have a couple of other problems: load balancing, i.e. do all the mapmaking tasks finish at the same time, and also turnaround time, because you are asking for a very wide job, so how soon will the scheduler be able to put your job through the queue? When you adapt that kind of workflow to Blackett, it will be even worse: if you try to start 5,000 processes in one job, I do not know how long it would take to start. But if you break it down into 5,000 separate jobs, then basically most of them will start instantly.
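A sketch of what that "many small independent jobs" pattern looks like in an HTCondor submit description file, using the queue ... from syntax to create one job per observation. The script name and the list file are hypothetical.

```
# naive_maps.sub (hypothetical): one independent mapmaking job per line of obs_list.txt
executable   = make_naive_map.sh
arguments    = $(obs)                 # each job gets one observation ID
request_cpus = 4
output       = logs/$(obs).out
error        = logs/$(obs).err
log          = naive_maps.log
queue obs from obs_list.txt
```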
So in this respect, if you tune your workflow specifically for Blackett, you can actually get a faster turnaround time, at least in some situations, because at NERSC, unless you use something called the debug partition, which is not supposed to be used for production, the typical turnaround time in the production queue is at least three days. So a map that finishes in three minutes will most probably come back much faster over here.

The recommended workflow, then, is to launch more, smaller jobs. Each job is an atomic job, and by atomic I mean that a single job does not communicate with any other job, kind of like the mapmaking I was talking about: you are only doing naive mapmaking, the processes are not communicating, each of them reads its own TOD and writes its own maps, so each of them is one atomic job. If you factor your workflow that way, it will actually help you get through the queue faster.

But then you have a complication, because now you need to launch tens of thousands of jobs. How do you handle that? That is where a workflow manager is going to help you. I cannot go into more detail here, but I would discourage, which is also what NERSC discourages, rolling your own workflow manager, basically because workflow managers are trying to solve exactly these issues. For example, is it job-manager agnostic, or does your workflow manager assume the presence of SLURM? How are job dependencies handled: if, after the mapmaking, you have other things that depend on the finished maps, how are those launched later on, or do you need to babysit your jobs and launch them manually? And how are failed jobs handled: do you need to check for corrupted maps from the individual outputs, et cetera?
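As one concrete illustration of how job dependencies can be handled on an HTCondor system, here is a minimal DAGMan sketch. The talk does not name a specific workflow manager and this is not a data-centre recommendation; the submit file names and observation IDs are hypothetical.

```
# pipeline.dag (hypothetical): per-observation mapmaking jobs followed by a co-add, with retries
JOB  map001  make_map.sub
VARS map001  obs="obs001"
JOB  map002  make_map.sub
VARS map002  obs="obs002"
JOB  coadd   coadd.sub
PARENT map001 map002 CHILD coadd
RETRY map001 2
RETRY map002 2
# In practice the JOB/VARS lines for thousands of observations would be generated by a small script,
# and the whole DAG is submitted with:  condor_submit_dag pipeline.dag
```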
One caveat I want to mention here: while filter-and-bin mapmaking is well suited to the MapReduce paradigm, we need to think about how we treat the data from each map step, map in the MapReduce sense, not necessarily the CMB maps. Meaning, for each function, once you have its product, do you write it to disk or not? If you write it to disk, then you need to deal with two extra problems. One is data explosion: you have a lot of intermediate data that you need to write down. The other is that you are congesting the interconnects, because once you start writing a file, you actually have to go over the network and write it somewhere else, so the network condition, the links of the network, become more apparent. So yeah, that is the thing I wanted to mention.

And the last slide, sorry, yeah, this is the last slide: documentation. I already mentioned that the documentation site is available here; I will also copy and paste it in the chat very soon. And how can you use the documentation? I documented a few ways. Of course, you can search it: everything is indexed and searchable over there, which is very useful. You can also download it in different formats, like PDF, EPUB, man page, or single-page HTML. How would those formats be useful? You can actually try to feed them into a large language model and start to chat with it: for example, if you give the single-page HTML to ChatGPT, you can ask it to summarize what is in the documentation and then start chatting with the document, "tell me how to run an MPI application according to the documentation", something like that, and whatever questions you have, you can try asking it. But of course, sometimes it might hallucinate the answer. Anyway, these are some of the ways you can use the documentation. You can also start to collaborate and discuss issues: there is a GitHub repository, you can raise issues, and we already have a lot of them here, and there are also GitHub discussions, et cetera. Or you can even contribute: on the onboarding page I mention only the UK certificate, so if you have a UK certificate you can use the URL over there, but if you know the other ones, you can try to add them to the document and submit a pull request.

Okay, so that is all for the introduction of our data centre. Thank you.

- Cool, that's brilliant. Thanks, Kolen. Right, we are close to five o'clock. Does anyone have any quick questions for Kolen? Okay, so we are close to five o'clock. Thank you.