
Feedback on how to make custom changes to slurm config file #6648

Open
adebayoj opened this issue Jan 28, 2025 · 7 comments
@adebayoj

Hi, we are currently using AWS ParallelCluster as a Slurm cluster with a capacity reservation of 3 p4de.24xlarge instances. We've been running into some issues and couldn't find clear guidance on how to address them, so we wanted to check here. I have included our cluster config YAML file below and would appreciate feedback.

Problem 1: Custom Changes to config files

We would like to make changes to the Slurm config files to enable certain behavior. Specifically, we would like to set the following:

# temp environment changes
PrologFlags             = Alloc,Contain,X11
JobContainerType        = job_container/tmpfs

# this might help make it so that nvidia-smi is isolated
ConstrainDevices        = yes
ConstrainRAMSpace       = yes

# For OOM containment
JobAcctGatherType       = jobacct_gather/cgroup
JobAcctGatherParams     = NoOverMemoryKill

# make salloc call srun for interactive jobs
LaunchParameters        = use_interactive_step [or] use_interactive_step,enable_nss_slurm

However, we've found that we can't set these parameters through the CustomSlurmSettings option. For the tmpfs-based temporary environments, it seems we might need to create a custom job_container.conf file, but I currently see no way to do this via the config file.

Question: can we manually enable all of these options ourselves without repercussions? What would you suggest?

Problem 2: Separate partition for root (/) and how to enable usrquota.

We would like to mount root (/) on a separate file system through Lustre or something else. However, the documentation says that only a single Lustre file system can be used as part of a given installation. Secondly, we would like to constrain each user's home directory to a particular size. Can you share how we can enable this programmatically? We could do it manually as described here: http://www.yolinux.com/TUTORIALS/LinuxTutorialQuotas.html, but we are wondering if there are any other alternatives?

Thanks for the help.

@adebayoj
Author

Adding our cluster configuration file here.

Region: us-west-2
Imds:
  ImdsSupport: v2.0
Image:
  Os: ubuntu2004
  CustomAmi: ami-08f434825b73b442e
HeadNode:
  InstanceType: m5.8xlarge
  Networking:
    SubnetId: xxxx
    ElasticIp: true
  Ssh:
    KeyName: julius-ed
  LocalStorage:
    RootVolume:
      Size: 500
      DeleteOnTermination: true
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
      - Policy: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
    S3Access:
      - BucketName: name
        EnableWriteAccess: true
      - BucketName: name
        EnableWriteAccess: true
      - BucketName: name
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 60
    CustomSlurmSettings:
    # Slurm accounting settings
    # disable this once accounting is enabled to a database.
      - JobCompType: jobcomp/filetxt
      - JobCompLoc: /home/slurm/slurm-jobcompletions.txt
      - JobAcctGatherType: jobacct_gather/linux
  SlurmQueues:
    - Name: cpu-queue
      ComputeResources:
        - Name: c59xlarge
          Instances:
            - InstanceType: c5.9xlarge
          MinCount: 1 # if min = max then capacity is fixed. If min < max then capacity is dynamic
          MaxCount: 2
      Networking:
        SubnetIds: 
        - subnet-id
      Iam:
        S3Access:
          - BucketName: name
            EnableWriteAccess: true
          - BucketName: guidelabs-users
            EnableWriteAccess: true
          - BucketName: guidelabs-scripts
    - Name: gpu-queue
      Networking:
        SubnetIds: 
          - subnet-id
        PlacementGroup:
          Enabled: false
      ComputeSettings:
        LocalStorage:
          EphemeralVolume:
            MountDir: /scratch
          RootVolume:
            Size: 200
      CapacityReservationTarget:
        CapacityReservationId: capacityreservation-id
      ComputeResources:
        - Name: a100
          InstanceType: p4de.24xlarge
          MinCount: 3 # if min = max then capacity is fixed. If min < max then capacity is dynamic
          MaxCount: 3
          Efa:
            Enabled: true
      Iam:
        S3Access:
          - BucketName: name
            EnableWriteAccess: true
          - BucketName: name
            EnableWriteAccess: true
          - BucketName: name
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      DeploymentType: PERSISTENT_2
      StorageCapacity: 12000
      DeletionPolicy: Retain
      PerUnitStorageThroughput: 250
Monitoring:
  DetailedMonitoring: true
  Logs:
    CloudWatch:
      Enabled: true
  Dashboards:
    CloudWatch:
      Enabled: true
DirectoryService:
  DomainName: x
  DomainAddr: x
  PasswordSecretArn: x
  DomainReadOnlyUser: x
  GenerateSshKeysForUsers: True
  AdditionalSssdConfigs:
    ldap_auth_disable_tls_never_use_in_production: True

@hanwen-cluster
Contributor

Hi Julius,

Answer to Problem 1

The custom Slurm settings look good to me. The only parameters blocked by our validator are JobAcctGatherType and LaunchParameters, because ParallelCluster sets values for them. You can use --suppress-validators to suppress the error (see the docs), and your values in CustomSlurmSettings will override the ParallelCluster values. Note that cluster configurations requiring --suppress-validators are not fully tested by our team, so please use it carefully.
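As a sketch, suppressing validators at update time looks roughly like this (cluster name and config file name are placeholders; `ALL` disables every check, so targeting the specific validator named in the error output via `type:<ValidatorName>` is safer):

```shell
# Placeholder cluster name and config file. Prefer "type:<ValidatorName>"
# (copied from the validation error output) over ALL, which disables
# every validator.
pcluster update-cluster \
  --cluster-name my-cluster \
  --cluster-configuration cluster-config.yaml \
  --suppress-validators ALL
```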

Answer to Problem 2

You can mount /home using Lustre. See https://www.youtube.com/watch?v=KDTMczTtPdM. Unfortunately, ParallelCluster doesn't support mounting the root volume differently or setting user quotas. To do that, consider including the setup in your custom AMI.

Moreover, please let us know if you would like to make this a feature request. If so, could you explain your use case in more detail?

Thank you,
Hanwen

@adebayoj
Author

Hi @hanwen-cluster, thanks for the feedback. I have a few more clarification questions.

For Problem 1, how should we handle options that require editing additional files? My questions, in detail, are below:

  1. For example, for JobContainerType = job_container/tmpfs, we need to edit (or provide) a job_container.conf text file to enable that option. Will this file be automatically created? Or do we need to manually create it in /opt/slurm/etc/?
  2. For the options ConstrainDevices = yes and ConstrainRAMSpace = yes, these actually need to go in /etc/slurm/cgroup.conf, I believe (see: https://slurm.schedmd.com/cgroup.conf.html). Does specifying them under CustomSlurmSettings with the --suppress-validators flag automatically write to /etc/slurm/cgroup.conf? What would you suggest as the best way to enable this?
  3. Lastly, can you give us guidance on whether we should be manually editing the slurm.conf files if we wanted to test a few ideas in a small cluster? When we edit slurm.conf, should we restart Slurm and update the cluster via a cluster update? If you would caution against doing this, it would be helpful to know as well.

For Problem 2,

  1. We currently have shared storage at /fsx, which is a Lustre file system. However, when we tried to add another Lustre file system for /home, we got an error that a given cluster configuration only allows a single Lustre file system.
  2. Can you point us to a tutorial on how to customize user quotas while building an AMI? We are familiar with this tutorial: https://www.youtube.com/watch?v=3ysMkZrDlGI. Right now, we simply go with one of the Deep Learning AMIs.

Thanks for the help!

@gmarciani
Contributor

Hi @adebayoj ,

Problem 1

  1. You need to manually create it. Creation can be done with an OnNodeStart custom action script.
  2. No. CustomSlurmSettings are injected into slurm.conf only. We have a feature request to support injection of custom config into cgroup.conf, but it is not yet planned. Currently, you can customize /etc/slurm/cgroup.conf using an OnNodeConfigured custom action.
  3. Slurm must be restarted every time there is a change in slurm.conf. We suggest customizing the file by injecting properties via CustomSlurmSettings and then running pcluster update-cluster.
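A minimal sketch of such a custom action script (the /opt/slurm/etc path is an assumption about where Slurm config lives on a ParallelCluster node; here the target defaults to a scratch path so the snippet is runnable anywhere):

```shell
#!/bin/bash
# Hypothetical OnNodeConfigured custom action: append cgroup constraints.
# On a real node, point CONF at /opt/slurm/etc/cgroup.conf (assumed path);
# it defaults to a scratch file here so the snippet runs anywhere.
CONF="${CONF:-/tmp/cgroup.conf.example}"

# Append the settings; if the file already contains ParallelCluster-managed
# entries, check for duplicate keys before appending.
cat >> "$CONF" <<'EOF'
ConstrainDevices=yes
ConstrainRAMSpace=yes
EOF

echo "wrote cgroup settings to $CONF"
```

After the file is in place, the Slurm daemons would need a restart for the change to take effect.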

Problem 2

  1. You can mount only 1 FSx file system managed by pcluster (see [quotas](https://docs.aws.amazon.com/parallelcluster/latest/ug/shared-storage-quotas-v3.html)). If you want to add more, you need to create the file system outside of pcluster and add it to your SharedStorage section, specifying its FileSystemId.
  2. What quota would you like to modify? Also, please note that the video you're referring to is not related to ParallelCluster but to AWS Parallel Computing Service.
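For reference, attaching an externally created FSx for Lustre file system would look roughly like this in the SharedStorage section (the FileSystemId, Name, and mount dir are placeholders, not values from this cluster):

```yaml
SharedStorage:
  - MountDir: /home
    Name: home-fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      # Pre-existing file system, created outside of pcluster
      FileSystemId: fs-0123456789abcdef0
```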

@adebayoj
Author

adebayoj commented Feb 3, 2025

Hi @gmarciani,

Thanks for the updates. Your suggestions were helpful, and we are now able to address all of the issues in Problem 1 easily. For Problem 2-1, we also took your advice.

Problem 2-2 Clarification
We would like to enable per-user size quotas for home directories; for example, we would like each user's home directory to be limited to ~200GB. For a standard Linux file system, we could follow these tutorials: https://askubuntu.com/questions/723849/user-home-folder-size-limit and http://www.yolinux.com/TUTORIALS/LinuxTutorialQuotas.html. However, we are wondering whether this translates directly to our setup, where we would just make edits to the files ourselves.

Microsoft AD
We do multi-user authentication with the Microsoft AD service right now, following the tutorial here and part of this video: https://www.youtube.com/watch?v=wvd6bFieht0. However, we have been unable to get it to work reliably. Here are the issues:

  1. When we change a user's password via aws ds reset-user-password --directory-id $DIRECTORY_ID --user-name "ReadOnlyUser" --new-password "xxxx" --region "region-id" (from https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_05_multi-user-ad-step1.html), it doesn't actually propagate to the cluster.
  2. Can we use a single MS AD service with multiple ParallelCluster installations? When we do so, we find that users are not able to log in.
  3. When we make password changes for multiple users, do we need to update the PasswordSecretArn in the config file for the cluster?
  4. To update a user's password, is it enough to run aws ds reset-user-password --directory-id $DIRECTORY_ID --user-name "ReadOnlyUser" --new-password "ro-p@ssw0rd" --region "region-id" locally via the AWS CLI, or do we need to be on the AWS EC2 instance that is connected to the MS AD directory?
  5. Is there anything special about corp.example.com? When we use our own domain, i.e., corp.guidelabs.ai, we are not able to authenticate to the cluster, despite using the same procedure as for corp.example.com.

Thank you so much for all the feedback so far. As you can see, we are struggling with MS AD, so any alternatives that you can provide would be helpful. We would also be happy to go over additional tutorials that can help us.

Thanks!

@gmarciani
Contributor

Hi @adebayoj ,

Glad to know it was helpful.

Problem 2-2 Clarification
You can make changes to /etc/fstab as part of a custom action script. Please note that the ParallelCluster cookbook also makes use of fstab to mount the shared storage, so you need to make sure that whatever logic you implement to change it does not corrupt the entries made by ParallelCluster.
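As a rough sketch of the quota side (standard Linux quota tooling, not ParallelCluster-specific; the user name and limits are placeholders), on a local ext4-style /home this could look like the following. Note that an FSx for Lustre /home would instead use `lfs setquota`:

```shell
# Run as root on the node (e.g. from a custom action script).
# Add usrquota to the /home entry in /etc/fstab, leaving the
# ParallelCluster-managed entries untouched; adjust the match to your entry.
sed -i '\|[[:space:]]/home[[:space:]]|s/defaults/defaults,usrquota/' /etc/fstab
mount -o remount /home
quotacheck -cum /home            # build the user quota index files
quotaon -v /home
# ~200GB soft / ~210GB hard limit, in 1K blocks; "someuser" is a placeholder
setquota -u someuser 209715200 220200960 0 0 /home
```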

Microsoft AD
It is expected that aws ds reset-user-password does not propagate the password change to the cluster. That command is meant to change the password only on the AD side. To reset the password on the cluster you need to run a dedicated script. See https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3-multi-user.html#troubleshooting-v3-multi-user-reset-passwd

@adebayoj
Author

Hi @gmarciani, I wanted to follow up to say that your feedback here was very helpful. Do you want me to close this issue or leave it open? I used it more as a way to get help with certain key issues. I have more questions around best practices, but I'm unsure whether I should use this issue or open another one. Thanks!
