
Manage Access to credentialed-access projects Using Access Point Policies #2293

Open · wants to merge 31 commits into base: dev
Conversation

Chrystinne (Contributor)

This pull request implements AWS S3 Access Point policies to manage access to restricted-access projects stored in S3 buckets.

The key goal of this change is scalability: access management should continue to work as the number of projects and users grows.

@Chrystinne assigned, then unassigned, Chrystinne and bemoody Sep 17, 2024
@Chrystinne changed the title from "Manage Access to Restricted S3 Buckets Using Access Point Policies" to "Manage Access to credentialed-access projects Using Access Point Policies" Sep 17, 2024
@bemoody (Collaborator) commented Sep 18, 2024

class AccessPoint(models.Model):
    aws = models.ForeignKey(AWS, related_name='access_points', on_delete=models.CASCADE)
    name = models.CharField(max_length=255)
    users = models.ManyToManyField('AccessPointUser', related_name='access_points')

    def __str__(self):
        return self.name

class AccessPointUser(models.Model):
    aws_id = models.CharField(max_length=20)  # Assuming AWS ID is a string like '053677451470'

    def __str__(self):
        return self.aws_id

First observation: the name of the class should probably be AWSAccessPoint or S3AccessPoint. Make it clear what this is for.

Second: we want to associate access points with particular Users, not particular AWS principals. Because:

  • If somebody removes the AWS account from their PhysioNet profile, we need to remove them from any access points that they had access to.

  • If they subsequently add a different AWS account to their PhysioNet profile, I think we want to add them back to the same access points they were using previously (so their existing scripts will work with their new AWS credentials.)

We could define users as a ManyToManyField pointing to user.User. But I think it'd be better to do something like:

class AWSAccessPointUser(models.Model):
    access_point = models.ForeignKey(AWSAccessPoint, related_name='users',
                                     on_delete=models.CASCADE)
    user = models.ForeignKey('user.User', related_name='aws_access_points',
                             on_delete=models.CASCADE)

    class Meta:
        unique_together = [('access_point', 'user')]

Which is essentially the same thing as ManyToManyField at the database level, but at the Python level it gives more flexibility for the future.

@bemoody (Collaborator) commented Sep 18, 2024

we want to associate access points with particular Users, not particular AWS principals

Now, I say this, but to be fair, there is still an open question of whether we want to allow one person to use multiple AWS principals. Doing that, however, could be quite messy UX-wise (we'd either have to tell people "use access point X if you're using principal A, use access point Y if you're using principal B"... or else we'd have to tell people that every time they add a new principal they might be bumped to a different AP.)

Anyway, I think if we want to add that feature down the road, we can define new models at that point to support it.

@bemoody (Collaborator) commented Sep 18, 2024

I don't like having "aws_id" as an argument to AWS.s3_uri(). What I think we want is to define an s3_uri method in the AccessPoint class.

To obtain the S3 URI for a particular authorized user, we might end up doing something like

    s3_uri = None
    if project.aws:
        if project.aws.is_private:
            ap = project.aws.access_points.filter(users__user=user).first()
            if ap:
                s3_uri = ap.s3_uri()
        else:
            s3_uri = project.aws.s3_uri()

If the region and account ID are required as part of the S3 URI for the access point, those things should be stored in the AccessPoint model (they shouldn't be hard-coded or inferred.)
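Storing the region and account ID on the model means the URI can be assembled from model fields alone. A minimal stand-alone sketch (the function and field names here are illustrative, not the PR's actual code; one common form of an access-point S3 URI uses the access point ARN in place of a bucket name):

```python
def access_point_s3_uri(region, account_id, name):
    """Build an S3 URI for an access point from stored values.

    Hypothetical helper: region, account_id, and name would come from
    fields on the AccessPoint model rather than being hard-coded.
    """
    # The AWS CLI accepts an access point ARN where a bucket name would go,
    # e.g. "aws s3 sync s3://arn:aws:s3:REGION:ACCOUNT:accesspoint/NAME ."
    arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"
    return f"s3://{arn}"
```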

@bemoody (Collaborator) commented Sep 20, 2024

The basic concept here, I think, is to automatically grant access to everyone who has a linked AWS account and has permission to access the project.

Still some details need working out, but is that broadly how we want this to work? Or should we instead grant access only to people who request it?

I don't think it makes a huge difference, but doing the latter has some advantages: fewer APs to manage, and we get some feedback about how many people are using the feature.

@bemoody (Collaborator) commented Sep 27, 2024

  • In upload_project_to_S3, the controlled-access bucket is created if it doesn't exist, but its bucket policy is not set. We need to set the bucket policy.

  • create_controlled_bucket_policy produces a policy that doesn't work; you want "s3:DataAccessPointAccount": settings.AWS_ACCOUNT_ID (not f"s3:DataAccessPointAccount": "{settings.AWS_ACCOUNT_ID}").

  • create_data_access_point_policy needs to restrict the s3:prefix for ListBucket actions.
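The second bullet above can be illustrated with a hedged sketch of the policy builder (the function name and policy shape here are illustrative, not the PR's actual code): the condition value must be the plain account ID, not an f-string wrapped in quotes, so that the generated JSON delegates access decisions to access points owned by this account.

```python
def controlled_bucket_policy(bucket_name, account_id):
    """Sketch of a bucket policy delegating access control to access points.

    Hypothetical helper; in the PR, account_id would be
    settings.AWS_ACCOUNT_ID.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "*",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            # Correct form: the account ID itself as the condition value,
            # not f'"{settings.AWS_ACCOUNT_ID}"' as a quoted f-string.
            "Condition": {
                "StringEquals": {"s3:DataAccessPointAccount": account_id}
            },
        }],
    }
```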

…ring projects with a 'RESTRICTED/CREDENTIALED' access policy.
…s being created for the open data bucket or the controlled data bucket. This update also changes how we grant access to users for the controlled-access dataset by using Access Points (APs). It includes creating and listing APs, creating and updating AP policies, and associating AWS users with APs.
… view to use update_data_access_point_policy instead of the old bucket policy update method. Ensure the S3 credentials exist before updating the data access point policy.
…PointUser. Associate access points with specific users instead of AWS principals. Modify the s3_uri() method to retrieve the AWS ID from the cloud information associated with the logged-in user. This information is used to properly display the AWS sync command for the logged-in user.
…ormation to be compatible with the changes made in the AWSAccessPoint and AWSAccessPointUser models.
@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 729aa5d to e7a527e Compare November 21, 2024 20:38
@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 93b7cc5 to ed5c551 Compare January 22, 2025 21:21
@Chrystinne (Contributor, Author)

@bemoody This is the list of changes addressed in this PR:

  • Rename the classes to AWSAccessPoint and S3AccessPoint;

  • Associate access points with particular Users, not particular AWS principals;

  • Update the AWSAccessPointUser model;

  • Fix the policy created by create_controlled_bucket_policy:
    replace f"s3:DataAccessPointAccount": "{settings.AWS_ACCOUNT_ID}" with "s3:DataAccessPointAccount": settings.AWS_ACCOUNT_ID;

  • Remove the use of middleware/thread-local storage;

  • When adding a new user to an access point, ensure other users are not reassigned to different access points;

  • Use the aws_userid, not the aws_id, in access point policies;

  • Update only the access-point policy that is being changed, rather than updating all access-point policies at once;

  • Enforce uniqueness of AWSAccessPointUser for (user, aws) by adding a foreign key;

  • Remove the "aws_id" argument from AWS.s3_uri() and define an s3_uri method in the AccessPoint class;

  • Move the method for retrieving the S3 URI for a specific authorized user to the AWSAccessPoint model.

@Chrystinne (Contributor, Author)

@bemoody These are the issues created for the remaining features, to be implemented as enhancements in follow-up PRs (a new "enhancement" label has been created):

Issue #2330

  • Detect and remove invalid AWS IDs from the policy, or use a policy condition that does not require IDs to be valid;

  • Refresh policies periodically to add, remove, or update users whose authorization status or AWS ID has changed;

  • Remove users from access points when they remove their AWS account from their PhysioNet profile;

  • Reassign users to the same access points if they later add a different AWS account to their PhysioNet profile, ensuring their existing scripts continue to work with new AWS credentials;

Issue #2332

  • Ensure the region and account ID, required as part of the S3 URI for the access point, are stored in the AccessPoint model rather than being hard-coded or inferred.

@Chrystinne (Contributor, Author)

@bemoody The PR is now ready for review.

@bemoody (Collaborator) commented Jan 23, 2025

        if (
            has_s3_credentials()
            and files_sent_to_S3(project) is not None
            and s3_bucket_has_credentialed_users(project)
        ):

This should be:

if has_s3_credentials() and files_sent_to_S3(project):

@bemoody (Collaborator) commented Jan 23, 2025

          {% if has_s3_credentials and project.aws.sent_files and s3_uri != None %}
            {% if not project.aws.is_private or has_signed_dua %}

Just write

          {% if project.aws.sent_files and s3_uri %}

(or maybe the check for sent_files should be in the view)

@bemoody (Collaborator) commented Jan 23, 2025

get_access_point_name_for_user_and_project: not used, remove it.
list_access_points: not used, remove it.
update_aws_access_point_policy: not used, remove it.

@bemoody (Collaborator) commented Jan 23, 2025

    if project.access_policy == AccessPolicy.OPEN:
        update_open_bucket_policy(project, bucket_name)
    else:
        if s3_bucket_has_credentialed_users(project):
            initialize_access_points(project)

Just write:

    if project.access_policy == AccessPolicy.OPEN:
        update_open_bucket_policy(project, bucket_name)
    else:
        initialize_access_points(project)

Then s3_bucket_has_credentialed_users is not needed and can be removed.

@bemoody (Collaborator) commented Jan 23, 2025

s3_bucket_has_access_point: not used, remove it.
get_access_point_name: not used except by s3_bucket_has_access_point, remove it.
create_first_data_access_point_policy: not used, remove it.
validade_aws_id: not used, remove it.

@Chrystinne (Contributor, Author)

@bemoody Your last review comments have been addressed.

@bemoody (Collaborator) commented Feb 3, 2025

In project/cloud/s3.py

files_sent_to_S3:

  • (cleanup) The added "import" statement is not needed.

get_aws_accounts_for_access_point:

  • (user verification) Should use cloud_information__aws_verification_datetime__isnull=False, rather than cloud_information__aws_id__isnull=False.
  • (error handling) Remove the try ... except Exception. In general, don't ignore unexpected exceptions, and don't use print for logging.

get_aws_accounts_for_dataset:

  • (user verification) Should use cloud_information__aws_verification_datetime__isnull=False, rather than cloud_information__aws_id__isnull=False.

get_aws_account_by_id:

  • (cleanup) This function shouldn't be needed. aws_id is not unique and shouldn't be used to identify users.

set_data_access_point_policy:

  • (cleanup) Remove the if s3_control is None: return. (create_s3_control_client never returns None.)
  • (error handling) Remove the try ... except Exception.

add_user_to_access_point_policy:

  • (style) Should not take max_users as an argument; should refer to MAX_PRINCIPALS_PER_AP_POLICY instead (which should be a global variable.)
  • (user verification) Should take a User object as an argument, rather than aws_id. (Should not need to call get_aws_account_by_id.)
  • (error handling) Remove the try ... except Exception.

get_access_point_with_capacity:

  • (style) Should not take max_users as an argument; should refer to MAX_PRINCIPALS_PER_AP_POLICY instead (which should be a global variable.)

initialize_access_points:

  • (style) MAX_PRINCIPALS_PER_AP_POLICY should be a global variable.
  • (error handling) Remove the try ... except Exception.
  • (error handling) Remove the if not access_point or aws_ids is None.

associate_aws_users_with_data_access_point:

  • (user verification) Q(aws_id=aws_id) | Q(aws_userid=aws_userid) should be just aws_userid=aws_userid. aws_userid is unique and all verified users should have a userid. aws_id is not unique and is not needed.
  • (error handling) Remove the try ... except Exception.

create_s3_access_point:

  • (error handling) Remove the try ... except Exception.

In project/views.py

sign_dua:

  • (user verification) Should check user.cloud_information.aws_verification_datetime.
  • (user verification) Should pass user to add_user_to_access_point_policy, not user.cloud_information.aws_id.
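The MAX_PRINCIPALS_PER_AP_POLICY style points above can be sketched as follows (a plain-Python stand-in, not the PR's actual code: access points are modeled as dicts, and the limit value 500 is a hypothetical placeholder):

```python
# Module-level constant, so per-call max_users arguments are unnecessary.
MAX_PRINCIPALS_PER_AP_POLICY = 500  # hypothetical limit

def get_access_point_with_capacity(access_points):
    """Return the first access point whose user list still has room.

    In the real code, access_points would be a queryset of AWSAccessPoint
    objects; here each one is a dict with a "users" list.
    """
    for ap in access_points:
        if len(ap["users"]) < MAX_PRINCIPALS_PER_AP_POLICY:
            return ap
    return None  # no capacity left: the caller creates a new access point
```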

@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 703eaa5 to 1d25b61 Compare February 10, 2025 22:42
@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 1d25b61 to 08fbcd6 Compare February 10, 2025 22:59
@Chrystinne (Contributor, Author)

@bemoody Your last review comments have been addressed.

@bemoody (Collaborator) commented Feb 12, 2025

In the AWSAccessPoint class:

    name = models.CharField(max_length=255)

Should be:

    # See https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-restrictions-limitations-naming-rules.html
    name = models.CharField(max_length=50, unique=True)

Trying to create two objects with the same name should be an error.

Also, it's important to be aware of the maximum length, and it's useful to add a comment to explain this for future maintainers.

@bemoody (Collaborator) commented Feb 12, 2025

In create_data_access_point_policy, principal_value should be:

    principal_value = {
        "AWS": [
            account['aws_userid']
            for account in aws_accounts
        ]
    }

userid is not the same as the AWS username; we literally just want to use the userid here. (The userid has the person's account number encoded in it, which is why aws_id is not needed.)

@bemoody (Collaborator) commented Feb 12, 2025

As I've said before, there needs to be a restriction on ListBucket operations. This code allows anyone who is authorized to access any one published project to view the metadata of every published file in every published project.

It would probably be sufficient to do this

{
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:ListBucket"],
    "Resource": "arn:aws:s3:<REGION>:<ACCOUNT>:accesspoint/<AP>",
    "Condition": {"StringNotLike": {"s3:prefix": "<PROJECT>/<VERSION>/*"}}
}

but that should still be tested.
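A parameterized version of the Deny statement above can make the placeholders explicit (a hedged sketch: the function name is illustrative, and as noted, the resulting policy still needs testing against S3):

```python
def listbucket_deny_statement(region, account_id, ap_name, project, version):
    """Build a Deny statement restricting ListBucket to one project prefix.

    Hypothetical helper: the <REGION>/<ACCOUNT>/<AP>/<PROJECT>/<VERSION>
    placeholders from the comment above become arguments here.
    """
    return {
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:ListBucket"],
        "Resource": f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}",
        # Deny listing anything outside this project version's own prefix.
        "Condition": {
            "StringNotLike": {"s3:prefix": f"{project}/{version}/*"}
        },
    }
```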

@bemoody (Collaborator) commented Feb 14, 2025

Instead of writing this:

    access_point = AWSAccessPoint.objects.filter(name=access_point_name).first()

    if not access_point:
        return aws_accounts

Write this:

    try:
        access_point = AWSAccessPoint.objects.get(name=access_point_name)
    except AWSAccessPoint.DoesNotExist:
        return aws_accounts

If you expect to retrieve exactly one object (or zero), don't use first(). If it turns out that there are multiple objects satisfying your query, get() will raise an exception.

@bemoody (Collaborator) commented Feb 14, 2025

Here's an example of where first() is wrong and dangerous:

    cloud_info = CloudInformation.objects.filter(
        aws_id=user.cloud_information.aws_id
    ).first()

    aws_account = {
        'aws_id': cloud_info.aws_id,
        'aws_userid': cloud_info.aws_userid
    }

Here, you're trying to retrieve the CloudInformation associated with a particular person (the person who signed the DUA.) You think you know which CloudInformation you're looking for - the one that has a particular aws_id - so you filter the CloudInformation objects and pick the first one.

But your assumption is incorrect. In fact, aws_id is not enough to uniquely identify a CloudInformation. There may be more than one person with the same aws_id and therefore there may be more than one CloudInformation. And because you used first() here rather than get(), Django won't complain about your mistake. Instead, it'll pick one object "at random".

(Now, in this case, the query is completely unnecessary. You have the CloudInformation object already - it's user.cloud_information. But I think this is a useful illustration.)
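The get()-versus-first() distinction can be shown with a pure-Python stand-in, with no Django required (illustrative only; the two functions mimic QuerySet semantics over a plain list):

```python
class MultipleObjectsReturned(Exception):
    """Raised when a lookup expected to be unique matches several rows."""

def first(items, pred):
    # Like QuerySet.first(): silently returns an arbitrary match, or None.
    return next((x for x in items if pred(x)), None)

def get(items, pred):
    # Like QuerySet.get(): raises if the match is missing or not unique.
    matches = [x for x in items if pred(x)]
    if len(matches) > 1:
        raise MultipleObjectsReturned
    if not matches:
        raise LookupError("does not exist")
    return matches[0]
```

With two rows sharing the same aws_id, first() hides the ambiguity by picking one, while get() surfaces the bug immediately.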

@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 8bdb305 to 15b9363 Compare February 24, 2025 17:29
@Chrystinne (Contributor, Author)

@bemoody I have made the changes so that the access point policy is generated with AWS user IDs (aws_userid) in the "Principal" field. I have tested inserting them, and they work properly.

@Chrystinne (Contributor, Author)

@bemoody I've also finished addressing the other comments. I'd appreciate it if you could take a look at it again.
