
Manage Access to credentialed-access projects Using Access Point Policies #2293

Open · wants to merge 31 commits into base: dev
Conversation

Chrystinne (Contributor)

This pull request implements AWS S3 Access Point policies to manage access to restricted-access projects stored in S3 buckets.

The key goal of this change is scalability: access management should continue to work as the number of projects and users grows.

@Chrystinne assigned, then unassigned, Chrystinne and bemoody Sep 17, 2024
@Chrystinne changed the title from "Manage Access to Restricted S3 Buckets Using Access Point Policies" to "Manage Access to credentialed-access projects Using Access Point Policies" Sep 17, 2024
@bemoody (Collaborator) commented Sep 18, 2024

class AccessPoint(models.Model):
    aws = models.ForeignKey(AWS, related_name='access_points', on_delete=models.CASCADE)
    name = models.CharField(max_length=255)
    users = models.ManyToManyField('AccessPointUser', related_name='access_points')

    def __str__(self):
        return self.name

class AccessPointUser(models.Model):
    aws_id = models.CharField(max_length=20)  # Assuming AWS ID is a string like '053677451470'

    def __str__(self):
        return self.aws_id

First observation: the name of the class should probably be AWSAccessPoint or S3AccessPoint. Make it clear what this is for.

Second: we want to associate access points with particular Users, not particular AWS principals. Because:

  • If somebody removes the AWS account from their PhysioNet profile, we need to remove them from any access points that they had access to.

  • If they subsequently add a different AWS account to their PhysioNet profile, I think we want to add them back to the same access points they were using previously (so their existing scripts will work with their new AWS credentials.)

We could define users as a ManyToManyField pointing to user.User. But I think it'd be better to do something like:

class AWSAccessPointUser(models.Model):
    access_point = models.ForeignKey(AWSAccessPoint, related_name='users',
                                     on_delete=models.CASCADE)
    user = models.ForeignKey('user.User', related_name='aws_access_points',
                             on_delete=models.CASCADE)

    class Meta:
        unique_together = [('access_point', 'user')]

Which is essentially the same thing as ManyToManyField at the database level, but at the Python level it gives more flexibility for the future.

@bemoody (Collaborator) commented Sep 18, 2024

we want to associate access points with particular Users, not particular AWS principals

Now, I say this, but to be fair, there is still an open question of whether we want to allow one person to use multiple AWS principals. Doing that, however, could be quite messy UX-wise (we'd either have to tell people "use access point X if you're using principal A, use access point Y if you're using principal B"... or else we'd have to tell people that every time they add a new principal they might be bumped to a different AP.)

Anyway, I think if we want to add that feature down the road, we can define new models at that point to support it.

@bemoody (Collaborator) commented Sep 18, 2024

I don't like having "aws_id" as an argument to AWS.s3_uri(). What I think we want is to define an s3_uri method in the AccessPoint class.

To obtain the S3 URI for a particular authorized user, we might end up doing something like

    s3_uri = None
    if project.aws:
        if project.aws.is_private:
            ap = project.aws.access_points.filter(users__user=user).first()
            if ap:
                s3_uri = ap.s3_uri()
        else:
            s3_uri = project.aws.s3_uri()

If the region and account ID are required as part of the S3 URI for the access point, those things should be stored in the AccessPoint model (they shouldn't be hard-coded or inferred.)
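Storing the region and account ID on the model means the URI can be assembled from model fields alone. A minimal stand-alone sketch (the function and field names here are illustrative, not the PR's actual code; one common form of an access-point S3 URI uses the access point ARN in place of a bucket name):

```python
def access_point_s3_uri(region, account_id, name):
    """Build an S3 URI for an access point from stored values.

    Hypothetical helper: region, account_id, and name would come from
    fields on the AccessPoint model rather than being hard-coded.
    """
    # The AWS CLI accepts an access point ARN where a bucket name would go,
    # e.g. "aws s3 sync s3://arn:aws:s3:REGION:ACCOUNT:accesspoint/NAME ."
    arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{name}"
    return f"s3://{arn}"
```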

@bemoody (Collaborator) commented Sep 20, 2024

The basic concept here, I think, is to automatically grant access to everyone who has a linked AWS account and has permission to access the project.

Still some details need working out, but is that broadly how we want this to work? Or should we instead grant access only to people who request it?

I don't think it makes a huge difference, but doing the latter has some advantages: fewer APs to manage, and we get some feedback about how many people are using the feature.

@bemoody (Collaborator) commented Sep 27, 2024

  • In upload_project_to_S3, the controlled-access bucket is created if it doesn't exist, but its bucket policy is not set. We need to set the bucket policy.

  • create_controlled_bucket_policy produces a policy that doesn't work; you want "s3:DataAccessPointAccount": settings.AWS_ACCOUNT_ID (not f"s3:DataAccessPointAccount": "{settings.AWS_ACCOUNT_ID}").

  • create_data_access_point_policy needs to restrict the s3:prefix for ListBucket actions.
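The second bullet above can be illustrated with a hedged sketch of the policy builder (the function name and policy shape here are illustrative, not the PR's actual code): the condition value must be the plain account ID, not an f-string wrapped in quotes, so that the generated JSON delegates access decisions to access points owned by this account.

```python
def controlled_bucket_policy(bucket_name, account_id):
    """Sketch of a bucket policy delegating access control to access points.

    Hypothetical helper; in the PR, account_id would be
    settings.AWS_ACCOUNT_ID.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "*",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            # Correct form: the account ID itself as the condition value,
            # not f'"{settings.AWS_ACCOUNT_ID}"' as a quoted f-string.
            "Condition": {
                "StringEquals": {"s3:DataAccessPointAccount": account_id}
            },
        }],
    }
```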

…ring projects with a 'RESTRICTED/CREDENTIALED' access policy.
…s being created for the open data bucket or the controlled data bucket. This update also changes how we grant access to users for the controlled-access dataset by using Access Points (APs). It includes creating and listing APs, creating and updating AP policies, and associating AWS users with APs.
… view to use update_data_access_point_policy instead of the old bucket policy update method. Ensure the S3 credentials exist before updating the data access point policy.
…PointUser. Associate access points with specific users instead of AWS principals. Modify the s3_uri() method to retrieve the AWS ID from the cloud information associated with the logged-in user. This information is used to properly display the AWS sync command for the logged-in user.
…ormation to be compatible with the changes made in the AWSAccessPoint and AWSAccessPointUser models.
@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 729aa5d to e7a527e Compare November 21, 2024 20:38
@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 93b7cc5 to ed5c551 Compare January 22, 2025 21:21
@Chrystinne (Contributor, Author)

@bemoody This is the list of changes addressed in this PR:

  • Rename the classes to AWSAccessPoint and S3AccessPoint;

  • Associate access points with particular Users, not particular AWS principals;

  • Update the AWSAccessPointUser model;

  • Fix the policy created by create_controlled_bucket_policy:
    replace f"s3:DataAccessPointAccount": "{settings.AWS_ACCOUNT_ID}" with "s3:DataAccessPointAccount": settings.AWS_ACCOUNT_ID;

  • Remove the use of middleware/thread-local storage;

  • When adding a new user to an access point, ensure other users are not reassigned to different access points;

  • Use the aws_userid, not the aws_id, in access point policies;

  • Update only the access-point policy that is being changed, rather than updating all access-point policies at once;

  • Enforce uniqueness of AWSAccessPointUser for (user, aws) by adding a foreign key;

  • Remove the "aws_id" argument from AWS.s3_uri() and define an s3_uri method in the AccessPoint class;

  • Move the method for retrieving the S3 URI for a specific authorized user to the AWSAccessPoint model.

@Chrystinne (Contributor, Author)

@bemoody These are the issues created for the remaining features, to be implemented as enhancements in follow-up PRs (a new "enhancement" label has been created):

Issue #2330

  • Detect and remove invalid AWS IDs from the policy, or use a policy condition that does not require IDs to be valid;

  • Refresh policies periodically to add, remove, or update users whose authorization status or AWS ID has changed;

  • Remove users from access points when they remove their AWS account from their PhysioNet profile;

  • Reassign users to the same access points if they later add a different AWS account to their PhysioNet profile, ensuring their existing scripts continue to work with new AWS credentials;

Issue #2332

  • Ensure the region and account ID, required as part of the S3 URI for the access point, are stored in the AccessPoint model rather than being hard-coded or inferred.

@Chrystinne (Contributor, Author)

@bemoody The PR is now ready for review.

@bemoody (Collaborator) commented Jan 23, 2025

        if (
            has_s3_credentials()
            and files_sent_to_S3(project) is not None
            and s3_bucket_has_credentialed_users(project)
        ):

This should be:

if has_s3_credentials() and files_sent_to_S3(project):

@bemoody (Collaborator) commented Jan 23, 2025

          {% if has_s3_credentials and project.aws.sent_files and s3_uri != None %}
            {% if not project.aws.is_private or has_signed_dua %}

Just write

          {% if project.aws.sent_files and s3_uri %}

(or maybe the check for sent_files should be in the view)

@bemoody (Collaborator) commented Jan 23, 2025

get_access_point_name_for_user_and_project: not used, remove it.
list_access_points: not used, remove it.
update_aws_access_point_policy: not used, remove it.

@bemoody (Collaborator) commented Jan 23, 2025

    if project.access_policy == AccessPolicy.OPEN:
        update_open_bucket_policy(project, bucket_name)
    else:
        if s3_bucket_has_credentialed_users(project):
            initialize_access_points(project)

Just write:

    if project.access_policy == AccessPolicy.OPEN:
        update_open_bucket_policy(project, bucket_name)
    else:
        initialize_access_points(project)

Then s3_bucket_has_credentialed_users is not needed and can be removed.

@bemoody (Collaborator) commented Jan 23, 2025

s3_bucket_has_access_point: not used, remove it.
get_access_point_name: not used except by s3_bucket_has_access_point, remove it.
create_first_data_access_point_policy: not used, remove it.
validade_aws_id: not used, remove it.

@Chrystinne (Contributor, Author)

@bemoody Your last review comments have been addressed.

@bemoody (Collaborator) commented Feb 3, 2025

In project/cloud/s3.py

files_sent_to_S3:

  • (cleanup) The added "import" statement is not needed.

get_aws_accounts_for_access_point:

  • (user verification) Should use cloud_information__aws_verification_datetime__isnull=False, rather than cloud_information__aws_id__isnull=False.
  • (error handling) Remove the try ... except Exception. In general, don't ignore unexpected exceptions, and don't use print for logging.

get_aws_accounts_for_dataset:

  • (user verification) Should use cloud_information__aws_verification_datetime__isnull=False, rather than cloud_information__aws_id__isnull=False.

get_aws_account_by_id:

  • (cleanup) This function shouldn't be needed. aws_id is not unique and shouldn't be used to identify users.

set_data_access_point_policy:

  • (cleanup) Remove the if s3_control is None: return. (create_s3_control_client never returns None.)
  • (error handling) Remove the try ... except Exception.

add_user_to_access_point_policy:

  • (style) Should not take max_users as an argument; should refer to MAX_PRINCIPALS_PER_AP_POLICY instead (which should be a global variable.)
  • (user verification) Should take a User object as an argument, rather than aws_id. (Should not need to call get_aws_account_by_id.)
  • (error handling) Remove the try ... except Exception.

get_access_point_with_capacity:

  • (style) Should not take max_users as an argument; should refer to MAX_PRINCIPALS_PER_AP_POLICY instead (which should be a global variable.)

initialize_access_points:

  • (style) MAX_PRINCIPALS_PER_AP_POLICY should be a global variable.
  • (error handling) Remove the try ... except Exception.
  • (error handling) Remove the if not access_point or aws_ids is None.

associate_aws_users_with_data_access_point:

  • (user verification) Q(aws_id=aws_id) | Q(aws_userid=aws_userid) should be just aws_userid=aws_userid. aws_userid is unique and all verified users should have a userid. aws_id is not unique and is not needed.
  • (error handling) Remove the try ... except Exception.

create_s3_access_point:

  • (error handling) Remove the try ... except Exception.

In project/views.py

sign_dua:

  • (user verification) Should check user.cloud_information.aws_verification_datetime.
  • (user verification) Should pass user to add_user_to_access_point_policy, not user.cloud_information.aws_id.
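The MAX_PRINCIPALS_PER_AP_POLICY style points above can be sketched as follows (a plain-Python stand-in, not the PR's actual code: access points are modeled as dicts, and the limit value 500 is a hypothetical placeholder):

```python
# Module-level constant, so per-call max_users arguments are unnecessary.
MAX_PRINCIPALS_PER_AP_POLICY = 500  # hypothetical limit

def get_access_point_with_capacity(access_points):
    """Return the first access point whose user list still has room.

    In the real code, access_points would be a queryset of AWSAccessPoint
    objects; here each one is a dict with a "users" list.
    """
    for ap in access_points:
        if len(ap["users"]) < MAX_PRINCIPALS_PER_AP_POLICY:
            return ap
    return None  # no capacity left: the caller creates a new access point
```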

@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 703eaa5 to 1d25b61 Compare February 10, 2025 22:42
@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 1d25b61 to 08fbcd6 Compare February 10, 2025 22:59
@Chrystinne (Contributor, Author)

@bemoody Your last review comments have been addressed.

@bemoody (Collaborator) commented Feb 12, 2025

In the AWSAccessPoint class:

    name = models.CharField(max_length=255)

Should be:

    # See https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-restrictions-limitations-naming-rules.html
    name = models.CharField(max_length=50, unique=True)

Trying to create two objects with the same name should be an error.

Also, it's important to be aware of the maximum length, and it's useful to add a comment to explain this for future maintainers.

@bemoody (Collaborator) commented Feb 12, 2025

In create_data_access_point_policy, principal_value should be:

    principal_value = {
        "AWS": [
            account['aws_userid']
            for account in aws_accounts
        ]
    }

userid is not the same as the AWS username; we literally just want to use the userid here. (The userid has the person's account number encoded in it, which is why aws_id is not needed.)

@bemoody (Collaborator) commented Feb 12, 2025

As I've said before, there needs to be a restriction on ListBucket operations. This code allows anyone who is authorized to access any one published project to view the metadata of every published file in every published project.

It would probably be sufficient to do this

{
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:ListBucket"],
    "Resource": "arn:aws:s3:<REGION>:<ACCOUNT>:accesspoint/<AP>",
    "Condition": {"StringNotLike": {"s3:prefix": "<PROJECT>/<VERSION>/*"}}
}

but that should still be tested.
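A parameterized version of the Deny statement above can make the placeholders explicit (a hedged sketch: the function name is illustrative, and as noted, the resulting policy still needs testing against S3):

```python
def listbucket_deny_statement(region, account_id, ap_name, project, version):
    """Build a Deny statement restricting ListBucket to one project prefix.

    Hypothetical helper: the <REGION>/<ACCOUNT>/<AP>/<PROJECT>/<VERSION>
    placeholders from the comment above become arguments here.
    """
    return {
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:ListBucket"],
        "Resource": f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}",
        # Deny listing anything outside this project version's own prefix.
        "Condition": {
            "StringNotLike": {"s3:prefix": f"{project}/{version}/*"}
        },
    }
```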

@bemoody (Collaborator) commented Feb 14, 2025

Instead of writing this:

    access_point = AWSAccessPoint.objects.filter(name=access_point_name).first()

    if not access_point:
        return aws_accounts

Write this:

    try:
        access_point = AWSAccessPoint.objects.get(name=access_point_name)
    except AWSAccessPoint.DoesNotExist:
        return aws_accounts

If you expect to retrieve exactly one object (or zero), don't use first(). If it turns out that there are multiple objects satisfying your query, get() will raise an exception.

@bemoody (Collaborator) commented Feb 14, 2025

Here's an example of where first() is wrong and dangerous:

    cloud_info = CloudInformation.objects.filter(
        aws_id=user.cloud_information.aws_id
    ).first()

    aws_account = {
        'aws_id': cloud_info.aws_id,
        'aws_userid': cloud_info.aws_userid
    }

Here, you're trying to retrieve the CloudInformation associated with a particular person (the person who signed the DUA.) You think you know which CloudInformation you're looking for - the one that has a particular aws_id - so you filter the CloudInformation objects and pick the first one.

But your assumption is incorrect. In fact, aws_id is not enough to uniquely identify a CloudInformation. There may be more than one person with the same aws_id and therefore there may be more than one CloudInformation. And because you used first() here rather than get(), Django won't complain about your mistake. Instead, it'll pick one object "at random".

(Now, in this case, the query is completely unnecessary. You have the CloudInformation object already - it's user.cloud_information. But I think this is a useful illustration.)
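The get()-versus-first() distinction can be shown with a pure-Python stand-in, with no Django required (illustrative only; the two functions mimic QuerySet semantics over a plain list):

```python
class MultipleObjectsReturned(Exception):
    """Raised when a lookup expected to be unique matches several rows."""

def first(items, pred):
    # Like QuerySet.first(): silently returns an arbitrary match, or None.
    return next((x for x in items if pred(x)), None)

def get(items, pred):
    # Like QuerySet.get(): raises if the match is missing or not unique.
    matches = [x for x in items if pred(x)]
    if len(matches) > 1:
        raise MultipleObjectsReturned
    if not matches:
        raise LookupError("does not exist")
    return matches[0]
```

With two rows sharing the same aws_id, first() hides the ambiguity by picking one, while get() surfaces the bug immediately.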

@Chrystinne Chrystinne force-pushed the restricted-access-s3-bucket branch from 8bdb305 to 15b9363 Compare February 24, 2025 17:29
@Chrystinne (Contributor, Author)

@bemoody I have made the changes so that the access point policy is generated with AWS user IDs (aws_userid) in the "Principal" field. I have tested inserting them, and they work properly.

@Chrystinne (Contributor, Author)

@bemoody I've also finished addressing the other comments. I'd appreciate it if you could take a look at it again.
