Retrieving subfolder names in an S3 bucket with boto3

Question 1

Using boto3, I can access my AWS S3 bucket:

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')

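Now, the bucket contains a folder first-level, which itself contains several subfolders named with a timestamp, for instance 1456753904534. I need to know the names of these subfolders for another job I'm doing, and I wonder whether I could have boto3 retrieve them for me.

So I tried: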

objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')

which gives a dictionary whose 'Contents' key holds all the third-level files instead of the second-level timestamp directories. In fact, I get a list containing entries such as:

{u'ETag': '"etag"', u'Key': 'first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()),
 u'Owner': {u'DisplayName': 'owner', u'ID': 'id'},
 u'Size': size, u'StorageClass': 'storageclass'}

You can see that the specific files, in this case part-00014, are retrieved, while I'd like to get the name of the directory alone. In principle I could strip the directory name out of all the paths, but it is ugly and expensive to retrieve everything at the third level just to get the second level!

I also tried something reported here:

for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)

but I do not get the folders at the desired level.

Is there a way to solve this?

Question 2

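S3 is an object store; it has no real directory structure. The "/" is rather cosmetic. One reason people want a directory structure is that they can maintain, prune, or add a tree in their application. For S3, you treat such a structure as a kind of index or search tag.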
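To make the "no real directories" point concrete, here is a minimal pure-Python sketch (no AWS calls) of how a flat list of keys can be rolled up into folder-like prefixes. The function name common_prefixes and the sample keys (including the second timestamp) are illustrative assumptions; S3 performs this kind of grouping on the server side when a delimiter is supplied.

```python
def common_prefixes(keys, prefix="", delimiter="/"):
    """Group flat keys into folder-like prefixes one level below `prefix`.

    Illustrative only: mimics the idea behind S3's Delimiter parameter.
    """
    found = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Keep only the part up to and including the first delimiter.
        head, sep, _ = rest.partition(delimiter)
        if sep:
            found.add(prefix + head + sep)
    return sorted(found)


# Hypothetical sample keys, modelled on the question.
keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",
    "first-level/1456753999999/part-00001",
    "other/readme.txt",
]
print(common_prefixes(keys, prefix="first-level/"))
```

The keys themselves never change; only the way they are grouped for display does.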

To manipulate objects in S3, you need boto3.client or boto3.resource. For example, to list all objects:

import boto3 
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name')

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

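In fact, if the S3 object names are stored using the '/' separator, the more recent version of list_objects (list_objects_v2) lets you limit the response to keys that begin with a specified prefix.

To limit the items to those under a certain subfolder: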

    import boto3 
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(
            Bucket=BUCKET,
            Prefix ='DIR1/DIR2',
            MaxKeys=100 )

Documentation

Another option is to use the Python os.path functions to extract the folder prefix. The problem is that this requires listing objects from directories you are not interested in.

import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key) 
foldername = os.path.dirname(s3_key)

# if you are using a non-conventional delimiter such as '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client versus boto3.resource. If you are developing an internal shared library, using boto3.resource gives you a black-box layer over the resources used.

Question 3

The code below returns ONLY the 'subfolders' in a 'folder' of an S3 bucket.

import boto3
bucket = 'my-bucket'
#Make sure you provide / in the end
prefix = 'prefix-name-with-slash/'  

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
# CommonPrefixes is missing when there are no subfolders, so default to []
for o in result.get('CommonPrefixes', []):
    print('sub folder : ', o.get('Prefix'))

For more details, see https://github.com/boto/boto3/issues/134.

Question 4

It took me a lot of time to figure out, but here at last is a simple way to list the contents of a subfolder in an S3 bucket using boto3. Hope it helps.

import boto3

prefix = "folderone/foldertwo/"
s3 = boto3.resource('s3')
bucket = s3.Bucket(name="bucket_name_here")
FilesNotFound = True
# objects.filter lists every key that starts with the prefix
for obj in bucket.objects.filter(Prefix=prefix):
    print('{0}:{1}'.format(bucket.name, obj.key))
    FilesNotFound = False
if FilesNotFound:
    print("ALERT", "No file in {0}/{1}".format(bucket.name, prefix))

Question 5

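Short answer:

Use Delimiter='/'. This avoids doing a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using string manipulation to retrieve the directory names. This could be horribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So imagine that, between bar/ and foo/, you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].
Use Paginators. For the same reason (S3 is an engineer's approximation of infinity), you must list through pages and avoid storing the whole listing in memory. Instead, treat your "lister" as an iterator and handle the stream it produces.
Use boto3.client, not boto3.resource. The resource version does not seem to handle the Delimiter option well. If you have a resource, say bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with bucket.meta.client.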
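The three points above can be combined in a few lines. This is a minimal sketch rather than a definitive implementation: the function name iter_subfolders is an assumption, and the client argument is expected to behave like boto3.client("s3") (or bucket.meta.client when starting from a resource).

```python
def iter_subfolders(client, bucket_name, prefix=""):
    """Yield the "subfolder" prefixes directly under `prefix`, page by page.

    `client` is expected to behave like boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter="/")
    for page in pages:
        # CommonPrefixes is absent from a page that has no "subfolders".
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]


# Usage sketch (needs AWS credentials; bucket and prefix are placeholders):
#   import boto3
#   for name in iter_subfolders(boto3.client("s3"), "my-bucket-name", "first-level/"):
#       print(name)
```

Because the function is a generator, nothing is fetched until the caller iterates, and only one page is held in memory at a time.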

Long answer:

The following is an iterator that I use for simple buckets (no version handling).

import os  # os.path.join is used below to build keys
import boto3
from collections import namedtuple
from operator import attrgetter


S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])


def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
           list_objs=True, limit=None):
    """
    Iterator that lists a bucket's objects under path, (optionally) starting with
    start and ending before end.

    If recursive is False, then list only the "depth=0" items (dirs and objects).

    If recursive is True, then list recursively all objects (no dirs).

    Args:
        bucket:
            a boto3.resource('s3').Bucket().
        path:
            a directory in the bucket.
        start:
            optional: start key, inclusive (may be a relative path under path, or
            absolute in the bucket)
        end:
            optional: stop key, exclusive (may be a relative path under path, or
            absolute in the bucket)
        recursive:
            optional, default True. If True, lists only objects. If False, lists
            only depth 0 "directories" and objects.
        list_dirs:
            optional, default True. Has no effect in recursive listing. On
            non-recursive listing, if False, then directories are omitted.
        list_objs:
            optional, default True. If False, then objects are omitted.
        limit:
            optional. If specified, then lists at most this many items.

    Returns:
        an iterator of S3Obj.

    Examples:
        # set up
        >>> s3 = boto3.resource('s3')
        >>> bucket = s3.Bucket(name)

        # iterate through all S3 objects under some dir
        >>> for p in s3list(bucket, 'some/dir'):
        ...     print(p)

        # iterate through up to 20 S3 objects under some dir, starting with foo_0010
        >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
        ...     print(p)

        # non-recursive listing under some dir:
        >>> for p in s3list(bucket, 'some/dir', recursive=False):
        ...     print(p)

        # non-recursive listing under some dir, listing only dirs:
        >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
        ...     print(p)
    """
    kwargs = dict()
    if start is not None:
        if not start.startswith(path):
            start = os.path.join(path, start)
        # note: need to use a string just smaller than start, because
        # the list_object API specifies that start is excluded (the first
        # result is *after* start).
        kwargs.update(Marker=__prev_str(start))
    if end is not None:
        if not end.startswith(path):
            end = os.path.join(path, end)
    if not recursive:
        kwargs.update(Delimiter='/')
        if not path.endswith('/'):
            path += '/'
    kwargs.update(Prefix=path)
    if limit is not None:
        kwargs.update(PaginationConfig={'MaxItems': limit})

    paginator = bucket.meta.client.get_paginator('list_objects')
    for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
        q = []
        if 'CommonPrefixes' in resp and list_dirs:
            q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
        if 'Contents' in resp and list_objs:
            q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
        # note: even with sorted lists, it is faster to sort(a+b)
        # than heapq.merge(a, b) at least up to 10K elements in each list
        q = sorted(q, key=attrgetter('key'))
        if limit is not None:
            q = q[:limit]
            limit -= len(q)
        for p in q:
            if end is not None and p.key >= end:
                return
            yield p


def __prev_str(s):
    if len(s) == 0:
        return s
    s, c = s[:-1], ord(s[-1])
    if c > 0:
        s += chr(c - 1)
    s += ''.join(['\u7FFF' for _ in range(10)])
    return s

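Test:

The following is helpful for testing the behavior of the paginator and list_objects. It creates a number of dirs and files. Since pages hold up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each having one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects for each dir (plus one object under each dir, of course; S3 stores only objects).

import concurrent.futures  # importing concurrent alone does not load the futures submodule
import os

# assumes bucket = boto3.resource('s3').Bucket(name), as in the examples above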
def genkeys(top='tmp/test', n=2000):
    for k in range(n):
        if k % 100 == 0:
            print(k)
        for name in [
            os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
        ]:
            yield name


with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
    executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())

The resulting structure is:

./dirs/0000_dir/foo
./dirs/0001_dir/foo
./dirs/0002_dir/foo
...
./dirs/1999_dir/foo
./mixed/0000_dir/foo
./mixed/0000_foo_a
./mixed/0000_foo_b
./mixed/0001_dir/foo
./mixed/0001_foo_a
./mixed/0001_foo_b
./mixed/0002_dir/foo
./mixed/0002_foo_a
./mixed/0002_foo_b
...
./mixed/1999_dir/foo
./mixed/1999_foo_a
./mixed/1999_foo_b

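With a little doctoring of the code given above for s3list to inspect the responses from the paginator, you can observe some fun facts:

The Marker is really exclusive. Given Marker=topdir + 'mixed/0500_foo_a', the listing starts after that key (as per the Amazon S3 API), i.e., with .../mixed/0500_foo_b. That is the reason for __prev_str().
Using Delimiter, when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It is pretty good at not building enormous responses.
By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).
Passing a limit in the form of PaginationConfig={'MaxItems': limit} limits only the number of keys, not the common prefixes. We deal with that by further truncating the stream of our iterator.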
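The last observation suggests enforcing the limit on the consumer side as well, rather than relying on MaxItems alone. A minimal sketch using itertools.islice follows; take and the stand-in generator fake_listing are hypothetical names used only for illustration, and no AWS call is made.

```python
from itertools import islice


def take(stream, limit=None):
    """Return an iterator over at most `limit` items (all items if None)."""
    return islice(stream, limit)


def fake_listing():
    # Stand-in for a lazy listing such as s3list(); yields prefixes one by one.
    for k in range(2000):
        yield f"mixed/{k:04d}_dir/"


first_three = list(take(fake_listing(), 3))
print(first_three)
```

Since islice stops pulling from the underlying generator once the limit is reached, a real paginator wrapped this way would not request further pages.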

Question 6

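The big realisation with S3 is that there are no folders/directories, just keys. The apparent folder structure is simply prepended to the filename to become the 'Key', so to list the contents of myBucket's some/path/to/the/file/ you can try:

import boto3

s3 = boto3.client('s3')
# a single list_objects_v2 call returns at most 1000 keys
for obj in s3.list_objects_v2(Bucket="myBucket", Prefix="some/path/to/the/file/")['Contents']:
    print(obj['Key'])

which would give you something like: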

some/path/to/the/file/yo.jpg
some/path/to/the/file/meAndYou.gif
...

Question 7

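I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with the Bucket and StartAfter parameters. Note that StartAfter only sets the key after which the listing begins; it does not filter by prefix.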

import boto3

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects['Contents']:
    print(obj['Key'])

The output of the code above would be:

firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2

Boto3 list_objects_v2 documentation
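Since StartAfter marks a starting point in key order rather than a folder, it may help to see how it differs from Prefix. The following pure-Python sketch only imitates the two behaviours on an in-memory list; start_after, with_prefix, and the two extra keys are hypothetical and are not part of boto3.

```python
def start_after(keys, marker):
    """Keys strictly greater than `marker`, in sorted order (StartAfter-like)."""
    return [k for k in sorted(keys) if k > marker]


def with_prefix(keys, prefix):
    """Keys that begin with `prefix`, in sorted order (Prefix-like)."""
    return [k for k in sorted(keys) if k.startswith(prefix)]


keys = [
    "firstlevelFolder/secondLevelFolder/item1",
    "firstlevelFolder/secondLevelFolder/item2",
    "firstlevelFolder/thirdFolder/item3",  # hypothetical extra key
    "zzz/unrelated",                       # hypothetical extra key
]
marker = "firstlevelFolder/secondLevelFolder"

print(start_after(keys, marker))        # includes the two extra keys
print(with_prefix(keys, marker + "/"))  # only item1 and item2
```

If the bucket holds keys that sort after the folder of interest, passing Prefix (optionally together with StartAfter) keeps the listing confined to that folder.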

To strip out only the directory name secondLevelFolder, I just used the Python method split():

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in theobjects['Contents']:
    # keys are already str in Python 3, so no encode() step is needed
    direcoryName = obj['Key'].split('/')
    print(direcoryName[1])

The output of the code above would be:

secondLevelFolder
secondLevelFolder

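Python split() documentation

If you'd like to get the directory name AND the contents item name, replace the print line with the following:

print("{}/{}".format(direcoryName[1], direcoryName[2]))

And the following will be output:

secondLevelFolder/item1
secondLevelFolder/item2

Hope this helps.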

Question 8

The following works for me. S3 objects:

s3://bucket/
    form1/
       section11/
          file111
          file112
       section12/
          file121
    form2/
       section21/
          file211
          file112
       section22/
          file221
          file222
          ...
      ...
   ...

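Using:

from boto3.session import Session

session = Session()  # picks up the default credentials and region
s3client = session.client('s3')
bucket = 'bucket'  # bucket name from the tree above
resp = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter="/")
forms = [x['Prefix'] for x in resp['CommonPrefixes']]

we get: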

form1/
form2/
...

With:

resp = s3client.list_objects(Bucket=bucket, Prefix='form1/', Delimiter="/")
sections = [x['Prefix'] for x in resp['CommonPrefixes']]

we get:

form1/section11/
form1/section12/
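The two calls above can be generalised into a small recursive walk that lists one level at a time. This is a sketch under stated assumptions: walk_prefixes and max_depth are hypothetical names, client is expected to behave like boto3.client("s3"), and depth is simply the number of '/' characters in the prefix.

```python
def walk_prefixes(client, bucket_name, prefix="", max_depth=2):
    """Yield (depth, prefix) for each "folder", at most `max_depth` levels down.

    `client` is expected to behave like boto3.client("s3").
    """
    if max_depth <= 0:
        return
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            child = entry["Prefix"]
            yield child.count("/"), child
            # Recurse one level deeper under this "folder".
            yield from walk_prefixes(client, bucket_name, child, max_depth - 1)


# Usage sketch (needs AWS credentials; the bucket name is a placeholder):
#   import boto3
#   for depth, name in walk_prefixes(boto3.client("s3"), "bucket"):
#       print("  " * (depth - 1) + name)
```

Each "folder" costs at least one request, so the walk is best kept shallow on buckets with many prefixes.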

Question 9

I thought there must be a way to do this with boto3, since the AWS CLI does it (presumably without fetching and iterating through all the keys in the bucket) when you run aws s3 ls s3://my-bucket/.

https://github.com/aws/aws-cli/blob/0fedc4c1b6a7aee13e2ed10c3ada778c702c22c3/awscli/customizations/s3/subcommands.py#L499

It looks like they do indeed use Prefix and Delimiter. By modifying that code a bit, I was able to write a function that gets all directories at the root level of a bucket:

import boto3


def list_folders_in_bucket(bucket):
    paginator = boto3.client('s3').get_paginator('list_objects')
    folders = []
    iterator = paginator.paginate(Bucket=bucket, Prefix='', Delimiter='/', PaginationConfig={'PageSize': None})
    for response_data in iterator:
        prefixes = response_data.get('CommonPrefixes', [])
        for prefix in prefixes:
            prefix_name = prefix['Prefix']
            if prefix_name.endswith('/'):
                folders.append(prefix_name.rstrip('/'))
    return folders

Question 10

The following is a possible solution:

def download_list_s3_folder(my_bucket,my_folder):
    import boto3
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=my_bucket,
        Prefix=my_folder,
        MaxKeys=1000)
    # Contents is missing when nothing matches the prefix, so default to []
    return [item["Key"] for item in response.get('Contents', [])]

Question 11

Using `boto3.resource`

This builds upon the answer by itz-azhar to apply an optional limit. It is evidently much simpler to use than the boto3.client version.

import logging
from typing import List, Optional

import boto3
from boto3_type_annotations.s3 import ObjectSummary  # pip install boto3_type_annotations

log = logging.getLogger(__name__)
_S3_RESOURCE = boto3.resource("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> List[ObjectSummary]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    return list(_S3_RESOURCE.Bucket(bucket_name).objects.limit(count=limit).filter(Prefix=prefix))


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

Using `boto3.client`

This uses list_objects_v2 and builds upon the answer by CpILL to allow retrieving more than 1000 objects.

import logging
from typing import cast, List

import boto3

log = logging.getLogger(__name__)
_S3_CLIENT = boto3.client("s3")

def s3_list(bucket_name: str, prefix: str, *, limit: int = cast(int, float("inf"))) -> List[dict]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    contents: List[dict] = []
    continuation_token = None
    if limit <= 0:
        return contents
    while True:
        max_keys = min(1000, limit - len(contents))
        request_kwargs = {"Bucket": bucket_name, "Prefix": prefix, "MaxKeys": max_keys}
        if continuation_token:
            log.info(  # type: ignore
                "Listing %s objects in s3://%s/%s using continuation token ending with %s with %s objects listed thus far.",
                max_keys, bucket_name, prefix, continuation_token[-6:], len(contents))  # pylint: disable=unsubscriptable-object
            response = _S3_CLIENT.list_objects_v2(**request_kwargs, ContinuationToken=continuation_token)
        else:
            log.info("Listing %s objects in s3://%s/%s with %s objects listed thus far.", max_keys, bucket_name, prefix, len(contents))
            response = _S3_CLIENT.list_objects_v2(**request_kwargs)
        assert response["ResponseMetadata"]["HTTPStatusCode"] == 200
        contents.extend(response["Contents"])
        is_truncated = response["IsTruncated"]
        if (not is_truncated) or (len(contents) >= limit):
            break
        continuation_token = response["NextContinuationToken"]
    assert len(contents) <= limit
    log.info("Returning %s objects from s3://%s/%s.", len(contents), bucket_name, prefix)
    return contents


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)

Question 12

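First of all, there is no real folder concept in S3. You definitely can have a file at '/folder/subfolder/myfile.txt' and no folder nor subfolder.

To "simulate" a folder in S3, you must create an empty file with a '/' at the end of its name (see Amazon S3 boto - how to create a folder?).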
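As a rough illustration of that "empty file with a trailing slash" idea, here is a minimal sketch written against the boto3 client API (this answer itself uses the older boto library). The function name create_folder_marker is an assumption, and client is expected to behave like boto3.client("s3").

```python
def create_folder_marker(client, bucket_name, folder_name):
    """Create a zero-byte object whose key ends with '/'.

    `client` is expected to behave like boto3.client("s3").
    """
    key = folder_name if folder_name.endswith("/") else folder_name + "/"
    client.put_object(Bucket=bucket_name, Key=key, Body=b"")
    return key


# Usage sketch (needs AWS credentials; names are placeholders):
#   import boto3
#   create_folder_marker(boto3.client("s3"), "my-bucket-name", "first-level")
```

The marker is an ordinary object, so it appears in listings alongside the real keys under the same prefix.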

For your problem, you should probably use the method get_all_keys with the two parameters prefix and delimiter:

https://github.com/boto/boto/blob/develop/boto/s3/bucket.py#L427

for key in bucket.get_all_keys(prefix='first-level/', delimiter='/'):
    print(key.name)

Question 13

I know that boto3 is the topic being discussed here, but I find that it is usually quicker and more intuitive to simply use awscli for something like this. awscli retains more capabilities than boto3, for what it's worth.

For example, if I have objects saved in "subfolders" associated with a given bucket, I can list them all with something like this:

1) 'mydata' = bucket name

2) 'f1/f2/f3' = "path" leading to the "files" or objects

3) 'foo2.csv, barfar.segy, gar.tar' = all objects "inside" f3

So we can think of the "absolute path" leading to these objects as 'mydata/f1/f2/f3/foo2.csv'...

Using awscli commands, we can easily list all objects inside a given "subfolder" via:

aws s3 ls s3://mydata/f1/f2/f3/ --recursive

Question 14

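Here is a piece of code that can handle pagination, if you are trying to fetch a large number of S3 bucket objects:

import boto3


def get_matching_s3_objects(bucket, prefix="", suffix=""):
    """Yield objects in the bucket whose keys match the prefix(es) and suffix."""
    s3 = boto3.client("s3")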
    paginator = s3.get_paginator("list_objects_v2")

    kwargs = {'Bucket': bucket}

    # We can pass the prefix directly to the S3 API.  If the user has passed
    # a tuple or list of prefixes, we go through them one by one.
    if isinstance(prefix, str):
        prefixes = (prefix, )
    else:
        prefixes = prefix

    for key_prefix in prefixes:
        kwargs["Prefix"] = key_prefix

        for page in paginator.paginate(**kwargs):
            try:
                contents = page["Contents"]
            except KeyError:
                # nothing under this prefix; move on to the next prefix
                break

            for obj in contents:
                key = obj["Key"]
                if key.endswith(suffix):
                    yield obj

Question 15

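For Boto 1.13.3, it turns out to be as simple as this (if you skip all the pagination considerations, which were covered in other answers):

import boto3


def get_sub_paths(bucket, prefix):
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter='/',  # needed for CommonPrefixes to appear in the response
        MaxKeys=1000)
    return [item["Prefix"] for item in response.get('CommonPrefixes', [])]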
