Download all files from an S3 bucket using Boto3

Question 1

I'm using boto3 to get files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is

#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

It works fine as long as the bucket contains only files. If there is a folder inside the bucket, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this the proper way to download a complete S3 bucket using boto3? How do I download folders?

Question 2

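When working with buckets that hold more than 1000 objects, you need a solution that uses NextContinuationToken to walk through sequential sets of at most 1000 keys. This solution first compiles a list of objects, then iteratively creates the specified directories and downloads the existing objects.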

import boto3
import os

s3_client = boto3.client('s3')

def download_dir(prefix, local, bucket, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket':bucket,
        'Prefix':prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        # 'Contents' is missing when nothing matches the prefix
        contents = results.get('Contents', [])
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)
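
The token loop can be tried without AWS access. The following is only a minimal sketch: `iter_keys` and `FakeClient` are hypothetical names introduced for illustration, and the stub mimics just the two `list_objects_v2` response fields used here (`Contents` and `NextContinuationToken`). With a real bucket you would pass `boto3.client('s3')` instead of the stub.

```python
def iter_keys(client, bucket, prefix=''):
    """Yield every key under prefix, following continuation tokens."""
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        results = client.list_objects_v2(**kwargs)
        # 'Contents' is absent when a page has no matching objects
        for obj in results.get('Contents', []):
            yield obj['Key']
        token = results.get('NextContinuationToken')
        if token is None:
            break
        kwargs['ContinuationToken'] = token


class FakeClient:
    """Hypothetical stand-in for an S3 client; serves keys two per page."""

    def __init__(self, keys, page_size=2):
        self.keys = sorted(keys)
        self.page_size = page_size

    def list_objects_v2(self, Bucket, Prefix='', ContinuationToken=None):
        matching = [k for k in self.keys if k.startswith(Prefix)]
        start = int(ContinuationToken) if ContinuationToken else 0
        page = matching[start:start + self.page_size]
        response = {'Contents': [{'Key': k} for k in page]} if page else {}
        if start + self.page_size < len(matching):
            # More keys remain, so hand back a token for the next call
            response['NextContinuationToken'] = str(start + self.page_size)
        return response
```

Writing the loop as a generator means the caller can start downloading before the whole listing has been collected.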

Question 3

I had the same need and wrote the following function, which downloads the files recursively.

Directories are created locally only if they contain files.

import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

The function is called like this:

def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp', bucket='my-bucket')
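
The recursion relies on how a listing with Delimiter='/' is split: keys directly under the prefix come back in Contents, while deeper keys are rolled up into CommonPrefixes. The pure-Python sketch below only imitates that grouping on an in-memory list so the behaviour can be inspected offline. `list_one_level` is a hypothetical helper, not a boto3 function, and it ignores paging.

```python
def list_one_level(keys, prefix='', delimiter='/'):
    """Mimic one delimited listing: return (contents, common_prefixes)."""
    contents = []
    common_prefixes = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter is rolled up into one prefix
            sub = prefix + rest.split(delimiter, 1)[0] + delimiter
            if sub not in common_prefixes:
                common_prefixes.append(sub)
        else:
            # No further delimiter: the key sits directly under the prefix
            contents.append(key)
    return contents, common_prefixes
```

Note that a placeholder key equal to the prefix itself (for example 'clientconf/') lands in the contents list, so code that downloads every entry in Contents has to cope with it.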

Question 4

Amazon S3 does not have folders or directories. It is a flat file structure.

To maintain the appearance of directories, path names are stored as part of the object key (the file name). For example:

images/foo.jpg

In this case, the whole key is images/foo.jpg, rather than just foo.jpg.

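I suspect your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.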

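You could either truncate the file name and save only the .8Df54234 portion, or create the necessary directories before writing the files. Note that the directories may be nested several levels deep.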
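
Both options can be sketched in a few lines. The helper name `local_path_for_key` and its `flatten` flag are assumptions made for this illustration. It only computes the local path and creates directories, leaving the download call to the caller.

```python
import os


def local_path_for_key(key, target_dir, flatten=False):
    """Return the local path for an S3 key, creating parent directories.

    Returns None for keys ending in '/', which are folder placeholders.
    """
    if key.endswith('/'):
        return None
    if flatten:
        # Option 1: keep only the last path component
        relative = key.rsplit('/', 1)[-1]
    else:
        # Option 2: mirror the key's prefix as nested local directories
        relative = os.path.join(*key.split('/'))
    path = os.path.join(target_dir, relative)
    parent = os.path.dirname(path)
    if parent:
        # exist_ok makes this safe to call for every key
        os.makedirs(parent, exist_ok=True)
    return path
```

With the question's client, the result would be passed as the third argument of s3.download_file whenever it is not None.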

An easier way is to use the AWS Command Line Interface (CLI), which does all of this work for you, e.g.:

aws s3 cp --recursive s3://my_bucket_name local_folder

There is also a sync option, which copies only new and modified files.
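
If you stay in Python, a sync-like check can be approximated. The sketch below assumes, as a simplification, that an object needs downloading when the local file is missing, has a different size, or is older than the remote copy. `needs_download` is a hypothetical helper and not the CLI's actual implementation.

```python
import os
from datetime import datetime, timezone


def needs_download(local_path, remote_size, remote_last_modified):
    """Return True if the local copy is missing, a different size, or older.

    remote_last_modified must be a timezone-aware datetime.
    """
    if not os.path.exists(local_path):
        return True
    stat = os.stat(local_path)
    if stat.st_size != remote_size:
        return True
    # Compare in UTC so the result does not depend on the local timezone
    local_mtime = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return local_mtime < remote_last_modified
```

With boto3, the size and timestamp would come from the 'Size' and 'LastModified' fields of each entry in a listing's Contents.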

Question 5

import os
import boto3

#initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
my_bucket = s3.Bucket('my_bucket_name')

# download file into current directory
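for s3_object in my_bucket.objects.all():
    # Split s3_object.key into path and file name; using the full key as the
    # local name fails with "file not found" when the directory does not exist.
    path, filename = os.path.split(s3_object.key)
    # Keys ending in "/" are folder placeholders and have no file name
    if filename:
        my_bucket.download_file(s3_object.key, filename)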
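
Because only the file name part of each key is kept, two objects such as a/x.txt and b/x.txt end up at the same local path and the later download overwrites the earlier one. A small check like the following can flag that before downloading. `find_name_collisions` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict


def find_name_collisions(keys):
    """Map each clashing file name to the keys that would overwrite each other."""
    by_name = defaultdict(list)
    for key in keys:
        # S3 keys always use '/' as the separator
        filename = key.rsplit('/', 1)[-1]
        if filename:  # skip folder placeholders such as 'images/'
            by_name[filename].append(key)
    return {name: ks for name, ks in by_name.items() if len(ks) > 1}
```

It could be fed with the keys from my_bucket.objects.all(), for example `find_name_collisions(o.key for o in my_bucket.objects.all())`.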

Question 6

I am currently doing the job with the following:

#!/usr/bin/python
import os
import boto3

s3 = boto3.client('s3')
# A single list_objects call returns at most 1000 keys
objects = s3.list_objects(Bucket='bucket')['Contents']
for s3_key in objects:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3.download_file('bucket', s3_object, s3_object)
    else:
        # Keys ending in "/" are folder placeholders: create the directory locally
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)

Although it does the job, I'm not sure it is a good way to do it. I'm leaving it here to help other users and to invite further answers with a better way of achieving this.

Question 7

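Better late than never :) The previous answer using the paginator is really good. However, it is recursive, so you might end up hitting Python's recursion limit. Here is an alternative approach with a couple of extra checks.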

import os
import errno
import boto3


def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Downloads recursively the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        # 'Contents' is missing when nothing matches the prefix
        for key in result.get('Contents', []):
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)


client = boto3.client('s3')

download_dir(client, 'bucket-name', 'path/to/data', 'downloads')

Question 8

Here is a workaround that runs the AWS CLI in the same process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os

from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:

        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process
        # main() takes the argument list as a single parameter
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

To run it:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')

Question 9

Fetching all the files in one go is a very bad idea; you should fetch them in batches instead.

Here is one implementation I use to fetch a particular folder (directory) from S3:

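from boto3.session import Session

def get_directory(directory_path, download_path, exclude_file_names):
    # aws_access_key_id, aws_secret_access_key, region_name and bucket_name
    # are expected to be defined elsewhere (for example as module-level settings)
    session = Session(
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region_name,
    )

    # get instances for resource and bucket
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)

    listing = resource.meta.client.list_objects(Bucket=bucket_name, Prefix=directory_path)
    for s3_key in listing.get('Contents', []):
        s3_object = s3_key['Key']
        # skip excluded keys and folder placeholders (keys ending in '/')
        if s3_object not in exclude_file_names and not s3_object.endswith('/'):
            # download_path is expected to end with a path separator
            bucket.download_file(s3_object, download_path + s3_object.split('/')[-1])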

If you still want to get the whole bucket, use the CLI, as @John Rotenstein mentioned:

aws s3 cp --recursive s3://bucket_name download_path
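
To make the batching explicit in boto3, one option is a paginator with a page size. The following is a rough sketch only: `download_in_batches` and its `page_size` default are assumptions, and `client` is expected to be an S3 client such as the one returned by `boto3.client('s3')`.

```python
import os


def download_in_batches(client, bucket, prefix, download_path, page_size=100):
    """Download objects under prefix one listing page at a time; return the count."""
    paginator = client.get_paginator('list_objects_v2')
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        # Ask for at most page_size keys per listing request
        PaginationConfig={'PageSize': page_size},
    )
    count = 0
    for page in pages:
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/'):
                continue  # folder placeholder, nothing to download
            # Files are flattened into download_path, as in the function above
            dest = os.path.join(download_path, key.split('/')[-1])
            client.download_file(bucket, key, dest)
            count += 1
    return count
```

Each page is processed before the next listing request is made, so only one batch of keys is held in memory at a time.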

Question 10

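import os
import boto3

my_bucket = boto3.resource('s3').Bucket('my_bucket_name')

for objs in my_bucket.objects.all():
    print(objs.key)
    if objs.key.endswith('/'):
        continue  # folder placeholder, nothing to download
    # S3 keys always use '/' as the separator, regardless of os.sep
    path = '/tmp/' + '/'.join(objs.key.split('/')[:-1])
    # exist_ok avoids an error when the directory is already there
    os.makedirs(path, exist_ok=True)
    my_bucket.download_file(objs.key, '/tmp/' + objs.key)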

This code downloads the content into the /tmp/ directory. You can change the directory if you want.

Question 11

If you want to call a bash command from Python, here is a simple way to load files from a folder in an S3 bucket into a local folder (on a Linux machine):

import boto3
import subprocess
import os

###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###

# 1. Load the list of files existing in the bucket folder
prefix = "{}/".format(bucket_folder_name)
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(my_bucket_name)
for object_summary in my_bucket.objects.filter(Prefix=prefix):
    # Keep the name relative to the folder so it can be compared with os.listdir()
    name = object_summary.key[len(prefix):]
    if name and not name.endswith('/'):
        FILES_NAMES.append(name)

# 2. List only new files that do not exist in the local folder (to not copy everything!)
new_filenames = list(set(FILES_NAMES) - set(os.listdir(local_folder_path)))

# 3. Time to load the files into your destination folder
for new_filename in new_filenames:
    # An argument list (no shell) avoids quoting problems with unusual key names
    source = "s3://{}/{}{}".format(my_bucket_name, prefix, new_filename)
    return_code = subprocess.call(["aws", "s3", "cp", source, local_folder_path])
    if return_code != 0:
        print("ALERT: loading files not working correctly, please re-check new loaded files")

Question 12

I had a similar requirement. After reading a few of the solutions above and getting help from other websites, I came up with the script below. I just wanted to share it in case it helps anyone.

from boto3.session import Session
import os

def sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path):    
    session = Session(aws_access_key_id=access_key_id,aws_secret_access_key=secret_access_key)
    s3 = session.resource('s3')
    your_bucket = s3.Bucket(bucket_name)
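    for s3_file in your_bucket.objects.all():
        # Skip folder placeholder keys, which have no content to download
        if folder in s3_file.key and not s3_file.key.endswith('/'):
            # Split on '/' so the path is built with the local OS separator
            file = os.path.join(destination_path, *s3_file.key.split('/'))
            if not os.path.exists(os.path.dirname(file)):
                os.makedirs(os.path.dirname(file))
            your_bucket.download_file(s3_file.key, file)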
sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path)

Question 13

Reposting @glefait's answer with an if condition at the end to avoid OS error 20. The first key returned is the folder name itself, which cannot be written to the destination path.

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            print("Content: ",result)
            dest_pathname = os.path.join(local, file.get('Key'))
            print("Dest path: ",dest_pathname)
            if not os.path.exists(os.path.dirname(dest_pathname)):
                print("here last if")
                os.makedirs(os.path.dirname(dest_pathname))
            print("else file key: ", file.get('Key'))
            if not file.get('Key') == dist:
                print("Key not equal? ",file.get('Key'))
                resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)
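
The same guard can be written as a small filter that is applied before any download is attempted. This is only a sketch: `downloadable_keys` is a hypothetical helper, and it additionally assumes that any key ending in '/' is a folder placeholder rather than a real file.

```python
def downloadable_keys(keys, prefix):
    """Return the keys under prefix that represent real files."""
    return [
        key for key in keys
        # Drop the prefix's own placeholder and any nested folder placeholders
        if key != prefix and not key.endswith('/')
    ]
```

Filtering first keeps the download loop itself free of special cases.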

Question 14

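I had been running into this problem for a while, and in all the forums I went through I never saw a complete end-to-end snippet that works. So I took all the pieces (and added a few of my own) and built a full end-to-end S3 downloader!

It not only downloads the files automatically but also creates the subdirectories on local storage when the S3 files live in subdirectories. In my application's case I need to set permissions and owners, so I added that too (it can be commented out if you don't need it).

This has been tested and works in a Docker environment (K8), but I have added the environment variables to the script in case you want to test or run it locally.

I hope this helps someone in their quest for S3 download automation. I also welcome any advice or information on how this could be optimized further. Note that the script is written against the legacy boto library, and it deletes each object from the bucket once it has been downloaded.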

#!/usr/bin/python3
import gc
import logging
import os
import signal
import sys
import time
from datetime import datetime

import boto
from boto.exception import S3ResponseError
from pythonjsonlogger import jsonlogger

formatter = jsonlogger.JsonFormatter('%(message)%(levelname)%(name)%(asctime)%(filename)%(lineno)%(funcName)')

json_handler_out = logging.StreamHandler()
json_handler_out.setFormatter(formatter)

#Manual Testing Variables If Needed
#os.environ["DOWNLOAD_LOCATION_PATH"] = "some_path"
#os.environ["BUCKET_NAME"] = "some_bucket"
#os.environ["AWS_ACCESS_KEY"] = "some_access_key"
#os.environ["AWS_SECRET_KEY"] = "some_secret"
#os.environ["LOG_LEVEL_SELECTOR"] = "DEBUG, INFO, or ERROR"

#Setting Log Level Test
logger = logging.getLogger('json')
logger.addHandler(json_handler_out)
logger_levels = {
    'ERROR' : logging.ERROR,
    'INFO' : logging.INFO,
    'DEBUG' : logging.DEBUG
}
logger_level_selector = os.environ["LOG_LEVEL_SELECTOR"]
logger.setLevel(logger_levels[logger_level_selector])

#Getting Date/Time
now = datetime.now()
logger.info("Current date and time : ")
logger.info(now.strftime("%Y-%m-%d %H:%M:%S"))

#Establishing S3 Variables and Download Location
download_location_path = os.environ["DOWNLOAD_LOCATION_PATH"]
bucket_name = os.environ["BUCKET_NAME"]
aws_access_key_id = os.environ["AWS_ACCESS_KEY"]
aws_access_secret_key = os.environ["AWS_SECRET_KEY"]
logger.debug("Bucket: %s" % bucket_name)
logger.debug("Key: %s" % aws_access_key_id)
# Avoid writing the secret key itself to the logs
logger.debug("Secret: <redacted>")
logger.debug("Download location path: %s" % download_location_path)

#Creating Download Directory
if not os.path.exists(download_location_path):
    logger.info("Making download directory")
    os.makedirs(download_location_path)

#Signal Hooks are fun
class GracefulKiller:
    kill_now = False
    def __init__(self):
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)
    def exit_gracefully(self, signum, frame):
        self.kill_now = True

#Downloading from S3 Bucket
def download_s3_bucket():
    conn = boto.connect_s3(aws_access_key_id, aws_access_secret_key)
    logger.debug("Connection established: ")
    bucket = conn.get_bucket(bucket_name)
    logger.debug("Bucket: %s" % str(bucket))
    bucket_list = bucket.list()
#    logger.info("Number of items to download: {0}".format(len(bucket_list)))

    for s3_item in bucket_list:
        key_string = str(s3_item.key)
        logger.debug("S3 Bucket Item to download: %s" % key_string)
        s3_path = download_location_path + "/" + key_string
        logger.debug("Downloading to: %s" % s3_path)
        local_dir = os.path.dirname(s3_path)

        if not os.path.exists(local_dir):
            logger.info("Local directory doesn't exist, creating it... %s" % local_dir)
            os.makedirs(local_dir)
            logger.info("Updating local directory permissions to %s" % local_dir)
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(local_dir, 0o775)
            os.chown(local_dir, 60001, 60001)
        logger.debug("Local directory for download: %s" % local_dir)
        try:
            logger.info("Downloading File: %s" % key_string)
            s3_item.get_contents_to_filename(s3_path)
            logger.info("Successfully downloaded File: %s" % s3_path)
            #Updating Permissions
            logger.info("Updating Permissions for %s" % str(s3_path))
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(s3_path, 0o664)
            os.chown(s3_path, 60001, 60001)
        except (OSError, S3ResponseError) as e:
            logger.error("Fatal error in s3_item.get_contents_to_filename", exc_info=True)
            # logger.error("Exception in file download from S3: {}".format(e))
            continue
        logger.info("Deleting %s from S3 Bucket" % str(s3_item.key))
        s3_item.delete()

def main():
    killer = GracefulKiller()
    while not killer.kill_now:
        logger.info("Checking for new files on S3 to download...")
        download_s3_bucket()
        logger.info("Done checking for new files, will check in 120s...")
        gc.collect()
        sys.stdout.flush()
        time.sleep(120)
if __name__ == '__main__':
    main()

Question 15

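From the AWS S3 docs (How do I use folders in an S3 bucket?):

In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.

For example, you can create a folder on the console named photos and store an object named myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.

To download all files from "mybucket" into the current directory, respecting the bucket's emulated directory structure (creating the folders from the bucket if they don't already exist locally):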

import boto3
import os

bucket_name = "mybucket"
s3 = boto3.client("s3")
objects = s3.list_objects(Bucket = bucket_name)["Contents"]
for s3_object in objects:
    s3_key = s3_object["Key"]
    path, filename = os.path.split(s3_key)
    if len(path) != 0 and not os.path.exists(path):
        os.makedirs(path)
    if not s3_key.endswith("/"):
        download_to = path + '/' + filename if path else filename
        s3.download_file(bucket_name, s3_key, download_to)