Downloading a picture via urllib and python

183

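So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few programs on here that do something similar, but nothing quite like what I need. The one I found most similar is right here ( http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images ). I tried using this code: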

>>> import urllib
>>> image = urllib.URLopener()
>>> image.retrieve("http://www.gunnerkrigg.com//comics/00000001.jpg","00000001.jpg")
('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>)

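I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I know how to handle the rest. Essentially, just use a for loop, split the string at '00000000'.'jpg', and increment the '00000000' up to the largest number, which I would have to determine somehow. Any recommendations on the best way to do this, or on how to download the file correctly?

Thanks!

EDIT 6/15/10

Here is the completed script; it saves the files to any directory you choose. For some odd reason the files weren't downloading, and then they just did. Any suggestions on how to clean it up would be much appreciated. I'm currently working out how to find out how many comics exist on the site so I can get just the latest one, rather than having the program quit after a certain number of exceptions are raised.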

import urllib
import os

comicCounter=len(os.listdir('/file'))+1  # reads the number of files in the folder to start downloading at the next comic
errorCount=0

def download_comic(url,comicName):
    """
    download a comic in the form of

    url = http://www.example.com
    comicName = '00000000.jpg'
    """
    image=urllib.URLopener()
    image.retrieve(url,comicName)  # download comicName at URL

while comicCounter <= 1000:  # not the most elegant solution
    os.chdir('/file')  # set where files download to
    try:
        if comicCounter < 10:  # needed to break into 10^n segments because comic names are a set of zeros followed by a number
            comicNumber=str('0000000'+str(comicCounter))  # string containing the eight digit comic number
            comicName=str(comicNumber+".jpg")  # string containing the file name
            url=str("http://www.gunnerkrigg.com//comics/"+comicName)  # creates the URL for the comic
            comicCounter+=1  # increments the comic counter to go to the next comic, must be before the download in case the download raises an exception
            download_comic(url,comicName)  # uses the function defined above to download the comic
            print url
        if 10 <= comicCounter < 100:
            comicNumber=str('000000'+str(comicCounter))
            comicName=str(comicNumber+".jpg")
            url=str("http://www.gunnerkrigg.com//comics/"+comicName)
            comicCounter+=1
            download_comic(url,comicName)
            print url
        if 100 <= comicCounter < 1000:
            comicNumber=str('00000'+str(comicCounter))
            comicName=str(comicNumber+".jpg")
            url=str("http://www.gunnerkrigg.com//comics/"+comicName)
            comicCounter+=1
            download_comic(url,comicName)
            print url
        else:  # quit the program if any number outside this range shows up
            quit
    except IOError:  # urllib raises an IOError for a 404 error, when the comic doesn't exist
        errorCount+=1  # add one to the error count
        if errorCount>3:  # if more than three errors occur during downloading, quit the program
            break
        else:
            print str("comic"+ ' ' + str(comicCounter) + ' ' + "does not exist")  # otherwise say that the certain comic number doesn't exist
print "all comics are up to date"  # prints if all comics are downloaded

python urllib2 urllib

— Mike

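Ok, I got them all to download! Now I'm stuck with a very inelegant solution for determining how many comics are online... I'm basically running the program up to a number that I know is larger than the number of comics, and raising an exception when a comic doesn't exist. When the exception comes up more than twice (since I don't think more than two comics will be missing), it quits the program, assuming there are no more to download. Since I don't have access to the website, is there a best way to determine how many files there are on the website? I'll post my code in a moment.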

— Mike
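
The stopping rule described in the comment above (keep requesting higher numbers and give up after a few misses in a row) could be sketched roughly as follows. This is only an illustration under assumptions: `fetch` is a hypothetical callable that returns True when comic `n` was downloaded and False when it does not exist, and `max_misses` is a made-up parameter name. Unlike the script in the question, the miss counter is reset after every success, so isolated gaps don't accumulate.

```python
def download_until_missing(fetch, start=1, max_misses=3):
    """Call fetch(n) for n = start, start + 1, ... and stop after
    max_misses consecutive failures. Returns the numbers that succeeded.

    fetch is a hypothetical callable: fetch(n) -> bool.
    """
    downloaded = []
    misses = 0
    n = start
    while misses < max_misses:
        if fetch(n):
            downloaded.append(n)
            misses = 0  # a success resets the run of misses
        else:
            misses += 1
        n += 1
    return downloaded
```

Passing the download step in as a function keeps the loop independent of urllib or requests, so either can be plugged in. It still cannot tell "the comic doesn't exist yet" from a long gap in the numbering, so `max_misses` remains a guess.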

creativebe.com/icombiner/merge-jpg.html I used this program to merge all the .jpg files into one PDF. Works great, and it's free!

— Mike

7

Consider posting your solution as an answer and removing it from the question. Question posts are for asking questions, answer posts are for answers :-)

— BartoszKP

Why is this tagged with beautifulsoup? This post shows up in the list of top beautifulsoup questions.

— P0W

1

@P0W I've removed the discussed tag.

— kmonsoor

252

Python 2

Using urllib.urlretrieve

import urllib
urllib.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")

Python 3

Using urllib.request.urlretrieve (part of Python 3's legacy interface, works exactly the same)

import urllib.request
urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")

— Matthew Flaschen
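
For the numbering part of the question, the 10^n branches aren't needed: `str.zfill` (or a format spec such as `{:08d}`) pads the number with zeros. A minimal sketch for Python 3, assuming the eight-digit naming seen in the question's URL. `BASE_URL` is copied from the question, and the helper names are made up for illustration.

```python
import urllib.request

BASE_URL = "http://www.gunnerkrigg.com//comics/"  # taken from the question


def comic_filename(number):
    """Return the eight-digit, zero-padded file name for a comic number."""
    return str(number).zfill(8) + ".jpg"


def download_comic(number):
    """Download one comic into the current directory (hypothetical helper)."""
    name = comic_filename(number)
    urllib.request.urlretrieve(BASE_URL + name, name)
    return name

# Example call (performs a real network request):
# download_comic(1)
```

`comic_filename(1)` gives `'00000001.jpg'`, so a single loop over integers can replace the three nearly identical branches.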

It seems to be cutting off the file extension for me when passed as an argument (the extension is present in the original URL). Any idea why?

— JeffThompson

1

Yes, it is. I think I assumed that if no file extension was given, the file's extension would be appended. It made sense to me at the time, but I think I now understand what's happening.

— JeffThompson

65

For Python 3 you would need to import [url.request] ( docs.python.org/3.0/library/… ): import urllib.request urllib.request.retrieve("http://...")

— wasabigeek

1

Note that the Python 3 docs list retrieve() as part of a "Legacy Interface" and say it may become deprecated in the future.

— Nathan Wailes

18

For Python 3 it is actually import urllib.request urllib.request.urlretrieve("http://...jpg", "1.jpg"). It is urlretrieve now, as of 3.x.

— user1032613

81

import urllib
f = open('00000001.jpg','wb')
f.write(urllib.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg').read())
f.close()

— DiGMi

70

Just for the record, using the requests library.

import requests
f = open('00000001.jpg','wb')
f.write(requests.get('http://www.gunnerkrigg.com//comics/00000001.jpg').content)
f.close()

Though it should check for a requests.get() error.

— ellimilial

1

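Even though this solution doesn't use urllib, you might already be using the requests library in your Python script (that was my case while searching for this), so you may want to use it to get your pictures as well.

— Iam Zesh

Thank you for posting this answer on top of the others. I ended up needing custom headers to get my download to work, and the pointer to the requests library considerably shortened the process of getting everything to work.

— kuzzooroo

Couldn't even get urllib to work in python3. Requests had no issues, and it's already loaded! The much better choice, I reckon.

— user3023715

@user3023715 in python3 you need to import request from urllib, see here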

— Yassine Sedrani

34

For Python 3 you will need to import urllib.request:

import urllib.request 

urllib.request.urlretrieve(url, filename)

For more info, check out the link.

— HISI

15

Python 3 version of @DiGMi's answer:

from urllib import request
f = open('00000001.jpg', 'wb')
f.write(request.urlopen("http://www.gunnerkrigg.com/comics/00000001.jpg").read())
f.close()

— Dennis Golomazov

10

I found this answer and edited it to be more reliable.

def download_photo(self, img_url, filename):
    try:
        image_on_web = urllib.urlopen(img_url)
        if image_on_web.headers.maintype == 'image':
            buf = image_on_web.read()
            path = os.getcwd() + DOWNLOADED_IMAGE_PATH
            file_path = "%s%s" % (path, filename)
            downloaded_image = file(file_path, "wb")
            downloaded_image.write(buf)
            downloaded_image.close()
            image_on_web.close()
        else:
            return False    
    except:
        return False
    return True

With this you never get any other kind of resource, or an exception, while downloading.

— Janith Chinthana
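
For anyone on Python 3, the same idea (save the response only when the server labels it as an image) might look roughly like the sketch below. The function name is made up for illustration. The check relies on the Content-Type header, which is whatever the server chooses to send, so treat it as a hint rather than proof. Errors are left to propagate to the caller instead of being swallowed by a bare except.

```python
import shutil
import urllib.request


def save_if_image(url, dest_path):
    """Save url to dest_path only if the response is labelled as an image.

    Returns True if the file was written, False otherwise.
    """
    with urllib.request.urlopen(url) as response:
        # Main part of the Content-Type header, e.g. 'image' for 'image/jpeg'
        if response.headers.get_content_maintype() != "image":
            return False
        with open(dest_path, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
    return True
```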

1

You should remove the 'self'.

— Euphe

8

If you know that the files are located in the same directory dir of the website site and are named filename_01.jpg, ..., filename_10.jpg, then download all of them:

import requests

for x in range(1, 11):  # 1 through 10 inclusive
    str1 = 'filename_%2.2d.jpg' % (x)
    str2 = 'http://site/dir/filename_%2.2d.jpg' % (x)

    f = open(str1, 'wb')
    f.write(requests.get(str2).content)
    f.close()

— len

7

It's easiest to just use .read() to read the partial or entire response, then write it into a file you've opened in a known good location.

— Ignacio Vazquez-Abrams
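
A minimal Python 3 sketch of that approach, assuming you pass in the destination path yourself (the function name is made up). It uses `shutil.copyfileobj` so the response is copied in chunks rather than read into memory in one go.

```python
import shutil
import urllib.request


def save_url(url, dest_path):
    """Stream the response body for url into dest_path and return the path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)  # copies chunk by chunk
    return dest_path
```

Because the file is opened explicitly, there is no doubt about where it ends up, which was the confusion in the question.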

5

Maybe you need a 'User-Agent':

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36')]
response = opener.open('http://google.com')
htmlData = response.read()
f = open('file.txt','w')
f.write(htmlData )
f.close()

— Alexander

Maybe the page is not available?

— Alexander

3

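Aside from suggesting you read the docs for retrieve() carefully ( http://docs.python.org/library/urllib.html#urllib.URLopener.retrieve ), I would suggest actually calling read() on the content of the response and then saving it into a file of your choosing, rather than leaving it in the temporary file that retrieve creates.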

— Gabriel Hurley

3

None of the code above preserves the original image name, which is sometimes required. This saves the image to your local drive while keeping its original name:

    IMAGE = URL.rsplit('/',1)[1]
    urllib.urlretrieve(URL, IMAGE)

Try this for more details.

— Ojas
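
One caveat with splitting on the last '/': a query string or fragment ends up in the file name. A small sketch that parses the URL first, using only the standard library (the function name is made up for illustration):

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlsplit(url).path
    # Decode percent-escapes such as %20 so the local name is readable
    return unquote(posixpath.basename(path))
```

Note that this returns an empty string for URLs whose path ends in '/', so a fallback name is still needed in that case.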

3

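This worked for me using Python 3.

It gets a list of URLs from the csv file and starts downloading them into a folder. If the content or image does not exist, it catches that exception and carries on working its magic.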

import urllib.request
import csv
import os

errorCount=0

file_list = "/Users/$USER/Desktop/YOUR-FILE-TO-DOWNLOAD-IMAGES/image_{0}.jpg"

# CSV file must separate by commas
# urls.csv is set to your current working directory make sure your cd into or add the corresponding path
with open ('urls.csv') as images:
    images = csv.reader(images)
    img_count = 1
    print("Please Wait.. it will take some time")
    for image in images:
        try:
            urllib.request.urlretrieve(image[0],
            file_list.format(img_count))
            img_count += 1
        except IOError:
            errorCount+=1
            # Stop in case you reach 100 errors downloading images
            if errorCount>100:
                break
            else:
                print ("File does not exist")

print ("Done!")

— Victor

2

A simpler solution may be (Python 3):

import urllib.request
import os
os.chdir("D:\\comic") #your path
i=1;
s="00000000"
while i<1000:
    try:
        urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/"+ s[:8-len(str(i))]+ str(i)+".jpg",str(i)+".jpg")
    except:
        print("not possible" + str(i))
    i+=1;

— Ayush

Be careful with using a bare except like that; see stackoverflow.com/questions/54948548/… .

— AMC
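
To illustrate the point about bare except: in Python 3, `urllib.error.HTTPError` and `urllib.error.URLError` can be caught specifically, so that things like KeyboardInterrupt or a typo in the code are not silently swallowed. A rough sketch, with a made-up function name:

```python
import urllib.error
import urllib.request


def try_download(url, dest_path):
    """Return True if url was saved to dest_path, False on a URL/HTTP error."""
    try:
        urllib.request.urlretrieve(url, dest_path)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404
        print("HTTP error", err.code, "for", url)
        return False
    except urllib.error.URLError as err:
        # The request failed before a usable response arrived
        print("Could not fetch", url, "->", err.reason)
        return False
    return True
```

`HTTPError` is a subclass of `URLError`, which is why it is listed first.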

1

What about this:

import os
import urllib.error    # plain "import urllib" does not load these submodules
import urllib.request

def from_url( url, filename = None ):
    '''Store the url content to filename'''
    if not filename:
        filename = os.path.basename( os.path.realpath(url) )

    req = urllib.request.Request( url )
    try:
        response = urllib.request.urlopen( req )
    except urllib.error.URLError as e:
        if hasattr( e, 'reason' ):
            print( 'Fail in reaching the server -> ', e.reason )
            return False
        elif hasattr( e, 'code' ):
            print( 'The server couldn\'t fulfill the request -> ', e.code )
            return False
    else:
        with open( filename, 'wb' ) as fo:
            fo.write( response.read() )
            print( 'Url saved as %s' % filename )
        return True

##

def main():
    test_url = 'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'

    from_url( test_url )

if __name__ == '__main__':
    main()

— gmas80

0

If you need proxy support you can do this:

  if needProxy == False:
    returnCode, urlReturnResponse = urllib.urlretrieve( myUrl, fullJpegPathAndName )
  else:
    proxy_support = urllib2.ProxyHandler({"https":myHttpProxyAddress})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)
    urlReader = urllib2.urlopen( myUrl ).read() 
    with open( fullJpegPathAndName, "wb" ) as f:  # binary mode, the payload is image data
      f.write( urlReader )

— Eamonn Kenny

0

Another way to do this is via the fastai library. This worked like a charm for me. I was facing an SSL: CERTIFICATE_VERIFY_FAILED Error using urlretrieve, so I tried that.

url = 'https://www.linkdoesntexist.com/lennon.jpg'
fastai.core.download_url(url,'image1.jpg', show_progress=False)

— Sid

"I was facing an SSL: CERTIFICATE_VERIFY_FAILED Error": see stackoverflow.com/questions/27835619/…

— AMC

0

Using requests

import requests
import shutil,os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
currentDir = os.getcwd()
path = os.path.join(currentDir,'Images')#saving images to Images folder

def ImageDl(url):
    attempts = 0
    while attempts < 5:#retry 5 times
        try:
            filename = url.split('/')[-1]
            r = requests.get(url,headers=headers,stream=True,timeout=5)
            if r.status_code == 200:
                with open(os.path.join(path,filename),'wb') as f:
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw,f)
            print(filename)
            break
        except Exception as e:
            attempts+=1
            print(e)

if __name__ == '__main__':
    ImageDl(url)

— Sohan Das

0

Using urllib, you can get this done instantly.

import urllib.request

opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

urllib.request.urlretrieve(URL, "images/0.jpg")

— Sreekant Shenoy
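
`install_opener` changes the default opener for every later `urlopen`/`urlretrieve` call in the process. If the header is only needed for one download, a per-request variant is possible. The sketch below is one way to do it: the function names and the `USER_AGENT` value are placeholders, not anything from the answer above.

```python
import shutil
import urllib.request

# Placeholder value; substitute whatever User-Agent the site expects
USER_AGENT = "Mozilla/5.0 (compatible; example-downloader)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries a User-Agent header for this one call."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_with_user_agent(url, dest_path):
    """Fetch url with the custom header and write the body to dest_path."""
    with urllib.request.urlopen(build_request(url)) as response, \
            open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
```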