Using OGR and Shapely more efficiently? [closed]

29

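I'm looking for suggestions on how to make my Python code more efficient. Normally efficiency doesn't matter to me, but I'm now working with a text file of US locations that contains over 1.5 million points. With the setup below, it takes about 5 seconds to process a single point. I need to get that figure down.

I'm using three different Python GIS packages to perform a few operations on the points and write out a new delimited text file.

I use OGR to read a county boundary shapefile and access the boundary geometry.
Shapely checks whether a point falls within any of those counties.
If it falls within one, I use the Python Shapefile Library to pull the attribute information from the boundary .dbf.
I then write some information from both sources to a text file.

I suspect the inefficiency lies in the 2-3 levels of nested loops, but I'm not sure what to do about it. I'm particularly looking for help from someone with experience using any of these three packages.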

import os, csv
from shapely.geometry import Point
from shapely.geometry import Polygon
from shapely.wkb import loads
from osgeo import ogr
import shapefile

pointFile = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\NationalFile_20110404.txt"
shapeFolder = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New"
#historicBounds = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\US_Counties_1860s_NAD"
historicBounds = "US_Counties_1860s_NAD"
writeFile = "C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\NewNational_Gazet.txt"

#opens the point file, reads it as a delimited file, skips the first line
openPoints = open(pointFile, "r")
reader = csv.reader(openPoints, delimiter="|")
reader.next()

#opens the write file
openWriteFile = open(writeFile, "w")

#uses Python Shapefile Library to read attributes from .dbf
sf = shapefile.Reader("C:\\NSF_Stuff\\NLTK_Scripts\\Gazetteer_New\\US_Counties_1860s_NAD.dbf")
records = sf.records()
print "Starting loop..."

#This will loop through the points in pointFile    
for row in reader:
    print row
    shpIndex = 0
    pointX = row[10]
    pointY = row[9]
    thePoint = Point(float(pointX), float(pointY))
    #This section uses OGR to read the geometry of the shapefile
    openShape = ogr.Open((str(historicBounds) + ".shp"))
    layers = openShape.GetLayerByName(historicBounds)
    #This section loops through the geometries, determines if the point is in a polygon
    for element in layers:
        geom = loads(element.GetGeometryRef().ExportToWkb())
        if geom.geom_type == "Polygon":
            if thePoint.within(geom) == True:
                print "!!!!!!!!!!!!! Found a Point Within Historic !!!!!!!!!!!!"
                print str(row[1]) + ", " + str(row[2]) + ", " + str(row[5]) + " County, " + str(row[3])
                print records[shpIndex]
                openWriteFile.write((str(row[0]) + "|" + str(row[1]) + "|" + str(row[2]) + "|" + str(row[5]) + "|" + str(row[3]) + "|" + str(row[9]) + "|" + str(row[10]) + "|" + str(records[shpIndex][3]) + "|" + str(records[shpIndex][9]) + "|\n"))
        if geom.geom_type == "MultiPolygon":
            for pol in geom:
                if thePoint.within(pol) == True:
                    print "!!!!!!!!!!!!!!!!! Found a Point Within MultiPolygon !!!!!!!!!!!!!!"
                    print str(row[1]) + ", " + str(row[2]) + ", " + str(row[5]) + " County, " + str(row[3])
                    print records[shpIndex]
                    openWriteFile.write((str(row[0]) + "|" + str(row[1]) + "|" + str(row[2]) + "|" + str(row[5]) + "|" + str(row[3]) + "|" + str(row[9]) + "|" + str(row[10]) + "|" + str(records[shpIndex][3]) + "|" + str(records[shpIndex][9]) + "|\n"))
        shpIndex = shpIndex + 1
    print "finished checking point"
    openShape = None
    layers = None


#close the file objects (pointFile and writeFile are just path strings)
openPoints.close()
openWriteFile.close()
print "Done"

— GrantD

3

You might consider posting this on Code Review: codereview.stackexchange.com

— RyanDalton

21

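The first step would be to move opening the shapefile outside the row loop. As written, you are opening and closing the shapefile 1.5 million times.

To be honest, though, I'd load the whole lot into PostGIS and do it with SQL on indexed tables.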
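
A minimal sketch of that first step, assuming the county geometries fit in memory. The names `match_points`, `load_geometries`, and `contains` are hypothetical and not part of OGR or Shapely. In the question's script the loader would wrap the `ogr.Open` / `GetLayerByName` / `loads(element.GetGeometryRef().ExportToWkb())` lines, and the predicate would be `thePoint.within(geom)`. Toy boxes stand in for the geometries here so the sketch runs without any GIS library.

```python
def match_points(points, load_geometries, contains):
    """Yield (point, geometry_index) for every geometry containing a point.

    load_geometries: hypothetical zero-argument callable returning the
        geometries; it is called once, before the point loop starts.
    contains: hypothetical predicate contains(geometry, point) -> bool.
    """
    geometries = list(load_geometries())  # done once, not once per point
    for point in points:
        for index, geometry in enumerate(geometries):
            if contains(geometry, point):
                yield point, index


# Toy stand-ins: each "geometry" is a box (min_x, min_y, max_x, max_y).
load_calls = []


def load_boxes():
    load_calls.append(1)  # record how often the "file" gets opened
    return [(0, 0, 1, 1), (1, 0, 2, 1)]


def box_contains(box, point):
    min_x, min_y, max_x, max_y = box
    x, y = point
    return min_x < x < max_x and min_y < y < max_y


matches = list(match_points([(0.5, 0.5), (1.5, 0.5), (5.0, 5.0)],
                            load_boxes, box_contains))
```

The geometry index plays the same role as `shpIndex` in the question, so the matching `.dbf` record can still be looked up with `records[index]`.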

— Ian Turton

19

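A quick look at your code brings a few optimisations to mind:

Check each point against the bounding box/envelope of the polygons first, to eliminate obvious outliers. You could go one step further and count the number of bboxes a point lies in: if it is exactly one, it doesn't need to be tested against the more complex geometry; if it lies in more than one, it needs further testing. You could do two passes to separate the simple cases from the complex ones.
Instead of looping through each point and testing it against the polygons, loop through the polygons and test each point. Loading and converting geometry is slow, so you want to do it as little as possible. Also, build a list of points from the CSV up front, again so that you don't repeat the work many times per point only to discard the results at the end of each iteration.
Spatially index your points, which means converting them to a shapefile, a SpatiaLite file, or something like a PostGIS/PostgreSQL database. This has the advantage that tools like OGR will be able to do most of the work for you.
Don't write the output until the end: print() is an expensive function at the best of times. Instead, store the data in a list and write it out at the very end using the standard Python pickling functions or list-dumping functions.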
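
The first, second, and last suggestions can be sketched together as follows. This is only an illustration using the standard library: `bbox`, `in_bbox`, `in_ring`, and `match` are hypothetical helpers, the ray-casting test is a stand-in for Shapely's `within`, and the polygons and points are made-up toy data. The sketch runs the exact test on every bounding-box hit, because a point can fall inside a box yet outside the polygon, as `p2` does here.

```python
import io


def bbox(ring):
    """Return (min_x, min_y, max_x, max_y) for a ring of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(box, x, y):
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_ring(ring, x, y):
    """Ray-casting point-in-polygon test (stand-in for Shapely's within)."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Toggle when a horizontal ray from (x, y) crosses the edge j -> i.
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


def match(points, polygons):
    """points: list of (id, x, y); polygons: list of (name, ring)."""
    lines = []
    for name, ring in polygons:      # outer loop: each polygon is prepared once
        box = bbox(ring)
        for pid, x, y in points:
            # Cheap bbox rejection first, exact test only for candidates.
            if in_bbox(box, x, y) and in_ring(ring, x, y):
                lines.append("%s|%s|%s|%s" % (pid, x, y, name))
    return lines


polygons = [
    ("A", [(0, 0), (4, 0), (0, 4)]),          # triangle
    ("B", [(5, 0), (9, 0), (9, 4), (5, 4)]),  # square
]
points = [("p1", 1.0, 1.0), ("p2", 3.5, 3.5),
          ("p3", 6.0, 2.0), ("p4", 20.0, 20.0)]

lines = match(points, polygons)
out = io.StringIO()                  # stands in for the output file
out.write("\n".join(lines) + "\n")   # one write at the very end
```

With 1.5 million points the collected lines should still fit in memory, but writing in large batches would be a reasonable variation if they do not.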

— MerseyViking

5

The first two would pay off big. You might also speed everything up by using ogr for all of it instead of Shapely and Shapefile.

— sgillies

2

When it comes to "Python" and "spatial index", look no further than Rtree, as it is very fast at finding shapes near other shapes.

— Mike T
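
To illustrate what a spatial index buys you in this problem, here is a rough standard-library sketch. It is not Rtree: `GridIndex` is a hypothetical uniform-grid index used as a simple stand-in, and the boxes are made-up toy data. The idea is the same, though: insert each county's bounding box once, then ask the index which few counties are candidates for a point instead of testing every county.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Uniform-grid index over bounding boxes (a simple stand-in for an R-tree)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (col, row) -> list of item ids

    def _cell(self, x, y):
        return floor(x / self.cell_size), floor(y / self.cell_size)

    def insert(self, item_id, box):
        """Register item_id in every cell its (min_x, min_y, max_x, max_y) box touches."""
        min_col, min_row = self._cell(box[0], box[1])
        max_col, max_row = self._cell(box[2], box[3])
        for col in range(min_col, max_col + 1):
            for row in range(min_row, max_row + 1):
                self.cells[(col, row)].append(item_id)

    def candidates(self, x, y):
        """Return ids whose box may contain (x, y); an exact test must still follow."""
        return list(self.cells.get(self._cell(x, y), []))


index = GridIndex(cell_size=1.0)
index.insert(0, (0.2, 0.2, 1.8, 0.9))  # spans cells (0, 0) and (1, 0)
index.insert(1, (3.1, 3.1, 3.9, 3.9))  # cell (3, 3) only

near_first = index.candidates(1.5, 0.5)
near_second = index.candidates(3.5, 3.5)
nothing = index.candidates(7.0, 7.0)
```

Rtree follows the same insert-then-query pattern through `index.Index()`, `insert(id, (minx, miny, maxx, maxy))`, and `intersection(...)`, and the ids it returns can again serve as positions into the `.dbf` records.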