ASP.NET high CPU bringing servers to their knees


8

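Our new build has 100% CPU spikes at random intervals on each server. The site becomes completely unresponsive for long periods. It happens most often when people from other countries are logging on to the site, and so on.

We have tried perfmon, memory profilers, CLR profilers, SQL profiler, the Red Gate ANTS profiler, and load testing in UAT, but we could not reproduce the problem. This may mean it only shows up with the thousands of users hitting the live site.

One pattern we have noticed is that the new code (the broken build) actually uses significantly fewer threads.

We are also using Spring for IoC. Does it have a reputation for this kind of problem?

To make matters worse, we cannot deploy the build because of the business impact, so we cannot narrow the problem down to a subset of the new features we added.

We are truly stumped. Does anyone have battle scars that could save our lives?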


What do the temperature sensors report? I wonder whether the power supply can't keep up. (I don't know how to check this.)
sarnold

2
Can you add more detail about what you mean by bringing the server down? Is it a BSOD? Does the server restart, or do you mean the app domain restarted?

There is no way at all a "100% cpu spike" could "bring down" the server. It would have to be pegged at 100% for quite a long while, coupled with trouble with heat dissipation.
Andrew Barber

1
What is it doing?? Which process is using the CPU at the peak? This is the most important question.
Aliostad

Updated my question - is this better? Thanks for the -1 :)

Answers:


3

I suggest taking memory dumps and analyzing them in WinDbg with SOS. I have fixed some problems in our production environment that I probably would not have been able to diagnose without WinDbg.

Tess Ferrandez has a great blog where you can learn how to analyze memory dumps.


that blog is an excellent resource and we have been using it. Our problem is we can't recreate the problem again and get the dumps.

1
To recreate the problem, you may hammer your test system with jmeter (jmeter.apache.org) and ab (httpd.apache.org/docs/2.0/programs/ab.html). With these, multicores, a fast LAN and some colleagues, you should be able to stress the server enough.
Roman
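
If JMeter or ab are not at hand, the same idea can be sketched in a few lines of Python. This is only a rough illustration, not a replacement for either tool: the function names (`fetch`, `run_load`, `summarize`) and parameters are made up for this example, and it assumes plain GET requests against URLs you supply for your own test system.

```python
import http.client
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10.0):
    """Issue one GET and return the elapsed seconds, or None if it failed."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # drain the body so the server does the full work
    except (OSError, http.client.HTTPException):
        return None  # timeouts, refused connections, HTTP errors
    return time.perf_counter() - start


def run_load(urls, total_requests, concurrency, do_fetch=fetch):
    """Send total_requests requests, cycling through urls, on `concurrency` threads."""
    targets = [urls[i % len(urls)] for i in range(total_requests)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(do_fetch, targets))


def summarize(latencies):
    """Reduce a list of latencies (None = failure) to a few headline numbers."""
    ok = sorted(t for t in latencies if t is not None)
    failed = len(latencies) - len(ok)
    if not ok:
        return {"ok": 0, "failed": failed}
    p95 = ok[min(len(ok) - 1, int(0.95 * len(ok)))]
    return {
        "ok": len(ok),
        "failed": failed,
        "median": ok[len(ok) // 2],
        "p95": p95,
        "max": ok[-1],
    }
```

Passing the result of `run_load(urls, 2000, 50)` to `summarize` gives a quick view of failures and tail latency. A single machine running this will not match thousands of real users, which is why the comment above suggests several machines and colleagues.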

1

This is typically caused by the GC cleaning up large, long-lived objects (Stack Overflow had this problem, see link). Are you storing lots of object collections in cache or session?

Assault by GC

I also recommend you build and configure a new server in production to test. If you have random craziness and don't know why and can't reproduce it, I'd point the finger to hardware or configuration, not code.


We can't put any new code live because it adds new features. When the code was live, the GC usage was the same, including for generation 2. Thanks though. Do you have any more suggestions?

It's not impossible, but the hardware and configuration are nearly the same as the last deploy which we have reverted back to and is working successfully.

1

Is this a virtual server with shared resources or a physical server? If it is the former perhaps you could look at dedicating resources to this server. Good luck...


0

Try using a cache server as a frontend like Apache Traffic Server (ATS).

While this will not resolve the problem, it may help identify it: it moves the potentially harmful load off the backend (so you can see whether the frontend also has problems) and cools things down on the backend, making it easier to see what's wrong.


0

Trying to guess the fault without the data is pointless. Yes, someone on Stack Overflow or in your engineering team might get lucky, but that's just bad engineering: you can't plan how long it will take to try every guess, or whether the guesses would even find the problem.

  1. You have to repro the problem. JMeter is a good start because of its breadth, but we can't recommend the right tool without knowing your architecture.
  2. Logging, especially in your application layer, is a must. You can enable IIS traces for slow performance, but the muppets at Microsoft made it so you can't capture the entire pipeline flow when it's slow. If it is so difficult to repro, you'd really like some logs to help you narrow down where the problem is (like "oh, it's whenever we call this stored proc").
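
As one way to start narrowing things down from logs you may already have, here is a rough sketch that ranks URLs by total request time. It assumes IIS is writing W3C extended logs with the `cs-uri-stem` and `time-taken` fields enabled, which may not match your setup; the function name `slowest_endpoints` is made up for this example. Note that `time-taken` is wall-clock milliseconds, not CPU time, so treat the ranking as a hint about where to add application-level logging rather than as proof.

```python
from collections import defaultdict


def slowest_endpoints(lines, top=10):
    """Rank cs-uri-stem values by total time-taken (ms) in W3C extended log lines."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # uri -> [request count, total ms]
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive names the columns for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # other directives, blanks, or rows before any #Fields
        row = dict(zip(fields, line.split(" ")))
        try:
            uri = row["cs-uri-stem"]
            ms = int(row["time-taken"])
        except (KeyError, ValueError):
            continue  # field not logged, or a malformed row
        totals[uri][0] += 1
        totals[uri][1] += ms
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(uri, count, total_ms) for uri, (count, total_ms) in ranked[:top]]
```

Feeding it an open log file, for example `slowest_endpoints(open(path))` with `path` pointing at one of your own log files, returns `(uri, request_count, total_ms)` tuples, with the most expensive endpoints first.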

The 100% CPU is a little suspicious in the sense that it's unlikely to be I/O (providing the db is another box, a slow database should not cause 100% CPU on the webservers). You need to look closer to home.

Licensed under cc by-sa 3.0 with attribution required.