How does a computer determine the data type of a byte?


31

For example, if the bit pattern 10111100 is stored in one particular byte of RAM, how does the computer know whether to interpret this byte as an integer, an ASCII character, or something else? Is type data stored in an adjacent byte? (I don't think this is the case, since that would use twice the space for one byte.)

I suspect that perhaps the computer doesn't even know the data type, and that only the program using it knows. My guess is that because RAM is RAM (random access) and not read sequentially, a particular program just tells the CPU to fetch the info from a specific address and the program defines how to treat it. This would fit nicely with programming concepts such as the need for type casting.

Am I on the right track?


4
FYI: if you're going to talk about types, you have to do so in the context of a language. The compiler is what handles that kind of thing (symbols, type checking, operations, casting, RAM addresses, and so on). The CPU and RAM only know about bytes.
jean

4
The data type of a byte is: byte. Beyond that, the computer knows nothing. A program interprets a byte or group of bytes as a particular data type by attempting operations for that type on them, but there is no restriction: the same group of bytes can be interpreted as more than one data type (e.g., pointer casts to value types, unions in C, and so on). RAM not being read sequentially is actually irrelevant; the point is more that RAM is general-purpose. Registers, for example, are not read sequentially either, yet they are typed.
BrainSlugs83

5
Shameless plug, but this question was asked on Programmers SE basically a month ago. Here is my answer. It's long-winded at this point, but it attacks the question from several angles.
Shaz

2
One useful consequence of the hardware being agnostic to data types is that a single byte (or word, etc.) can be interpreted in multiple ways by a program. In particular, temporarily interpreting a floating-point number as an integer is used to compute a fast inverse square root.
Aoeuid
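For reference, a minimal C sketch of that trick (an illustration added here, summarizing the widely published Quake III routine; the memcpy-based punning is the well-defined way to reinterpret the bits):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Fast inverse square root (~1/sqrt(x)): the float's bit pattern is
       treated as a 32-bit integer for one "magic" step, then converted
       back and refined. */
    float q_rsqrt(float x) {
        float half = 0.5f * x;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret float bits as integer */
        bits = 0x5f3759df - (bits >> 1);  /* integer arithmetic on float bits */
        memcpy(&x, &bits, sizeof x);      /* reinterpret back as float */
        return x * (1.5f - half * x * x); /* one Newton-Raphson step */
    }

    int main(void) {
        printf("q_rsqrt(4.0) ~ %f\n", q_rsqrt(4.0f));  /* roughly 0.5 */
        return 0;
    }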

@BrainSlugs83, would you consider converting that into an answer?
DW

Answers:


38

Your suspicion is correct. The CPU does not care about the meaning of the data. Sometimes, though, it makes a difference: some arithmetic operations produce different results when the arguments are semantically signed than when they are unsigned. In that case, you have to tell the CPU which interpretation you intended.

Making sense of the data is the programmer's job. The CPU just obeys its orders, blissfully unaware of their meaning or purpose.
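To make this concrete, a minimal C sketch (an illustration added here, assuming a typical two's-complement machine; not part of the original answer): the question's bit pattern 10111100 yields different results depending on which interpretation the program picks, because the compiler emits different instructions for signed and unsigned operands.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint8_t byte = 0xBC;                 /* the question's pattern, 10111100 */

        int8_t  as_signed   = (int8_t)byte;  /* two's-complement view */
        uint8_t as_unsigned = byte;          /* unsigned view */

        /* Same bits, different values... */
        printf("as signed:   %d\n", as_signed);        /* -68 */
        printf("as unsigned: %d\n", as_unsigned);      /* 188 */

        /* ...and conceptually different instructions: a signed shift is
           arithmetic (SAR on x86), an unsigned shift is logical (SHR). */
        printf("signed   >> 1: %d\n", as_signed >> 1);    /* -34 */
        printf("unsigned >> 1: %d\n", as_unsigned >> 1);  /*  94 */
        return 0;
    }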


1
Regarding "when the arguments are semantically signed or unsigned", how would the CPU know? The CPU operations just see parameter bytes and lack that sort of data type context awareness. You imply the data type by choosing the appropriate CPU operation (or your compiler does).
Shiv

4
@Shiv In such cases, the CPU is actually issued a different instruction to process signed numbers versus unsigned numbers. As in the OP's suspicions, the program is obliged to provide those details, because the CPU is unaware.
Cort Ammon - Reinstate Monica

2
I've been working with computers for as long as I can remember, and even though I know the CPU doesn't care about the high-level constructs we use in high-level programming, this separation of concepts still freaks me out from time to time.
Loupax

1
@Loupax Well, working with a really low-level assembly helps quite a bit - even mov al, 42 is kind of high-level - it's obvious there's only one possible instruction this could call, but it's still somewhat abstracted away. However, using mov.8 al, 42 explicitly makes this painfully obvious :)
Luaan

1
@Shiv: I'd like to note that there are machines where the data in memory are typed. These are called tagged memory architectures (or simply tagged architectures) but they've not been as successful commercially as regular architectures partly because we now program mostly in compiled languages instead of assembly and the compiler takes care of typing. See: en.wikipedia.org/wiki/Tagged_architecture
slebetman

14

As others have already answered, today's common CPUs do not know what a given memory position contains; the software decides.

However, there are other possibilities. Lisp Machines for example used a tagged architecture which stored the type of each memory position; that way the hardware itself could do some of the work of high-level languages.

And even now, I guess you could consider the NX bit in Intel, AMD, ARM and other architectures to follow the same principle: distinguish at the hardware level whether a given memory zone contains data or instructions.

Also, just for completeness, in Harvard architectures (like some microcontrollers) data and instructions are physically separated, so the CPU does have some idea of what it is reading.

In this Quora question there's some commentary on how the tagged memory worked, its performance implications and demise, and more.


Tagged architecture is an interesting note. Would it be significantly faster?
Bassinator

4

Yes. The program just gets a byte from memory, and it can interpret it however it wants.


3

There are no type annotations.
RAM stores pure data, and the program then defines what to do with it.

With CPU registers it is a bit different: if you have registers of a given type (like the FPU's), you are telling the CPU what is inside.
Operations on floating-point registers explicitly use typed data; you or your compiler decide what gets put there and when, so you do not have the same freedom.
The computer makes no assumptions about the underlying data in RAM, and in registers with one exception: typed registers in the CPU are of a known type, optimized to deal with it. This only shows that there are places where data is expected to be of a certain type, but nothing stops you from casting strings to floats and multiplying them.

In programming languages you specify the type, or in higher-level languages the data is generic and the compiler / interpreter / VM encodes what is inside, with some overhead.
For example, in C the pointer type tells the program what to do with the data and how to access it.

Of course you can read a string (characters) and treat it as floating-point values or integers, and mix them.
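To make that freedom concrete, a small self-contained C sketch (an illustration added here, assuming a little-endian machine with 32-bit int and float): the same four bytes read as text, as an integer, and as a float.

    #include <stdio.h>
    #include <stdint.h>

    /* The same four bytes, viewed through three different "types".
       (Union-based punning is permitted in C; in C++ prefer memcpy.) */
    union four_bytes {
        char     text[4];
        uint32_t integer;
        float    real;
    };

    int main(void) {
        union four_bytes b = { .text = { 'a', 'b', 'c', 'd' } };

        printf("as text:    %.4s\n", b.text);
        printf("as integer: %u\n",   b.integer);  /* 1684234849 on little-endian */
        printf("as float:   %g\n",   b.real);     /* some unremarkable float */
        return 0;
    }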


Even bits in an FPU register don't always represent floating-point values. In the old days (maybe not so much anymore?), a common optimization was to use floating-point registers (64 bits or larger) to copy data faster than general-purpose/integer registers (32-bit) could; being twice as big, they could generally copy data twice as fast.
Seth

1
I totally agree with you; that is why I wrote that somebody might push strings there. And at the same time, people did floating-point operations on integers because it was faster. That is the point!
Evil

@HCBPshenanigans there are instructions that manipulate floating-point values. If FADD is used, it only makes sense if the (4-, 8-, or 10-byte) groups of memory hold floating-point numbers. That's true for several kinds of instruction: multiplying two integers only makes sense if they are integers, a jump only makes sense if the operand is an address.
JDługosz

@Seth and @Evil: that's not assumed to be the case for the legacy stacked 8087 floating-point instructions, but it is the case for the newer SIMD registers, which may be used just for loading/saving with no interpretation (though they must be aligned), with the caveat that if the SIMD registers were never used then they don't need to be saved on a context switch. If you (only) move 8 bytes via an XMM register, it's a net loss, as the whole register set then needs to be saved.
JDługosz

3

The CPU doesn't care: it executes machine code, which merely moves data around, shifts it, adds it, or multiplies it...

Data types are a higher-level language concept: in C or C++ you need to specify a type for every single piece of data you manipulate, and the C/C++ compiler takes care of transforming these pieces of data into the right commands for the CPU to process (compilers emit assembly code).

In some even higher-level languages, types may be inferred: in Python or JavaScript, for example, you do not have to specify data types, yet data still has a type: you can't add a string to an integer, but you can add a float to an integer. The 'compiler' (which in the case of JavaScript is a JIT, Just-In-Time, compiler) checks this. JavaScript is often called an 'interpreted' language because historically browsers interpreted JavaScript code, but nowadays JavaScript engines are compilers.

Code always ends up being compiled to machine code, but obviously the machine code format depends on the machine you're targeting (x86 64-bit code won't work on an x86 32-bit machine or an ARM processor, for example).

So there are actually a lot of layers involved in running interpreted code.

Java and C# are other interesting cases: Java or C# code is technically 'compiled' to a Java binary (bytecode), but that code itself is then interpreted by the Java Runtime, which is specific to the underlying hardware (one needs to install the JRE targeting the right machine to run Java binaries (jars)).


A compiler compiles, be it JIT or not, and an interpreter interprets without compiling (because otherwise it would be a compiler!). They are very different things. And regarding Java being "interesting" because of bytecode interpretation, consider that even x86 machine code will actually be interpreted (or even compiled?) into microcode by the microprocessor itself.
hmijail

Thanks for the clarification... Agreed: a compiler compiles, and an interpreter interprets. In the case of JavaScript, though, the story is a bit complicated, since some older browsers interpreted the code while more modern browsers compile just-in-time, which is probably why it is still referred to as an 'interpreted' language even though it technically isn't anymore.
MrE

But AFAIK, JS starts interpreted, and then might get compiled as needed. And JITs can switch from interpreted to compiled to interpreted again, depending on lots of things. For example, a piece of code might get compiled for a variable having a given type; but then the code is run again with that variable having a different type, so the existing compiled code can't be used so the interpreter jumps in - until the code gets compiled again for the new type...
hmijail

You're citing me on something I didn't say, please remove it because it's totally wrong. Microcode has NOTHING to do with the OS; it's something internal to the microprocessor. 32 bit or 64 bit also has nothing to do with it.
hmijail

3

Datatypes are not a hardware feature. The CPU knows a couple (well, a lot) of different commands. Those are called the instruction set of a CPU.

One of the best known ones is the x86 instruction set. If you search for "multiply" on this page, you get 50 results. MULPD and MULSD for the multiplication of doubles, FIMUL for integer multiplication, ...

Those commands work on registers. Registers are memory slots that can hold a fixed number of bits (often 32 or 64, depending on your CPU's architecture), no matter what those bits represent. Hence each CPU instruction interprets the values in the registers in its own way, but the values themselves don't have types.

An example was given at PyCon 2017 by Stuart Williams:

[slide from the talk not reproduced here]
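To connect this back to code (a small sketch added here, not from the talk; it assumes IEEE-754 doubles): the 64 bits produced by a floating-point multiply carry no type tag, and the program can freely peek at them as a raw integer. Only the instruction chosen (MULSD versus an integer multiply) gives them a numeric meaning.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        double  d = 2.0 * 3.5;  /* compiled to a floating-point multiply, e.g. MULSD */
        int64_t i = 2 * 3;      /* compiled to an integer multiply, e.g. IMUL */

        /* The 64 bits of the double have no type attached; view them raw. */
        uint64_t raw;
        memcpy(&raw, &d, sizeof raw);
        printf("7.0 as raw bits: 0x%016llx\n", (unsigned long long)raw); /* 0x401c000000000000 */
        printf("integer result:  %lld\n", (long long)i);
        return 0;
    }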


1
Note that this isn't strictly true: there are special-purpose registers that can't contain arbitrary values (for example, pointer registers that aren't just any address and don't allow arbitrary additions, or floating point registers where you can't store non-normalized values). But your answer is correct for general-purpose registers on most architectures.
Gilles 'SO- stop being evil'

2

...that a particular program just tells the CPU to fetch the info from a specific address and the program defines how to treat it.

Exactly. And indeed RAM is not read "sequentially": it stands for Random Access Memory, which is exactly the opposite.

Besides not knowing what a byte means, you don't even know whether it is a byte at all, or a fragment of a larger item like a floating-point number.

I'd like to add to other answers by giving some specific examples.

Consider 01000001. The program might copy it from one place to another as part of a large parcel of data without any regard to its meaning. But copying that to the address used by the text-mode video buffer will cause the letter A to show in some position on the screen. The exact same action when the card is in a CGA graphics mode will display a red pixel and a blue pixel.

In a register, it could be the number 65 as an integer. Doing arithmetic to set the 32's bit could mean anything without context, but might specifically be changing a letter to lower case.
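In C terms (a tiny sketch added here to illustrate that idea): the very same OR operation either lowercases a letter or just adds 32, depending entirely on what the program means by the byte.

    #include <stdio.h>

    int main(void) {
        unsigned char byte = 65;          /* 01000001 */

        /* Setting the 32's bit (bit 5): as a character, 'A' becomes 'a'; */
        /* as a plain integer, 65 merely becomes 97.                      */
        unsigned char result = byte | 32;

        printf("as a letter:   %c -> %c\n", byte, result);  /* A -> a   */
        printf("as an integer: %d -> %d\n", byte, result);  /* 65 -> 97 */
        return 0;
    }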

The 8086 CPU (still!) has a special instruction called DAA that is used when a register holds two packed decimal digits, so if you just used that instruction, you would be interpreting the byte as the two digits 4 and 1.

Programs crash when a memory word is read as though it were a pointer, but something else was stored there.

When you inspect memory in a debugger, a symbol map guides the interpretation for display. Without that symbol information, a low-level debugger lets you specify the interpretation yourself: show this address as 16-bit words, show this address as long floating point, as strings... whatever. Puzzling out a network packet dump or an unknown file format is the same kind of challenge.

That's a major source of power and flexibility in modern computer architecture: a memory cell can mean anything, data or instruction, implicit only in what it "means" to the program, through what the program does with the value and how that affects subsequent operations. Meaning runs deeper than integer width: are these characters ASCII or EBCDIC? Do they form words in English, or SKU product codes? Is this the address to send to, or the return address it came from? The lowest-level interpretation (logical bits; integer-like, signed or unsigned; float; BCD; pointer) is contextual at the instruction-set level, but you can see that it's all context at some level: the "to" address is what it is because of where it's printed on the envelope. That context belongs to the rules of the postman, not the CPU. Context is one big continuum, with bits at one end of it.


※ Footnote: The DAA instruction is encoded as the byte 00100111. So that byte is the aforementioned instruction if read in the instruction stream; the digits 2 and 7 if interpreted as BCD digits; 0x27 = 39 as an integer, which is the apostrophe character in ASCII; and part of the interrupt table (half of the INT 13 2-byte address, used for BIOS service routines).


1

The only way the computer knows that a memory location holds an instruction is that a special-purpose register called the instruction pointer points to it at one moment or another. If the instruction pointer points to a memory word, it is loaded and executed as an instruction. Other than that, the computer has no way of distinguishing programs from other types of data.
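As a small illustration (a sketch added here; casting a function pointer to a data pointer is a common extension rather than strict ISO C, but it works on typical desktop platforms): the same bytes can be executed through the instruction pointer or read back as plain data.

    #include <stdio.h>

    static int add(int a, int b) { return a + b; }

    int main(void) {
        /* Execute the bytes at 'add' as code (the call steers the
           instruction pointer there)... */
        printf("add(2, 3) = %d\n", add(2, 3));

        /* ...and read the very same bytes as data. */
        const unsigned char *code = (const unsigned char *)(void *)add;
        printf("first bytes of add: %02x %02x %02x %02x\n",
               code[0], code[1], code[2], code[3]);
        return 0;
    }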

Licensed under cc by-sa 3.0 with attribution required.