Any training procedure that minimizes a loss function will, if that loss is sufficiently convex, find a solution that is a global minimum of the loss. I say 'sufficiently convex' because deep networks are not on the whole convex, but in practice they reach reasonable minima, given careful choices of learning rate and so on.
Therefore, the behavior of such models is defined by whatever we put in the loss function.
Imagine that we have a model, F, that assigns some arbitrary real scalar to each example, such that more negative values tend to indicate class A, and more positive numbers tend to indicate class B.
$$y_f = f(x)$$
We use F to create model G, which applies a threshold, b, to the output of F, implicitly or explicitly: when F outputs a value greater than b, model G predicts class B; otherwise it predicts class A.
$$y_g = \begin{cases} B & \text{if } f(x) > b \\ A & \text{otherwise} \end{cases}$$
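As a minimal sketch of this setup (the encoding of class A as 0 and class B as 1, and the scores standing in for F's outputs, are assumptions for illustration):

```python
import numpy as np

# Minimal sketch of model G: apply a threshold b to F's scalar scores.
# Encoding 0 = class A, 1 = class B is an assumption for illustration.
def g(scores: np.ndarray, b: float) -> np.ndarray:
    return (scores > b).astype(int)  # 1 (class B) when f(x) > b, else 0 (class A)

scores = np.array([-2.1, -0.3, 0.4, 1.7])  # hypothetical f(x) outputs
print(g(scores, b=0.0))                    # [0 0 1 1]
```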
By varying the threshold b that model G learns, we can vary the proportion of examples classified as class A or class B, moving along a precision/recall curve for each class. A higher threshold gives lower recall for class B, but typically higher precision.
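To make that concrete, here is a sketch using scikit-learn's `precision_recall_curve` on made-up labels and scores (1 encodes class B); each distinct score is evaluated as a candidate threshold:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels (1 = class B) and scores from F.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([-1.2, -0.8, -0.1, 0.2, 0.3, 0.5, 0.9, 1.1, 1.4, 2.0])

# Precision and recall here are for the positive class, B.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:+.1f}  precision(B)={p:.2f}  recall(B)={r:.2f}")
```

Raising the threshold walks toward fewer, more confident B predictions: recall for B drops while precision tends to rise.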
Imagine that model F is such that, if we choose the threshold that gives equal precision and recall for each class, the accuracy of model G is 90% for either class (by symmetry). So, given a training example, G would get it right 90% of the time, whatever the ground truth, A or B. This is presumably where we want to get to. Let's call this our 'ideal threshold', or 'ideal model G', or perhaps $G^*$.
Now, let's say we have a loss function which is:
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}_{y_i \neq g(x_i)}$$
where $\mathbb{I}_c$ is an indicator variable that is 1 when $c$ is true and 0 otherwise, $y_i$ is the true class for example $i$, and $g(x_i)$ is the class predicted for example $i$ by model G.
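Written directly in code, this loss is just the fraction of examples where prediction and ground truth disagree (the labels below are hypothetical):

```python
import numpy as np

# The 0/1 loss above: the mean of the indicator y_i != g(x_i) over the dataset.
def zero_one_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(y_true != y_pred))

y_true = np.array([0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0])
print(zero_one_loss(y_true, y_pred))  # 0.4
```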
Imagine that we have a dataset with 99 times as many training examples of class A as of class B, and we feed examples through. For every 99 examples of A, we expect $99 \times 0.9 = 89.1$ examples correct, and $99 \times 0.1 = 9.9$ examples incorrect. Similarly, for every 1 example of B, we expect $1 \times 0.9 = 0.9$ examples correct, and $1 \times 0.1 = 0.1$ examples incorrect. The expected loss will be:
$$L = \frac{9.9 + 0.1}{100} = 0.1$$
Now, let's look at a model G whose threshold is set such that class A is always chosen. For every 99 examples of A, all 99 will be correct: zero loss. But every example of B will be systematically misclassified, contributing 1 error per 100 examples, so the expected loss over the training set will be:
$$L = 0.01$$
Ten times lower than the loss at the threshold that gave equal recall and precision to each class.
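A quick Monte-Carlo sanity check of this arithmetic, under the assumptions above (per-class accuracy of 0.9 for the ideal $G^*$, a 99:1 imbalance, 0 encoding A and 1 encoding B):

```python
import numpy as np

rng = np.random.default_rng(0)

# 99:1 imbalanced dataset; 0 encodes class A, 1 encodes class B.
n_a, n_b = 99_000, 1_000
y = np.array([0] * n_a + [1] * n_b)

# Ideal G*: keeps each true label with probability 0.9, flips it otherwise.
pred_ideal = np.where(rng.random(y.size) < 0.9, y, 1 - y)
# Degenerate G: always predicts class A.
pred_always_a = np.zeros_like(y)

print("loss of ideal G*:  ", np.mean(pred_ideal != y))     # ~0.10
print("loss of always-A G:", np.mean(pred_always_a != y))  # 0.01
```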
Therefore, the loss function will drive model G to choose a threshold that picks class A with higher probability than class B, driving up the recall for class A but lowering it for class B. The resulting model no longer matches what we might hope for; it no longer matches our ideal model $G^*$.
To correct this, we could, for example, modify the loss function so that getting B wrong costs much more than getting A wrong. This moves the minimum of the loss function closer to the earlier ideal model $G^*$, which assigned equal precision and recall to each class.
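One possible form of such a loss (the weighting scheme and normalization here are my assumptions, for illustration) weights each error by its class, so that an error on B costs $w_B$ times as much as an error on A:

```python
import numpy as np

# Class-weighted 0/1 loss: errors on class B (encoded 1) cost w_b times
# more than errors on class A, normalized by the total weight.
def weighted_loss(y_true, y_pred, w_b=99.0):
    weights = np.where(y_true == 1, w_b, 1.0)
    return float(np.sum(weights * (y_true != y_pred)) / np.sum(weights))

y = np.array([0] * 99 + [1])
always_a = np.zeros_like(y)
print(weighted_loss(y, always_a))  # 0.5
```

With $w_B = 99$, always predicting A now costs 0.5, while the ideal $G^*$ costs $(9.9 \times 1 + 0.1 \times 99)/198 = 0.1$, so the minimum moves back toward the balanced threshold.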
Alternatively, we can modify the dataset by replicating every B example 99 times, so that the classes are balanced; this also moves the minimum of the loss function back to our earlier ideal threshold.
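A minimal sketch of that oversampling fix (`X`, `y`, and the helper name are hypothetical):

```python
import numpy as np

# Repeat each class-B row (encoded 1) `factor` times so that both classes
# contribute equally to the unweighted loss.
def balance_by_oversampling(X, y, factor=99):
    b_mask = (y == 1)
    X_b = np.repeat(X[b_mask], factor, axis=0)
    y_b = np.repeat(y[b_mask], factor)
    return np.concatenate([X[~b_mask], X_b]), np.concatenate([y[~b_mask], y_b])

X = np.random.default_rng(1).normal(size=(100, 2))
y = np.array([0] * 99 + [1])
X_bal, y_bal = balance_by_oversampling(X, y)
print(np.bincount(y_bal))  # [99 99]
```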