In regression analysis, why are independent variables called "independent"?


30

It seems that some of these variables can be strongly correlated with each other. How/why/in what context do we define them as independent variables?


1
It is historical, and comes from French scientific work. I am trying to find a reference.
アレコスパパドプロス

1
I would call a set of variables "potentially co-dependent" to avoid implying causality.
qed

1
A good question!
Rafael Marazuela

Answers:


29

If we pull back from today's emphasis on machine learning and recall how much of statistical analysis was developed for controlled experimental studies, the phrase "independent variables" makes a good deal of sense.

In controlled experimental studies, the choices of a drug and its concentrations, or the choices of a fertilizer and its amounts per acre, are made independently by the investigator. The interest is in how a response variable of interest (e.g., blood pressure, crop yield) depends on these experimental manipulations. Ideally, the characteristics of the independent variables are tightly specified, with essentially no errors in knowing their values. Then standard linear regression, for example, models the differences among values of dependent variables in terms of the values of the independent variables plus residual errors.

The same mathematical formalism used for regression in the context of controlled experimental studies also can be applied to analysis of observed data sets with little to no experimental manipulation, so it's perhaps not surprising that the phrase "independent variables" has carried over to such types of studies. But, as others on this page note, that's probably an unfortunate choice, with "predictors" or "features" more appropriate in such contexts.
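To make the point above concrete, here is a minimal sketch (synthetic data, not from the answer): two "independent" variables that are strongly correlated with each other still fit fine in ordinary least squares, so the name says nothing about their mutual independence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # x2 is highly correlated with x1
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.corrcoef(x1, x2)[0, 1])  # close to 1: the "independent" variables are far from independent
print(beta)                       # coefficients roughly near [0, 2, 3] despite the collinearity
```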


2
But the choice of the levels of the drug is dependent on what the investigator does, which is why I can never remember which is which.
mdewey

In machine learning, "features" are often latent, unobserved variables. "Observed features" is more common.
Neil G

18

In many ways, "independent variable" is an unfortunate choice. The variables need not be independent of each other, and of course need not be independent of the dependent variable Y. In teaching and in my book Regression Modeling Strategies I use the word predictor. In some situations that word is not strong enough, but it works well on the average. A full description of the role of the X (right hand side) variables in a statistical model might be too long to use each time: the set of variables or measurements upon which the distribution of Y is conditioned. This is another way of saying the set of variables whose distributions we are currently not interested in, but whose values we treat as constants.


So all you are saying is that calling input variables "independent" is wrong practice? @Frank
Amarpreet Singh

11
They are definitely not assumed to be independent of ANYTHING so it's wrong practice, used only because of habit.
Frank Harrell

1
"the set of variables or measurements upon which the distribution of Y is conditioned" ... actually I do think of them as (and sometimes call them) the "conditioning variables" or "variables conditioned on", which is not too long a description and works naturally with the notation E(Y|X)
Silverfish

11

I agree with the other answers here that "independent" and "dependent" is poor terminology. As EdM explains, this terminology arose in the context of controlled experiments where the researcher could set the regressors independently of each other. There are many preferable terms that do not have this loaded causal connotation, and in my experience, statisticians tend to prefer the more neutral terms. There are many other terms used here, including the following:

Y                     x_{i,1}, ..., x_{i,m}
--------------------  ----------------------
Response              Predictors
Regressand            Regressors
Output variable       Input variables
Predicted variable    Explanatory variables

Personally, I use the terms explanatory variables and response variable, since those terms have no connotation of statistical independence or control, etc. (One might argue that 'response' has a causal connotation, but this is a fairly weak connotation, so I have not found it problematic.)


1
(+1) I suppose regressor/regressand are the most neutral terms, but I also prefer to explain using explanatory/response.
Frans Rodenburg

2
I agree with the tendency to prefer neutral terms, but "explanatory" sounds pretty causal to me as in: "The X variables explain why the Y variable acts in the way it does."
timwiz

1
I take it to mean explanatory in a probabilistic sense -- i.e., it explains changes in the distribution of the response variable. You might be right, but in all these cases the connotation to any causality is weak.
Reinstate Monica

2
Explanatory implies causal so is inappropriate.
Frank Harrell

1
@Frank: I don't necessarily agree with that view. Explanatory is derived from the word "explain" so I take it to imply only that the variables explain the response variable somehow. That explanation could be causal, or it could merely be statistical, and I take it to be the latter. Nevertheless, it does appear that people are interpreting the connotations of these words differently, so I will concede that some will read it as having causal connotations.
Reinstate Monica

9

To add to Frank Harrell's and Peter Flom's answers:

I agree that calling a variable "independent" or "dependent" is often misleading. But some people still do that. I once heard an answer why:

In regression analysis we have one "special" variable (usually denoted by Y) and many "not-so-special" variables (X's) and we want to see how changes in X's affect Y. In other words, we want to see how Y depends on X's.

That is why Y is called "dependent". And if one is called "dependent", what would you call the other one?


You are saying that Y depends on the X's (so Y is called the dependent variable), and by that you mean that X doesn't depend on Y. But there can be cases where X depends on Y, or correlates with Y (so it can't be called "independent" anymore). Any views on this?
Amarpreet Singh

No, I don't mean that X doesn't depend on Y. I just mean that the most basic explanation of what regression analysis does is that it describes how Y depends on X. So the most basic name for Y would be "dependent".
Łukasz Deryło

6
I'm not trying to answer the question "should we call X independent?" but rather "why do we call it independent?", just like in the title of your post.
Łukasz Deryło

5

"Dependent" and "independent" can be confusing terms. One sense is pseudo-causal or even causal and this is the one that is meant when saying "independent variable" and "dependent variable". We mean that the DV, in some sense, depends on the IV. So, for example, when modeling the relationship of height and weight in adult humans, we say weight is the DV and height is the IV.

This does capture something that "predictor" does not - namely, the direction of the relationship. Height predicts weight, but weight also predicts height. That is, if you were told to guess the height of people and were told their weights, that would be useful.

But we wouldn't say that height depends on weight.
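The symmetry of prediction described above can be illustrated with a quick sketch (synthetic height/weight data, hypothetical numbers): regressing weight on height and height on weight both yield clearly nonzero slopes, so each variable predicts the other, even though we would only say that weight "depends on" height.

```python
import numpy as np

rng = np.random.default_rng(1)
height = rng.normal(170, 10, size=500)                    # cm
weight = 0.9 * height - 85 + rng.normal(0, 8, size=500)   # kg, a rough linear link

slope_wh = np.polyfit(height, weight, 1)[0]  # slope for predicting weight from height
slope_hw = np.polyfit(weight, height, 1)[0]  # slope for predicting height from weight

print(slope_wh, slope_hw)  # both clearly positive: prediction works in both directions
```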


Are you referring specifically to SEM models?
Amarpreet Singh

No. I was thinking of regression.
Peter Flom - Reinstate Monica

OK, so it's just a matter of naming. I was confused into thinking that calling input variables "independent" means something.
Amarpreet Singh

12
DV and IV are common abbreviations (which personally I dislike), but watch out for many economists and some other social scientists for whom IV can only mean instrumental variable. It is less common to encounter people for whom DV can only mean Deo volente (God willing).
Nick Cox

0

Based on the above answers, yes, I agree that "dependent" and "independent" variable are weak terminology. But I can explain the context in which many of us use them. For a general regression problem we have an output variable, say Y, whose value depends on other input variables, say x1, x2, x3. That is why Y is called the "dependent variable". Similarly, and just to differentiate the output from the input variables, x1, x2, x3 are termed independent variables, because unlike Y they do not depend on any other variable (though here we are not talking about their dependence on each other).


Your answer is similar to that of @Ramya R.
Amarpreet Singh

-2

Independent variables are called independent because they do not depend on other variables. For example, consider the house price prediction problem. Assume we have data on house_size, location, and house_price. Here, house_price is determined by house_size and location, while location and house_size can vary freely across houses.


4
Sometimes the so-called "independent" variables in regression are correlated. So they are not necessarily statistically independent. It would be better to call them predictor variables.
Michael R. Chernick

Michael, thanks for pointing that out. I have a follow-up question. In cases where we have two predictor variables that are collinear, don't we discard one of them to eliminate the multicollinearity problem, so that our predictor variables are independent of each other?
Ramya R

1
Not necessarily. It depends on whether or not it affects the stability of estimates and how much stronger the prediction is when both variables are included. If two variables have correlation 0.1 they are not independent but the relationship between them is weak.
Michael R. Chernick
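A sketch of the point above (hypothetical data): predictors with a small correlation are technically dependent, but the collinearity is too weak to destabilize the estimates, so dropping one of them is not automatic.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.1 * x1 + np.sqrt(1 - 0.1**2) * rng.normal(size=n)  # corr(x1, x2) around 0.1

r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r**2)   # variance inflation factor for the two-predictor case

print(r)    # around 0.1: the predictors are dependent, but only weakly
print(vif)  # barely above 1: almost no inflation of the coefficient variances
```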
Licensed under cc by-sa 3.0 with attribution required.