This project is open source, and all of the source code is publicly available.
The goal of this project is an exploration of trends in the publishing record of the American Physical Society (APS). Thanks to Sylvia Do for proofreading!
The total number of papers published between 1893 and 2016 in all of the APS journals is 596 786.
The first published paper is titled:
by Samuel Sheldon and G. M. Downing, and has only 2 citations (according to Google Scholar; 05/05/18).
NUMBER OF PUBLISHED PAPERS
In 1958, Physical Review Letters (PRL) emerged as a journal to communicate short and significant findings in physics.
PRL steadily grew, reaching around 4000 papers published per year by the year 2000.
Its sister journal, Physical Review (PR), was the major APS journal to communicate longer articles prior to 1970.
In 1969, the number of articles published in that journal
reached about 4000 per year (the same number as PRL around the year 2000). This led to the journal's split into Physical Review A, B, C, and D.
Originally, the findings in statistical physics and nonlinear dynamics were published in Physical Review A (PRA).
However, the rapidly growing number of papers published in PRA led to the emergence
of a new journal: Physical Review E (PRE), devoted to statistical mechanics, nonlinear dynamics, and soft matter.
The inset (semi-log plot) on the right-hand side of the plot gives the total number of papers (for all 16 journals) published every year.
The data show that the number of papers published by APS grows roughly exponentially!
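A roughly exponential trend shows up as a straight line on a semi-log plot, so the growth rate can be estimated with an ordinary least-squares fit of log(count) against year. The sketch below is illustrative only: the yearly counts are synthetic placeholders (not the APS data), and `fit_exponential` is a hypothetical helper name.

```python
import math

def fit_exponential(years, counts):
    """Least-squares fit of log(count) = a + b * year.

    Returns (b, doubling_time), where b is the growth rate per year and
    doubling_time = ln(2) / b is the time needed for the count to double.
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    sxx = sum((x - mean_x) ** 2 for x in years)
    rate = sxy / sxx
    return rate, math.log(2) / rate

# Synthetic placeholder data (assumption): counts growing by 5% per year.
years = list(range(1950, 2017))
counts = [1000 * 1.05 ** (y - 1950) for y in years]
rate, doubling_time = fit_exponential(years, counts)
```

Applied to the real yearly totals, a fit of this kind would give a single growth rate and doubling time to quote alongside the plot.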
In the above plot, PRC is omitted because the number of papers published in this journal is much smaller than in the other journals, cf. the table below:
Journal | Total # papers | Start | End | Percent of total
Phys. Rev. | 47940 | 1913 | 1969 | 8%
Phys. Rev. Lett. | 118126 | 1970 | 2016 | 20%
Phys. Rev. A | 73320 | 1970 | 2016 | 12%
Phys. Rev. B | 176681 | 1970 | 2016 | 30%
Phys. Rev. C | 37766 | 1970 | 2016 | 6%
Phys. Rev. D | 80119 | 1970 | 2016 | 13%
Phys. Rev. E | 53387 | 1993 | 2016 | 9%
NUMBER OF AUTHORS
Average number of authors for the journals PR, PRL, PRA, PRB, PRD, and PRE.
As expected, the average number of authors per paper steadily increases over time, reaching 4 authors on average around the year 2010.
In the early days (pre-WWII), papers had fewer authors, with the maximum number of authors not exceeding 3.
In the 1950s the maximum number of authors per paper began to grow faster, and it levels off in the 1970s at 25.
This plateau is an artifact: APS does not store more than 25 authors for each entry in its databases.
Additionally, two trends emerge if we look at the average number of coauthors for each journal separately. PRL and PRB currently have the same average number
of coauthors (~5). This is in contrast to PRE and PRD, which have ~3 coauthors on average.
COUNTRIES & COLLABORATIONS
Summary of the number of papers published by authors affiliated with a given country (by university). Papers published prior to 1989 are not considered.
For papers with more than one author, the plot gives the conditional probability that, if one of the authors has an affiliation in country Y (y-axis),
at least one other author is affiliated with country X (x-axis). That is, the plot gives the probability of one country collaborating with another,
where the area of the dot represents the probability of that collaboration.
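The conditional probability described above can be sketched as follows. This is one plausible reading of the definition, not necessarily the exact procedure used for the plot: each paper is reduced to a list with one country per author, and `collaboration_probability` is a hypothetical helper name. The toy papers are made up.

```python
from collections import Counter

def collaboration_probability(papers):
    """Estimate P(another author is from X | some author is from Y).

    `papers` is a list of papers; each paper is a list with one country
    per author. Single-author papers are skipped.
    """
    papers_with = Counter()  # papers with at least one author from Y
    papers_pair = Counter()  # ... that also have another author from X
    for countries in papers:
        if len(countries) < 2:
            continue
        counts = Counter(countries)
        for y in counts:
            papers_with[y] += 1
            for x in counts:
                # X == Y only counts if two different authors share the country.
                if x != y or counts[y] >= 2:
                    papers_pair[(y, x)] += 1
    return {pair: n / papers_with[pair[0]] for pair, n in papers_pair.items()}

# Made-up example papers (assumption), one country per author.
papers = [
    ["USA", "USA"],
    ["USA", "France"],
    ["France", "Germany", "USA"],
    ["Germany"],  # single author: ignored
]
prob = collaboration_probability(papers)
```

Note that the result is not symmetric: in this toy data every French paper has a US coauthor, but only two of the three US papers have a French one.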
Full data (different journals and number of authors on a paper) can be obtained
here.
NUMBER OF PAPERS PER 1000 CITIZENS
Number of papers published by physicists affiliated with a given country per 1000 citizens in the year 2016.
This metric is a rough proxy for how much each country invests in physics
per capita.
Surprisingly, the USA, which publishes the largest number of papers, ranks fairly low, whereas countries like Switzerland, Iceland, Israel, Denmark, the UK, and Sweden are at the top of the list.
EUROPE
In terms of the absolute number of papers published in APS journals, two countries dominate in Europe: the UK and France.
The next countries in the ranking are Italy, Germany, Russia, and Spain, which publish roughly the same number of papers.
Summary of the total number of papers published by authors from the top 9 European countries in terms of published papers.
The number given in parentheses is the total number of papers published by scientists from that country in PRL, PRE, PRA, PRB, and PRD.
The publishing trends for each country correlate closely with the general trend for all countries together, with the largest number of papers
being published in PRB, and with PRL coming in second.
CITATIONS STATISTICS
For each paper published in any of the APS journals, I counted the number of citations from other papers published in APS journals.
The histogram binning and distribution fitting are done with the
powerlaw software.
The data suggest that the number of citations closely follows a power-law distribution with an exponent equal to 2.749. The emergence of the power-law distribution,
with an exponent within the range [2, 3], has mostly been explained by
"a preferential attachment process".
However, recently a more plausible model has been proposed in a
work by Ken Dill and coworkers.
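To make the notion of a fitted exponent concrete, here is a minimal sketch of the standard maximum-likelihood estimate for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)). It is a hand-rolled illustration, not the internals of the powerlaw package: citation counts are discrete and the package also selects x_min, whereas here x_min is fixed and the data are synthetic. The function names are hypothetical.

```python
import math
import random

def sample_power_law(n, alpha, x_min, rng):
    """Draw n samples from a continuous power law p(x) ~ x^(-alpha), x >= x_min,
    using inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def fit_alpha(data, x_min):
    """Maximum-likelihood estimate: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

# Synthetic data (assumption) with a known exponent, to check the estimator.
rng = random.Random(0)
samples = sample_power_law(50_000, alpha=2.75, x_min=1.0, rng=rng)
alpha_hat = fit_alpha(samples, x_min=1.0)
```

With 50 000 samples the estimate should land close to the true exponent; the statistical error scales as (alpha - 1) / sqrt(n).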
Although the fit to the power-law distribution looks great, we should be very cautious in claiming this distribution, especially in the context of the recent
paper by Aaron Clauset,
who showed that, in fact,
"scale-free networks are rare".
It turns out that the truncated power law (with an exponent of 2.68 and a truncation exponent equal to 0.00025) is statistically more likely than the pure power-law distribution.
Thus, for citation counts above about 4000 (the inverse of the truncation exponent), the distribution has a fast-decaying exponential tail rather than a "fat tail".
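The link between the truncation exponent and the threshold of 4000 citations can be written out explicitly. The form below assumes the usual parametrization of a truncated power law, a power law multiplied by an exponential cutoff; the symbols alpha, lambda and x_c are introduced here for illustration.

```latex
\[
  p(x) \propto x^{-\alpha}\, e^{-\lambda x},
  \qquad \alpha = 2.68, \quad \lambda = 0.00025
\]
\[
  x_c = \frac{1}{\lambda} = \frac{1}{0.00025} = 4000
\]
```

For x much smaller than x_c the exponential factor is close to 1 and the distribution behaves like a power law; for x larger than x_c the exponential factor takes over and the tail decays quickly.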
Below, in the left panel, I plot the cumulative number of citations in a given year for 6 different journals (PRL, PRA, PRB, PRC, PRD, PRE).
We can clearly see that PRL and PRB stand out significantly in terms of the number of citations. The remaining journals have roughly the same number of citations,
with PRC being (cumulatively) the least cited journal. Around the year 2010 we can see a drop in the number of citations. This is not surprising, given
that these papers have had less time to make an impact and be cited by other papers.
The cumulative number of citations is not a good metric of a journal's quality. Simply put, larger journals have more papers.
Thus, in the right panel I plot the total number of citations of the papers published in a journal in a given year, normalized by the number of papers published in that year and the following years. From these calculations we can
clearly see that papers published in PRL have, on average, the largest number of citations. Next, papers in PRA, PRB, PRC, and PRD are of roughly the same quality, having
a similar number of citations. Lastly, PRE stands out with the lowest number of citations per paper. This could be because these papers are of the lowest quality, but part of the
effect can also be attributed to competition with PRA. We can see that the split-off of PRE affected the number of citations for PRA, and these citations were likely transferred
to PRE.
Below I plot the number of citations for the most cited paper in each of 7 different journals in a given year (left panel).
We can see that the most cited papers come from PR or PRB. However, in the case of PRB, we see large fluctuations from year to year.
For some years there are papers that have a tremendous number of citations, whereas in the next year the most cited paper is around the average of the other journals.
This is in contrast to PRL, where we can see much higher consistency from year to year. The best papers published in PRL also dominate the other journals, except in the years when PRB papers beat them.
In the right panel, I present the fraction of papers published in a given year that have no citations at all. Not surprisingly, the fraction of these papers rises as the years approach
2016, as the papers have had less time to be cited by the community. We can see that the lowest fraction of papers without a citation is in PRL, confirming
that the journal is of high quality in comparison to the others. Interestingly, we can also see that the fraction of papers without citations rises again for papers published in the 1920s through 1940s (for PR),
the 1960s through 1990s for PRL, and the 1980s through 1990s. This is probably because of the low accessibility of papers in the pre-Internet era.
TOP 5 MOST CITED PAPERS
The table below contains the five papers most cited by other APS papers. In parentheses I give the number of citations according to Google Scholar (access date: 05/05/2018).
It is worth mentioning that the papers ranked 1st and 5th share an author:
John Perdew.
Likewise, the papers ranked 2nd and 3rd are coauthored by
Walter Kohn, the 1998 Nobel Prize winner in Chemistry.
Finally, it is interesting to note that all of these papers describe some novel computational technique, and most of them are related to the celebrated
Density Functional Theory (DFT).
Number of Citations | Title | Authors | Year | Journal | Volume
1. 7834 (GS: 83492) | Generalized Gradient Approximation Made Simple | J. P. Perdew, K. Burke, and M. Ernzerhof | 1996 | Phys. Rev. Lett. | 77
2. 7016 (GS: 47843) | Self-Consistent Equations Including Exchange and Correlation Effects | W. Kohn and L. J. Sham | 1965 | Phys. Rev. | 140
3. 5615 (GS: 42547) | Inhomogeneous Electron Gas | P. Hohenberg and W. Kohn | 1964 | Phys. Rev. | 136
4. 5527 (GS: 43316) | Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set | G. Kresse and J. Furthmüller | 1996 | Phys. Rev. B | 54
5. 4261 (GS: 18148) | Self-interaction correction to density-functional approximations for many-electron systems | J. P. Perdew and Alex Zunger | 1981 | Phys. Rev. B | 23
TOP 3 CITING PAPERS
I checked which papers cite the largest number of other papers in the APS journals. The top 3 papers are in the table below.
Not surprisingly, all three are review papers published in Reviews of Modern Physics. As we can see, the number of cited papers reaches a stunning value of around 600.
Number of References | Title | Authors | Year | Journal | Volume
607 | Electrodynamics of correlated electron materials | D. N. Basov, R. D. Averitt, D. van der Marel, M. Dressel, and K. Haule | 2011 | RMP | 83
602 | Energy Levels of Light Nuclei. III | W. F. Hornyak, T. Lauritsen, P. Morrison, and W. A. Fowler | 1950 | RMP | 22
582 | Metal-insulator transitions | M. Imada, A. Fujimori, and Y. Tokura | 1998 | RMP | 70
CORRELATIONS
In search of relations between variables, I calculated the correlation coefficients between parameters such as
the number of citations, number of references, publication date, journal volume, etc.
In particular, I was interested in whether there were any significant relationships between the number of citations and other parameters.
First I checked if there were any linear relations between variables
using the
Pearson correlation coefficient.
However, relying on this metric can lead to completely incorrect and silly
conclusions.
Thus, alongside the Pearson Coefficient (PC), I also calculated the
Distance Correlation (DC).
The advantage of the DC is that, contrary to the PC, two variables are independent if and only if the DC is equal to zero, regardless of the functional form of their relationship.
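The difference between the two measures is easiest to see on a purely nonlinear relationship. The sketch below is a plain-Python illustration (quadratic in the sample size, so only suitable for small inputs) and not the code used for the figure; the function names and the toy data y = x^2 are assumptions chosen for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def _centered_distances(v):
    """Pairwise distance matrix with row, column and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(row) / n for row in d]
    grand_mean = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation (between 0 and 1)."""
    a, b = _centered_distances(x), _centered_distances(y)
    n = len(x)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / n ** 2

    dcov2 = max(mean_product(a, b), 0.0)  # guard against rounding below zero
    dvar2 = mean_product(a, a) * mean_product(b, b)
    return math.sqrt(dcov2 / math.sqrt(dvar2)) if dvar2 > 0 else 0.0

# Toy data (assumption): y depends on x deterministically but not linearly.
x = list(range(-5, 6))
y = [v * v for v in x]
pc = pearson(x, y)
dc = distance_correlation(x, y)
```

Here the Pearson coefficient vanishes because the data are symmetric around zero, while the distance correlation is clearly positive and so flags the dependence.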
The results of these calculations are presented in the figure below. Sadly, but not surprisingly, there are no obvious relationships between the number of citations and:
journal, number of pages, volume, issue, number of authors, number of affiliations, publication date, number of countries the authors come from,
or number of references used in a paper. The full data can be found under this
link.
TITLES AND ABSTRACTS STATISTICS
Titles and abstracts published in PRA, PRB, PRC, PRD, and PRE between the years 1990 and 2016 were tokenized as described in the next section.
Below, the distribution of the number of tokens in a title or an abstract is given. The main purpose of this calculation is to find a good abstract length for the
journal classification with Deep Learning described in Section: "JOURNAL CLASSIFICATION  DEEP NEURAL NETWORK". For the forthcoming analysis, I chose a length of 200 because most of the abstracts
are shorter than 200 words.
The distribution of word occurrences in all abstracts closely follows a power law with an exponent of about 1.5 over about 4 orders of magnitude.
Similarly to the number of citations, the distribution starts to decay rapidly for occurrence counts larger than 20 000, and the truncated power law is again statistically more likely.
Word2Vec EMBEDDING
I embedded the words from the abstracts of papers published in PRA, PRB, PRC, PRD, and PRE between the years 1990 and 2016 using
Word2Vec from the
Gensim toolkit.
Each abstract was split into a set of sentences. Then, all non-alphabetic characters were removed, and the remaining letters converted to lower case.
Next, all the stop words were removed, and the remaining words stemmed with
PorterStemmer. Finally, all words shorter than 2 letters were removed.
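The tokenization steps above can be sketched in a few lines. This is an approximation under stated assumptions, not the project's actual code: the stop-word list is a tiny placeholder, the sentence splitter is a simple regular expression, non-alphabetic characters are replaced by spaces rather than deleted, and the stemmer is a pluggable argument that defaults to doing nothing. `tokenize_abstract` is a hypothetical name.

```python
import re

# Tiny placeholder stop-word list (assumption); the real list is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "we", "to", "for"}

def tokenize_abstract(text, stem=lambda word: word):
    """Return a list of token lists, one per sentence."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = []
    for sentence in sentences:
        # Replace every run of non-alphabetic characters with a space, then lower-case.
        cleaned = re.sub(r"[^A-Za-z]+", " ", sentence).lower()
        # Drop stop words, then stem what is left.
        tokens = [stem(w) for w in cleaned.split() if w not in STOP_WORDS]
        # Drop words shorter than 2 letters.
        tokens = [w for w in tokens if len(w) >= 2]
        if tokens:
            tokenized.append(tokens)
    return tokenized

text = "The spins are coupled in 2D lattices. We measure a U(1) gauge field!"
tokens = tokenize_abstract(text)
```

To add stemming, the `stem` method of NLTK's `PorterStemmer` could be passed as the `stem` argument; it is left out here so that the example has no external dependencies.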
The tokenized abstracts were then used for embedding. I embedded words into a 200-dimensional space, using only words that occurred at least 5 times (min_count=5), with a window size of 5
and the skip-gram model. The model was trained for 5 iterations. The resulting embedding is visualized below. On the left we have
t-SNE clustering.
Some clusters are visible, suggesting
word aggregation in the high-dimensional space. On the right,
PCA is shown.
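To illustrate what the window size and the minimum count mean for a skip-gram model, the sketch below builds the (center word, context word) pairs that such a model is trained on. It is a simplified illustration and not Gensim's implementation, which, for example, also shrinks the window at random; `skipgram_pairs` and the toy sentences are hypothetical.

```python
from collections import Counter

def skipgram_pairs(sentences, window=5, min_count=5):
    """Return (center, context) training pairs for a skip-gram model."""
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        # Drop rare words first, which is the role of the minimum count.
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            # Every other word within the window is a context word.
            pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy tokenized sentences (assumption), with small parameters for readability.
sentences = [["spin", "wave", "spin", "rare"], ["spin", "wave"]]
pairs = skipgram_pairs(sentences, window=1, min_count=2)
```

In this toy example the word "rare" occurs only once, so it never appears in a training pair.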
JOURNAL CLASSIFICATION  DEEP NEURAL NETWORK
Three different Deep Neural Network architectures were used for journal classification based on abstract content.
These architectures are: Feed-Forward (FF), Convolution 1D (CNN1D), and LSTM (please refer to the dnn/train_keras.py file for more details).
Dropout layers were used to prevent overfitting.
The dataset was split into training and testing sets, each having 37 500 abstracts drawn equally from 5 journals: PRA, PRB, PRC, PRD, and PRE.
The first 200 tokenized words (padded with 0s if the abstract was shorter than 200) from every abstract were taken. For each of the NN architectures, an
Embedding layer was used
with 10 000 words in the dictionary and the Word2Vec word-embedding model (size=200, window=5, min_count=5, iter=5, skip-gram).
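The input preparation described above (a 10 000-word dictionary, the first 200 tokens, zero padding) might look roughly like the sketch below. Several details are assumptions because the text does not specify them: index 0 is reserved for padding, out-of-dictionary words are dropped, and padding is added at the end. The function names are hypothetical, and the actual code is in dnn/train_keras.py.

```python
from collections import Counter

def build_vocabulary(token_lists, max_words=10_000):
    """Map the most frequent words to ids 1..max_words-1; id 0 is kept for padding."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    most_common = counts.most_common(max_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def encode_and_pad(tokens, vocabulary, max_len=200):
    """Encode the first max_len tokens, drop unknown words, right-pad with 0s."""
    ids = [vocabulary[t] for t in tokens[:max_len] if t in vocabulary]
    return ids + [0] * (max_len - len(ids))

# Toy tokenized abstracts (assumption) and a tiny dictionary for readability.
abstracts = [["spin", "wave", "spin"], ["spin", "wave", "gap"]]
vocabulary = build_vocabulary(abstracts, max_words=3)
encoded = encode_and_pad(["gap", "wave", "spin"], vocabulary, max_len=5)
```

Every encoded abstract has the same length, which is what lets the abstracts be stacked into a single array and fed to an embedding layer.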
The models were trained for 10 epochs and then evaluated on the testing dataset. The accuracies of the models are:
acc(FF)=79.41%, acc(CNN1D)=86.13%, acc(LSTM)=89.06%. The performance of the LSTM is very good, especially if we take into account that this is a multi-class classification problem
and the network has only a handful of hidden layers. This performance can possibly be improved above 90% accuracy with more epochs.
In fact, with 50 epochs (tested for the LSTM network only), the accuracy is acc(LSTM50)=89.52%.
Below, ROC curves are given for the three tested architectures.
For the LSTM architecture, I additionally calculated ROC curves for each class separately (figure below).
To inspect what kinds of misclassification the LSTM model makes, I plotted a confusion matrix (below).
We can see that papers published in PRA are sometimes classified as PRE or PRB papers, and very rarely as PRC or PRD.
The confusion with PRE is not surprising, as this journal is a daughter of PRA. PRB, PRC, and PRD are classified very accurately, although PRC and PRD are sometimes interchanged. PRB
is occasionally misclassified as PRA. As for PRE, the situation is similar to PRA.
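For reference, a confusion matrix and the overall accuracy can be computed from true and predicted labels as in the sketch below. The labels are made up for illustration and are not the model's real predictions; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] = number of samples of class i predicted as class j."""
    index = {name: k for k, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

classes = ["PRA", "PRB", "PRC", "PRD", "PRE"]
# Made-up labels for illustration only (assumption).
y_true = ["PRA", "PRA", "PRB", "PRC", "PRD", "PRE", "PRE", "PRA"]
y_pred = ["PRA", "PRE", "PRB", "PRD", "PRD", "PRE", "PRA", "PRB"]
matrix = confusion_matrix(y_true, y_pred, classes)
overall = accuracy(matrix)
```

Each row of the matrix shows how the papers of one journal were distributed over the predicted journals, so off-diagonal entries correspond to misclassifications such as PRA papers labeled as PRE.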
>>> THE END <<<