last update: 05/20/2018
    APS Database Project

    This project is open source, and all of the source code is publicly available on GitHub. The goal of this project is to explore trends in the publishing record of the American Physical Society (APS). Thanks to Sylvia Do for proofreading!

    OVERVIEW
    APS journals included in the database:
    • Physical Review: PR
    • Physical Review A: PRA
    • Physical Review Accelerators and Beams: PRAB
    • Physical Review Applied: PRAPPLIED
    • Physical Review B: PRB
    • Physical Review C: PRC
    • Physical Review D: PRD
    • Physical Review E: PRE
    • Physical Review (Series I): PRI
    • Physical Review Letters: PRL
    • Physical Review Fluids: PRFLUIDS
    • Physical Review Physics Education Research: PRPER
    • Physical Review Special Topics - Accelerators and Beams: PRSTAB
    • Physical Review Special Topics - Physics Education Research: PRSTPER
    • Physical Review X: PRX
    • Reviews of Modern Physics: RMP

    The total number of papers published between 1893 and 2016 in all of the APS journals is 596 786. The first published paper is "The Critical Current Density for Copper Deposition and the Absolute Velocity of Migration of Copper Ions", PR (Series I), 1(1), 1893, by Samuel Sheldon and G. M. Downing; it has only 2 citations (according to Google Scholar; 05/05/18).

    NUMBER OF PUBLISHED PAPERS
    In 1958, Physical Review Letters (PRL) emerged as a journal to communicate short and significant findings in physics. PRL grew steadily, reaching around 4000 papers published per year by the year 2000. Its sister journal, Physical Review (PR), was the major APS journal for longer articles prior to 1970. In 1969, the number of articles published in that journal reached about 4000 per year (the same number as PRL around the year 2000). This led to the journal's split into Physical Review A, B, C, and D. Originally, findings in statistical physics and non-linear dynamics were published in Physical Review A (PRA). However, the rapidly growing number of papers published in PRA led to the emergence of a new journal: Physical Review E (PRE), devoted to statistical mechanics, non-linear dynamics, and soft matter.

    f1

    The inset (semi-log plot) on the right-hand side of the plot gives the total number of papers (for all 16 journals) published every year. The points fall roughly on a straight line on the semi-log scale, which means that the number of papers published by APS grows roughly exponentially.
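
    A straight line on a semi-log plot corresponds to exponential growth, so the growth rate can be estimated by a least-squares fit of log(count) against the year. The sketch below illustrates this on synthetic yearly counts (not the APS data); the function name fit_exponential and all numbers are assumptions made for the example.

```python
import math

def fit_exponential(years, counts):
    """Least-squares fit of log(count) = intercept + rate * year."""
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    return intercept, rate

# Synthetic counts (NOT the APS data): 100 papers in 1900, growth rate 0.05 per year.
years = list(range(1900, 1951))
counts = [100 * math.exp(0.05 * (y - 1900)) for y in years]

intercept, rate = fit_exponential(years, counts)
doubling_time = math.log(2) / rate
print(f"growth rate: {rate:.3f} per year, doubling time: {doubling_time:.1f} years")
```

    The fitted slope is the exponential growth rate, and log(2) divided by that rate gives the doubling time of the yearly output.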

    In the above plot, PRC is omitted because the number of papers published in this journal is much smaller than in the other journals; cf. the table below:

    Journal          | Total # papers | Start | End  | Percent of total
    Phys. Rev.       |          47940 | 1913  | 1969 |  8%
    Phys. Rev. Lett. |         118126 | 1970  | 2016 | 20%
    Phys. Rev. A     |          73320 | 1970  | 2016 | 12%
    Phys. Rev. B     |         176681 | 1970  | 2016 | 30%
    Phys. Rev. C     |          37766 | 1970  | 2016 |  6%
    Phys. Rev. D     |          80119 | 1970  | 2016 | 13%
    Phys. Rev. E     |          53387 | 1993  | 2016 |  9%


    NUMBER OF AUTHORS
    Average number of authors for the journals PR, PRL, PRA, PRB, PRD, and PRE.

    f2
    As expected, the average number of authors per paper steadily increases over time, reaching 4 authors on average around the year 2010. In the early days (pre-WW2), papers had fewer authors, with the maximum number of authors not exceeding 3. In the 1950s the maximum number of authors on a paper started to grow much faster. This number levels off in the 1970s at the value 25, because APS does not store more than 25 authors for each entry in its database.

    Additionally, two trends appear if we look at the average number of coauthors for each journal separately. PRL and PRB currently have the same average number of coauthors (~5). In contrast, PRE and PRD have ~3 coauthors on average.
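
    The effect of the 25-author storage limit on the yearly statistics can be sketched as follows. This is a minimal illustration on made-up records; the (year, number of authors) layout and the function name author_stats are assumptions, not the project's actual data format.

```python
from collections import defaultdict

MAX_STORED_AUTHORS = 25  # storage limit described in the text

def author_stats(papers):
    """papers: iterable of (year, n_authors). Returns {year: (mean, max)}."""
    by_year = defaultdict(list)
    for year, n_authors in papers:
        # Author lists longer than the limit are truncated in the stored records.
        by_year[year].append(min(n_authors, MAX_STORED_AUTHORS))
    return {y: (sum(v) / len(v), max(v)) for y, v in sorted(by_year.items())}

# Hypothetical records, for illustration only.
papers = [(1935, 1), (1935, 2), (1975, 3), (1975, 400), (2010, 4), (2010, 6)]
stats = author_stats(papers)
for year, (mean, largest) in stats.items():
    print(year, mean, largest)
```

    Once any paper in a year has more than 25 authors, the observed maximum saturates at 25, and the yearly average is biased downward for large collaborations.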

    COUNTRIES & COLLABORATIONS
    A summary of the number of papers published by authors affiliated with a given country (by university). Papers published prior to 1989 are not considered.

    f3

    For papers with more than one author, the plot gives the conditional probability that, if one of the authors has an affiliation in country Y (y-axis), at least one other author is affiliated with country X (x-axis). That is, the plot gives the probability of one country collaborating with another, where the area of the dot represents the probability of that collaboration.
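
    One way to compute such a conditional probability is sketched below. It assumes each paper is given as a list with one country code per author (a simplification; an author may have several affiliations), and the function name and the toy data are hypothetical.

```python
from collections import Counter

def collaboration_probability(papers):
    """papers: list of lists, one country code per author.

    Returns {(y, x): P(another author is in x | some author is in y)},
    computed over papers with more than one author.
    """
    papers_with = Counter()  # multi-author papers with at least one author in y
    joint = Counter()        # ...that also have another author in x
    for countries in papers:
        if len(countries) < 2:
            continue
        counts = Counter(countries)
        for y in counts:
            papers_with[y] += 1
            for x in counts:
                # For x == y a second author from the same country is required.
                if x != y or counts[y] >= 2:
                    joint[(y, x)] += 1
    return {pair: n / papers_with[pair[0]] for pair, n in joint.items()}

# Toy data, for illustration only.
papers = [["US", "US"], ["US", "DE"], ["US", "DE", "FR"], ["DE"], ["FR", "DE"]]
prob = collaboration_probability(papers)
print(prob[("US", "DE")], prob[("FR", "DE")])
```

    Note that the resulting matrix is not symmetric: P(X | Y) and P(Y | X) are normalized by different numbers of papers.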

    f3
    Full data (different journals and number of authors on a paper) can be obtained here.

    NUMBER OF PAPERS PER 1000 CITIZENS
    Number of papers published by physicists affiliated with a given country per 1000 citizens in the year 2016. This metric gives a rough indication of how much each country invests in physics per capita.
    Surprisingly, the USA, which publishes the largest number of papers, is fairly low in the ranking, whereas countries like Switzerland, Iceland, Israel, Denmark, the UK and Sweden are at the top of the list.

    f3

    EUROPE
    In terms of the absolute number of papers published in APS journals, two countries dominate in Europe: the UK and France. The next countries in the ranking are Italy, Germany, Russia and Spain, which publish roughly the same number of papers.

    f4

    Summary of the total number of papers published by authors from the top 9 European countries in terms of published papers. The number given in parentheses is the total number of papers published by scientists from that country in PRL, PRE, PRA, PRB, and PRD. The publishing trends for each country correlate closely with the general trend for all countries together, with the largest number of papers being published in PRB, and with PRL coming in second.

    f5

    CITATIONS STATISTICS
    For each paper published in any of the APS journals, I counted the number of citations by other papers published in APS journals. The histogram binning and distribution fitting were done with the powerlaw software. The data suggest that the number of citations closely follows a power-law distribution with an exponent equal to -2.749. The emergence of a power-law distribution with an exponent (in absolute value) within the range [2,3] has mostly been explained by a preferential attachment process. However, a more plausible model has recently been proposed in a work by Ken Dill and coworkers. Although the fit to the power-law distribution looks great, we should be very cautious in claiming this distribution, especially in the context of the recent paper by Anna Broido and Aaron Clauset, who showed that in fact "scale-free networks are rare". It turns out that the truncated power law (with an exponent of -2.68 and a truncation exponent equal to 0.00025) is statistically more likely than the pure power-law distribution. Thus for citation counts above roughly 1/0.00025 = 4000, the distribution has a fast-decaying exponential tail rather than a "fat tail".
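
    For intuition about what a power-law fit estimates, the sketch below implements the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)), for a density p(x) ~ x^(-alpha). It is not the code of the powerlaw package used above: the data are synthetic, x_min is assumed to be known, and the function names are made up for the example. Citation counts are integers, so a careful analysis would use a discrete estimator and also compare against the truncated power law.

```python
import math
import random

def sample_power_law(n, alpha, xmin, rng):
    """Draw n samples from p(x) ~ x**(-alpha), x >= xmin, by inverse transform."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def fit_alpha(data, xmin):
    """Continuous maximum-likelihood estimate of alpha and its standard error."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    sigma = (alpha - 1.0) / math.sqrt(n)
    return alpha, sigma

# Synthetic data (NOT the APS citation counts), generated with alpha = 2.75.
rng = random.Random(0)
data = sample_power_law(50_000, alpha=2.75, xmin=1.0, rng=rng)
alpha_hat, sigma = fit_alpha(data, xmin=1.0)
print(f"alpha = {alpha_hat:.3f} +/- {sigma:.3f}")
```

    Here alpha is reported as a positive number; it corresponds to the exponent quoted with a minus sign in the text.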

    f3

    Below, in the left panel I plotted the cumulative number of citations in a given year for 6 different journals (PRL, PRA, PRB, PRC, PRD, PRE). We can clearly see that PRL and PRB significantly stand out in terms of the number of citations. The remaining journals have roughly the same number of citations, with PRC being (cumulatively) the least cited journal. Around the year 2010 we can see a drop in the number of citations. This is not surprising, given that these papers have had less time to make an impact and be cited by other papers.

    The cumulative number of citations is not a good metric of a journal's quality. Simply put, larger journals have more papers. Thus in the right panel I plotted the total number of citations of the papers published in a journal in a given year, normalized by the number of papers published in that year and the following years. From this calculation we can clearly see that papers published in PRL have, on average, the largest number of citations. Next, papers in PRA, PRB, PRC and PRD have similar numbers of citations, which suggests roughly the same quality. Lastly, PRE stands out with the lowest number of citations per paper. This could reflect lower quality, but part of the effect can also be attributed to the competition with PRA. We can see that the PRA split-off affected the number of citations for this journal, and these citations were likely transferred to PRE.
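
    The normalization described above can be read literally as dividing the citations received by the papers of a given year by the number of papers published in that year or later. The sketch below follows this reading for a single journal; whether the denominator counts papers of one journal or of all journals is an assumption here, and the numbers are invented.

```python
def normalized_citations(citations_by_year, papers_by_year):
    """citations_by_year[y]: citations received by papers published in year y.
    papers_by_year[y]: number of papers published in year y.

    Returns {y: citations / (number of papers published in year y or later)}.
    """
    result = {}
    later_papers = 0
    # Walk backwards in time so the running sum covers "year y and later".
    for year in sorted(papers_by_year, reverse=True):
        later_papers += papers_by_year[year]
        result[year] = citations_by_year.get(year, 0) / later_papers
    return dict(sorted(result.items()))

# Invented numbers, for illustration only.
papers_by_year = {2000: 100, 2001: 100, 2002: 200}
citations_by_year = {2000: 800, 2001: 600, 2002: 200}
norm = normalized_citations(citations_by_year, papers_by_year)
print(norm)
```

    Dividing by the number of later papers partly compensates for the fact that older papers have had more potential citing papers.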

    f3

    Below I plot the number of citations of the most cited paper in each of 7 journals in a given year (left panel). We can see that the most cited papers come from PR or PRB. However, in the case of PRB, we see large fluctuations from year to year. In some years there are papers with a tremendous number of citations, whereas in the following year the most cited paper is close to the average of the other journals. This is in contrast to PRL, where we can see much higher consistency from year to year. The best papers published in PRL also dominate other papers, except in the years when the PRB papers beat them.

    In the right panel, I present the fraction of papers published in a given year that have no citations at all. Not surprisingly, the fraction of these papers rises as the publication year approaches 2016, because the papers have had less time to be cited by the community. We can see that the lowest fraction of papers without a citation is in PRL, confirming that the journal is of high quality in comparison to the others. Interestingly, we can also see that the fraction of papers without citations rises again for papers published in the 1920s through 1940s (for PR), the 1960s through 1990s (for PRL), and the 1980s through 1990s. This is probably because of the low accessibility of papers in the pre-Internet era.

    f3

    TOP 5 MOST CITED PAPERS
    The table below lists the five papers most cited by other APS papers. In parentheses I give the number of citations according to Google Scholar (access date: 05/05/2018). It is worth mentioning that the papers ranked 1st and 5th have the same author: John Perdew. Likewise, the papers ranked 2nd and 3rd are by Walter Kohn, the 1998 Nobel Prize winner in Chemistry. Finally, it is interesting to note that all of these papers describe some novel computational technique, and most of them are related to the celebrated Density Functional Theory (DFT).

    Number of Citations | Title | Authors | Year | Journal | Volume
    1. 7834 (GS: 83492) | Generalized Gradient Approximation Made Simple | J. P. Perdew, K. Burke, and M. Ernzerhof | 1996 | Phys. Rev. Lett. | 77
    2. 7016 (GS: 47843) | Self-Consistent Equations Including Exchange and Correlation Effects | W. Kohn and L. J. Sham | 1965 | Phys. Rev. | 140
    3. 5615 (GS: 42547) | Inhomogeneous Electron Gas | P. Hohenberg and W. Kohn | 1964 | Phys. Rev. | 136
    4. 5527 (GS: 43316) | Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set | G. Kresse and J. Furthmüller | 1996 | Phys. Rev. B | 54
    5. 4261 (GS: 18148) | Self-interaction correction to density-functional approximations for many-electron systems | J. P. Perdew and Alex Zunger | 1981 | Phys. Rev. B | 23


    TOP 3 CITING PAPERS
    I checked which papers cite the largest number of other papers in the APS journals. The top 3 papers are in the table below. Not surprisingly, all three are review papers published in Reviews of Modern Physics. As we can see, the number of cited papers reaches a stunning value of around 600.

    Number of References | Title | Authors | Year | Journal | Volume
    607 | Electrodynamics of correlated electron materials | D. N. Basov, R. D. Averitt, D. van der Marel, M. Dressel, and K. Haule | 2011 | RMP | 83
    602 | Energy Levels of Light Nuclei. III | W. F. Hornyak, T. Lauritsen, P. Morrison, and W. A. Fowler | 1950 | RMP | 22
    582 | Metal-insulator transitions | M. Imada, A. Fujimori, and Y. Tokura | 1998 | RMP | 70
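
    Both rankings (most cited papers and papers with the most references) can be obtained from a list of citation pairs. The sketch below assumes the data are available as (citing paper, cited paper) pairs; the identifiers are placeholders and the function name is hypothetical.

```python
from collections import Counter

def degree_counts(citation_pairs):
    """citation_pairs: iterable of (citing_id, cited_id).

    Returns (citations received per paper, references made per paper).
    """
    unique_pairs = set(citation_pairs)  # ignore duplicated records
    cited = Counter(cited_id for _, cited_id in unique_pairs)
    citing = Counter(citing_id for citing_id, _ in unique_pairs)
    return cited, citing

# Placeholder identifiers, for illustration only.
pairs = [("review", "a"), ("review", "b"), ("review", "c"),
         ("b", "a"), ("c", "a"), ("c", "a")]
cited, citing = degree_counts(pairs)
print(cited.most_common(1))   # most cited paper
print(citing.most_common(1))  # paper with the most references
```

    Because only citations between papers in the same collection are counted, the numbers are lower than those reported by Google Scholar.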


    CORRELATIONS
    In the search for relations between variables, I calculated the correlation coefficients between parameters such as the number of citations, number of references, publication date, journal volume, etc. In particular, I was interested in whether there were any significant relationships between the number of citations and other parameters. First I checked for linear relations between variables using the Pearson correlation coefficient. However, relying on this metric alone can lead to completely incorrect conclusions, because it detects only linear dependence. Thus, alongside the Pearson coefficient (PC), I also calculated the distance correlation (DC). The advantage of the DC is that, unlike the PC, it is equal to zero if and only if the two variables are independent, regardless of the functional form of the relationship between them. The results of these calculations are presented in the figure below. Sadly, but not surprisingly, there are no obvious relationships between the number of citations and: journal, number of pages, volume, issue, number of authors, number of affiliations, publication date, number of countries the authors come from, or number of references used in a paper. The full data can be found under this link.
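
    The difference between the two coefficients is easiest to see on a purely non-linear relation. The sketch below computes the Pearson coefficient and the (biased) sample distance correlation for y = x^2 on a symmetric grid; it is a plain O(n^2) illustration with made-up data, not the implementation used for the figure.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def _centered_distances(v):
    """Double-centered matrix of pairwise distances |v_i - v_j|."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    # The distance matrix is symmetric, so row means equal column means.
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

# y depends on x deterministically, but not linearly.
x = [i / 10 for i in range(-20, 21)]
y = [v * v for v in x]
print(f"Pearson: {pearson(x, y):.3f}, distance correlation: {distance_correlation(x, y):.3f}")
```

    The Pearson coefficient vanishes by symmetry although y is a function of x, while the distance correlation is clearly positive.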

    f3

    TITLES AND ABSTRACTS STATISTICS
    Titles and abstracts published in PRA, PRB, PRC, PRD and PRE between the years 1990 and 2016 were tokenized as described in the next section. Below, the distribution of the number of tokens in a title or an abstract is given. The main purpose of this calculation is to find a good abstract length for the journal classification with Deep Learning described in Section: "JOURNAL CLASSIFICATION - DEEP NEURAL NETWORK". For the forthcoming analysis, I chose the length 200 because most of the abstracts are shorter than 200 words.

    f3

    The distribution of word occurrences in all abstracts closely follows a power law with an exponent of about -1.5 over about 4 orders of magnitude. Similarly to the number of citations, the distribution starts to decay rapidly for numbers of occurrences larger than 20 000, and the truncated power law is again statistically more likely.

    f3



    Word2Vec EMBEDDING
    I embedded the words from the abstracts of papers published in PRA, PRB, PRC, PRD and PRE between the years 1990 and 2016 using Word2Vec from the Gensim toolkit. Each abstract was split into a set of sentences. Then, all non-alphabetic characters were removed, and the remaining letters were converted to lower case. Next, all the stop words were removed, and the remaining words were stemmed with PorterStemmer. Finally, all words shorter than 2 letters were removed. The abstracts tokenized in this way were then used for the embedding. I embedded the words into a 200-dimensional space, using only words that occurred at least 5 times (min_count=5), with a window size of 5 and the skip-gram model. The model was trained for 5 iterations. The resulting embedding is visualized below. On the left we have the t-SNE clustering. Some clusters are visible, suggesting word aggregation in the high-dimensional space. On the right, PCA is shown.
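
    The tokenization steps listed above can be sketched in plain Python as follows. The stop-word list is a tiny illustrative stand-in, the function name is hypothetical, and stemming is left as an optional callable (for example the stem method of NLTK's PorterStemmer could be passed in) so that the sketch has no external dependencies.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "we"}

def tokenize_abstract(text, stem=lambda w: w, min_len=2):
    """Return one token list per sentence, following the steps in the text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = []
    for sentence in sentences:
        # Drop non-alphabetic characters and convert to lower case.
        letters_only = re.sub(r"[^A-Za-z]+", " ", sentence).lower()
        # Remove stop words, then stem what is left.
        tokens = [stem(w) for w in letters_only.split() if w not in STOP_WORDS]
        # Remove tokens shorter than min_len letters.
        tokens = [t for t in tokens if len(t) >= min_len]
        if tokens:
            tokenized.append(tokens)
    return tokenized

abstract = "We study the spin-1/2 Heisenberg model in 2D. A gap is found!"
sentences = tokenize_abstract(abstract)
print(sentences)
```

    The resulting list of token lists is the input format expected by Gensim's Word2Vec. With the parameters quoted above, the call would look roughly like Word2Vec(sentences, size=200, window=5, min_count=5, sg=1, iter=5) in the Gensim releases of that time; in Gensim 4 and later, size and iter are named vector_size and epochs.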

    f3

    JOURNAL CLASSIFICATION - DEEP NEURAL NETWORK
    Three different Deep Neural Network architectures were used for journal classification based on abstract content. These architectures are: Feed-Forward (FF), Convolution 1D (CNN-1D) and LSTM (please refer to the dnn/train_keras.py file for more details). Dropout layers were used to prevent overfitting. The dataset was split into training and testing sets, each having 37 500 abstracts drawn equally from 5 journals: PRA, PRB, PRC, PRD, and PRE. The first 200 tokenized words (padded with 0s if the abstract was shorter than 200) from every abstract were taken. For each of the NN architectures, an Embedding layer was used with 10 000 words in the dictionary, and the Word2Vec word embedding model (size=200, window=5, min_count=5, iter=5, skip-gram). The models were trained for 10 epochs and then evaluated on the testing dataset. The accuracies of the models are: acc(FF)=79.41%, acc(CNN-1D)=86.13%, acc(LSTM)=89.06%. The performance of the LSTM is very good, especially if we take into account that this is a multi-class classification problem (random guessing among 5 balanced classes would give 20%), and the network has only a handful of hidden layers. This performance might be improved above 90% accuracy with more epochs. With 50 epochs, the LSTM network alone reaches acc(LSTM|50)=89.52%. Below, ROCs are given for the three tested architectures.
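
    The fixed-length input described above (a dictionary of limited size, the first 200 tokens, zero padding) can be sketched as follows. The details are assumptions made for the illustration: out-of-vocabulary tokens are simply dropped, padding is added on the right, and index 0 is reserved for padding; the actual preprocessing in dnn/train_keras.py may differ.

```python
from collections import Counter

def build_vocabulary(token_lists, max_words):
    """Map the most frequent tokens to indices 1..max_words-1 (0 is padding)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    most_common = counts.most_common(max_words - 1)
    return {tok: i for i, (tok, _) in enumerate(most_common, start=1)}

def encode(tokens, vocab, max_len):
    """Take the first max_len tokens, map them to indices, pad with 0s."""
    ids = [vocab[t] for t in tokens[:max_len] if t in vocab]
    return ids + [0] * (max_len - len(ids))

# Toy corpus; the text uses max_words=10000 and max_len=200.
docs = [["spin", "gap", "spin"], ["gap", "model"]]
vocab = build_vocabulary(docs, max_words=3)
short = encode(["spin", "model", "gap"], vocab, max_len=5)
long_ = encode(["gap"] * 7, vocab, max_len=5)
print(vocab, short, long_)
```

    Every abstract is thus represented by a vector of the same length, which is what an embedding layer followed by a feed-forward, convolutional or recurrent network expects.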

    f3

    For the LSTM architecture, I additionally calculated ROCs for each class separately (figure below).

    f3

    To inspect what kinds of misclassification the LSTM model makes, I plotted a confusion matrix (below). We can see that papers published in PRA are sometimes classified as PRE or PRB papers, and very rarely as PRC or PRD. The confusion with PRE is not surprising, as this journal is a daughter of PRA. PRB, PRC, and PRD are classified very accurately, although PRC and PRD are sometimes interchanged, and PRB is occasionally misclassified as PRA. As for PRE, the situation is similar to PRA.
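
    A confusion matrix of this kind can be computed from the true and predicted journal labels as sketched below. The labels are invented for the illustration and do not reproduce the results of the LSTM model.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

classes = ["PRA", "PRB", "PRC", "PRD", "PRE"]
# Invented labels, for illustration only.
true = ["PRA", "PRA", "PRA", "PRA", "PRB", "PRC", "PRD", "PRE"]
pred = ["PRA", "PRA", "PRE", "PRB", "PRB", "PRC", "PRD", "PRA"]
cm = confusion_matrix(true, pred, classes)
for name, row in zip(classes, cm):
    print(name, row)
print("accuracy:", accuracy(cm))
```

    Off-diagonal entries in a row show which journals the papers of that journal are mistaken for; dividing each row by its sum gives the per-journal misclassification rates.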

    f3


    >>> THE END <<<