Maciej Janicki's website

2020-11-08 science workflow

Following science journals via RSS

Working in academia means having to follow the current state of research. The “usual” way to do this is to download PDFs from the websites of conferences and journals. Needless to say, this is terribly inefficient: it requires remembering when something new is expected to appear, manually navigating through the website, and so on. Even worse, there is a recent trend to rely increasingly on social media (Twitter, ResearchGate) and on bloated, proprietary, data-mining “bibliography managers” (like Mendeley) for suggestions on what to read. In this post, I describe an efficient, distraction-free and privacy-protecting workflow for keeping up with the latest research.

The general idea is to track journal updates via RSS feeds in a text-mode RSS reader. Furthermore, we want the link in an RSS item to take us directly to the PDF version of the paper (opened in our favourite reader), which usually requires a slight hack.

Use RSS to follow updates

Once upon a time, there was a universal way of following updates on any website: RSS feeds. The standard is dead simple: each site exposes an XML file containing a list of recent “posts”. These typically contain a title, a publication date, a short summary and a link to the full article. There are dozens of clients, both text-based and graphical, that you can use to subscribe to such feeds and gather all updates in one place - without logging in to any account or giving away information about which sources you follow and which articles you read.
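For illustration, here is roughly what a minimal RSS 2.0 feed looks like - all titles, dates and URLs below are invented:

```
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Journal</title>
    <link>https://example.org</link>
    <item>
      <title>A New Paper</title>
      <link>https://example.org/papers/1</link>
      <pubDate>Sun, 08 Nov 2020 00:00:00 GMT</pubDate>
      <description>Short summary of the paper.</description>
    </item>
  </channel>
</rss>
```

The reader simply re-downloads this file periodically and shows you the items you have not seen yet.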

Here’s how I browse the journal Journal for Data Mining & Digital Humanities (JDMDH) in my text-based client newsboat:

[demo: browsing the JDMDH feed in newsboat]

To get the list of articles as an RSS feed, all you need to do is add the feed URL (e.g. https://jdmdh.episciences.org/rss/papers for that journal) to newsboat’s list of URLs. Most sites have such a feed, though it might occasionally be hard to find (look for “RSS” links).
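For reference, newsboat keeps its subscriptions in ~/.newsboat/urls, one feed per line, optionally followed by quoted tags. The second line below is an assumed example of an arXiv category feed - substitute whichever category you follow:

```
https://jdmdh.episciences.org/rss/papers "journals"
http://arxiv.org/rss/cs.CL "arxiv"
```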

Shorten the way from an RSS item to PDF

In an ideal world, I would get the PDF on my desktop by hitting just one keystroke in newsboat. This is sometimes possible. Where it is not, a terminal-based web browser can minimize the pain and distraction caused by having to pass through a website.

In practice, most RSS feeds contain a link that leads not straight to the PDF, but to a website from which the PDF can be downloaded. In the easiest case, the PDF’s URL can be deduced from the website URL. For example, the “Journal of Data Mining & Digital Humanities” refers to papers using a URL like https://jdmdh.episciences.org/3905, while the PDF can be accessed under https://jdmdh.episciences.org/3905/pdf. Thus, we just need to append a trailing /pdf to the link.

To implement this, let’s start by setting the web browser used by newsboat (i.e. the program called when you hit the keybinding for “open”) to a custom script. Edit the file ~/.newsboat/config and add the line:

browser ~/.newsboat/browser.sh

Then create the file ~/.newsboat/browser.sh (and make it executable with chmod +x) with the following content:

#!/bin/bash
# Dispatch on the article URL, which newsboat passes as $1.
case "$1" in
http://jdmdh.episciences.org/*|https://jdmdh.episciences.org/*)
    # JDMDH: the PDF lives under the article URL + /pdf.
    curl "$1"/pdf 2> /dev/null | zathura - &
    ;;
*)
    # Default: open everything else in w3m.
    exec w3m "$1"
    ;;
esac

Right now we have one rule for the “Journal of Data Mining & Digital Humanities” and a default rule at the end, which opens everything else using w3m. The rule for JDMDH executes the following command:

curl "$1"/pdf 2> /dev/null | zathura - &

which:

  • appends /pdf to the URL referenced in the RSS item,
  • downloads the PDF file and writes it to standard output,
  • launches zathura that reads the PDF from stdin (option -).

Now pressing o on an item in newsboat should pull up the PDF in zathura.

A small annoyance of this method is that you need a separate rule for almost every source, because each of them may transform the webpage URL into the PDF URL differently - you have to figure out the rule by comparing the two URLs for each new source. Here’s one for arXiv:

http://arxiv.org/abs/*)
    PDFURL=$(echo "$1" | sed 's|^http://|https://|; s|/abs/|/pdf/|; s|$|.pdf|')
    curl "$PDFURL" 2>/dev/null | zathura - &
    ;;
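When writing such a rule, it helps to test the URL rewrite on its own before wiring it into browser.sh. A minimal sketch of the same rewrite as a standalone function (the function name is my own, for illustration):

```shell
#!/bin/bash
# Rewrite an arXiv abstract URL into the corresponding PDF URL,
# mirroring the sed expression from the rule above:
# force https, swap /abs/ for /pdf/, and append the .pdf extension.
abs_to_pdf() {
    echo "$1" | sed 's|^http://|https://|; s|/abs/|/pdf/|; s|$|.pdf|'
}

abs_to_pdf "http://arxiv.org/abs/2010.12345"
# prints https://arxiv.org/pdf/2010.12345.pdf
```

Once the function produces the right URL for a few sample links, the sed expression can be dropped into a new case branch.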

Set up w3m to open PDFs directly

For some sites, the relation between the URLs may be unpredictable, or downloading with curl/wget may not work for some reason. In those cases our default rule opens the site in w3m. If there’s a “Download PDF” link somewhere, you can set up w3m to open the file directly in zathura by adding the following line:

application/pdf; zathura %s

to either ~/.mailcap or ~/.w3m/mailcap. Then all you need to do is navigate the site to find the PDF link - if you open the site in w3m, you at least avoid looking at the ads and flashy web design.

Conclusion

The above workflow gives me a standardized, minimum-distraction way to follow newly appearing articles. In particular, a glance at arXiv, which updates almost daily, lets me keep up with the field. All sources are accessible through the same text-based interface, which helps me focus on the text itself. As in the other solutions I present, it lets me flexibly and seamlessly combine the applications best suited for each job: newsboat for RSS, w3m for quick Web browsing without really looking at the site, zathura for reading PDFs. Fresh research is delivered straight to my desk.