What does your project do?

Analyzing the Vitals.com Doctor Ratings Website

For our project, we decided to assess the usefulness of the physician rating website Vitals. To accomplish this, we scraped data from the website and analyzed it using the pandas package. Ideally, we wanted to compare ratings across specialties and locations to draw conclusions about the perceived quality of doctors, but the heavily skewed ratings limited the analyses we could conduct.

What modules did you use, if any?

  • requests
  • BeautifulSoup4
  • Selenium
  • time
  • re
  • csv
  • pandas
  • matplotlib
  • numpy

Explain how we can run your project

To run the project, several dependencies must be installed in addition to Python:

  • Through pip: Selenium, Requests, and BeautifulSoup4, e.g. pip install beautifulsoup4 selenium requests
  • ChromeDriver 2.9: http://chromedriver.storage.googleapis.com/index.html?path=2.9/ (this is required for Selenium to automate the Chrome browser; it was necessary because on all scraped pages, results were loaded by client-side Javascript calls that had to be executed in a real browser)
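
Once the driver is downloaded, a quick smoke test can confirm the setup works. This is a minimal sketch, not part of the original notebook; the driver path is a placeholder to adjust for your machine:

from selenium import webdriver

# Placeholder path -- point this at the downloaded chromedriver binary
browser = webdriver.Chrome("C:/Path/To/Chromedriver.exe")
browser.get("https://www.vitals.com")
print(browser.title)  # prints the page title if everything is wired up correctly
browser.quit()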

Additionally, the file path for downloaded webpages and for the CSV must be set below:

In [1]:
# Be sure to include the final forward slash in the file path!
filepath = "C:/Desired/File/Path"
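
If the target folder does not exist yet, it can be created up front (an optional convenience step, not part of the original notebook):

import os
# Create the download folder if it does not already exist
os.makedirs(filepath, exist_ok=True)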

An evaluation of Python's suitability to your task

Some problems were encountered while scraping the data. As mentioned above, results were loaded by client-side Javascript that needed to be executed in a browser. This posed initial complications; however, Selenium let us automate the process. Despite these problems, Python was absolutely suited to the task and allowed us to complete this step with relatively few issues.

In past analyses, we have typically used R or SAS. While we are more comfortable with those languages, Python was more than suitable for this analysis as well.

Obtaining the Data

Data was scraped from the Vitals website, with only physicians from the state of New York included in our dataset. We accomplished this using webdriver from the selenium package along with BeautifulSoup. The scraped data was then exported to a CSV file using the csv package.

Data scraping proceeded as follows:

Creating a function to save all the pages of search results using selenium:

In [2]:
from selenium import webdriver
import time

def save_page(url, out_file, wait_for_class='search-card'):
    """ Saves a page, using the Selenium browser testing framework to run any
        Javascript, writing the resulting final DOM as HTML source to disk.
        Waits until an element of the given class is loaded to make sure the
        Javascript has actually completed before the source is retrieved. """

    browser = webdriver.Chrome("C:/Path/To/Chromedriver.exe")
    browser.get(url)
    # Wait until an element with the given class is loaded
    while True:
        try:
            # if the element cannot be found, this will throw
            browser.find_element_by_class_name(wait_for_class)
            break  # otherwise the loop ends here
        except Exception:
            time.sleep(0.5)  # not loaded yet - pause briefly, then try again
    # source has loaded - write out the result and close the browser
    with open(out_file, 'w', encoding="utf-8") as f:
        f.write(browser.page_source)
    browser.quit()

# loop through 42 pages of search results & save to the working directory
page = 0
while page <= 41:
    url = 'https://www.vitals.com/search?display_type=Doctor&state=NY&page=' + str(page)
    page += 1
    save_page(url, filepath + 'Vitals' + str(page) + ".html")
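
Note that the busy-wait loop above polls as fast as it can. Selenium also ships an explicit-wait helper that does the same job with a timeout; a sketch of the equivalent wait using WebDriverWait (the 30-second timeout is an assumed value):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_results(browser, wait_for_class='search-card', timeout=30):
    # Blocks until an element with the given class appears,
    # raising TimeoutException if it never does
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, wait_for_class)))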

Create a function to extract the necessary data from the downloaded pages using BeautifulSoup:

In [3]:
import re
from bs4 import BeautifulSoup

def parse_page(page_file):
    with open(page_file, encoding="utf-8") as f:
        page_contents = f.read().strip()
    my_soup = BeautifulSoup(page_contents, "html.parser")
    data = []
    # Each "search-card" div corresponds to one physician
    for card in my_soup.find_all("div", attrs={"class": "search-card"}):
        # Scrape physician name
        name_element = card.find("span", attrs={"class": "name"})
        if name_element is None:
            continue
        name = name_element.text.strip()
        # Scrape physician specialty
        specialty_element = card.find("span", attrs={"class": "specialty"})
        if specialty_element is None:
            continue
        specialty = specialty_element.text.strip().split(',')[0]
        # Scrape physician location
        location_element = card.find("span", attrs={"class": "address"})
        if location_element is None:
            continue
        location = location_element.text.strip().split(',')[0]
        # Scrape physician rating
        rating_element = card.find("span", attrs={"class": "rating-text"})
        if rating_element is None:
            continue
        rating = rating_element.text.strip()
        # Scrape the number of reviews for each physician
        # (the digits in the review link's markup give the review count)
        reviews_links = card.select("a[href*=reviews]")
        if not reviews_links:
            continue
        reviews = ''.join(re.findall(r'\d+', str(reviews_links)))
        row = (name, specialty, location, rating, reviews)
        data.append(row)

    return data
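
Before exporting everything, a quick spot check of a single saved page helps verify the parser (a hypothetical check, assuming the first page was saved as Vitals1.html):

sample_rows = parse_page(filepath + "Vitals1.html")
print(len(sample_rows), "cards parsed")
print(sample_rows[:3])  # peek at the first few (name, specialty, location, rating, reviews) tuples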

Export the scraped information to a CSV file for ease of use:

In [4]:
import csv

# newline='' prevents blank rows on Windows; filepath already ends in a slash
with open(filepath + 'healthreviews.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    titlerow = ("Name", "Specialty", "Location", "Rating", "Reviews")
    writer.writerow(titlerow)
    # Loop through all downloaded pages of the website; each "card" forms a row of the CSV
    for page in range(1, 43):
        all_rows = parse_page(filepath + "Vitals" + str(page) + ".html")
        for row in all_rows:
            writer.writerow(row)

Analyzing the data

The resulting CSV was loaded into a DataFrame using the pandas package. The analysis and visualizations were accomplished using a combination of pandas and pyplot from the matplotlib package. After cleaning, the dataset contained 958 records with the following variables:

  • Name: Name of physician
  • Specialty: Specialty of physician
  • Location: Location of physician
  • Rating: Rating of the physician (1-5)
  • Reviews: Number of Reviews for Physician

Exploratory analysis of the dataset proceeded as follows:

Reading the data into pandas and performing some cleaning

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Reading in the resulting csv. 1003 records for physicians
healthReview = pd.read_csv(filepath + "healthreviews.csv")

# Keeping doctors that had one or more reviews
healthReview = healthReview[healthReview['Reviews'] > 0]

# Checking the dataset for any missing values. 45 missing values returned
healthReview.isnull().sum().sum()

# Dropping missing values. 958 Rows remaining
healthReview = healthReview.dropna()

# Sorting dataset by number of reviews, descending
healthReview = healthReview.sort_values(['Reviews'], ascending = False)

Taking a peek into the dataset, looking at the 10 doctors with the most reviews

In [6]:
healthReview.head(10).style.bar(subset=['Reviews'], color='#5fba7d')
Out[6]:
     Name                     Specialty                           Location       Rating  Reviews
14   Dr. John F Morrison      Neurological Surgery                Buffalo        5       315
166  Dr. Michael I Horowitz   Surgery of the Hand                 Brooklyn       5       310
288  Dr. Paul S Cohen         Internal Medicine                   Syracuse       5       210
229  Dr. Alexander H Tejani   Orthopaedic Surgery                 Brooklyn       5       196
505  Yevgeniy Vaynkof M.D.    Family Medicine                     New York       5       193
6    Dr. Shahriar Shayani     Internal Medicine                   New Hyde Park  5       114
3    Dr. Daniel Weitz         Clinical Cardiac Electrophysiology  New York       5       61
4    Dr. Mehran Alagheband    Dermatology                         Glen Cove      5       56
5    Dr. Justin Cohen MD      Otolaryngology                      New York       5       49
381  Dr. Steven E Goldberg    Internal Medicine                   Troy           5       36

General summary statistics for ratings and review variables

In [7]:
# Generating some summary statistics for rating and reviews variables
healthReview.describe().round(2)
Out[7]:
        Rating  Reviews
count   960.00   960.00
mean      5.00     9.39
std       0.01    18.75
min       4.80     3.00
25%       5.00     5.00
50%       5.00     6.00
75%       5.00    10.00
max       5.00   315.00

We can see that the average rating for a physician is approximately 5, the maximum rating available, and that each physician has an average of 9.4 reviews. To visualize these distributions, we next created box plots for the two variables.

Creating box plots to examine the distributions of the rating and reviews variables

In [8]:
# Box plot for ratings variable (On a 1-5 scale)
healthReview['Rating'].plot(kind = 'box')
[Box plot of the Rating variable]

This box plot is concerning, as the interquartile range is nonexistent for ratings. The purpose of the website is to compare reviews of physicians, yet nearly all of them appear to be rated 5! To examine this more closely, we decided to take a look at the five lowest-rated physicians.
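
A quick tally of the distinct rating values would also make the collapse explicit (a check we did not include in the original output):

# How many physicians fall at each rating value?
healthReview['Rating'].value_counts()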

In [9]:
healthReview.sort_values(['Rating']).head(5)
Out[9]:
     Name                   Specialty        Location       Rating  Reviews
2    Dmitriy Fuzaylov M.D.  Pain Management  Malverne       4.8     6
1    Michael Nguyen M.D.    Pain Management  New York       4.9     27
843  Dr. Jill A Jacobson    Psychiatry       New York       5.0     5
842  Dr. Khalida Itriyeva   Pediatrics       New Hyde Park  5.0     5
841  Dr. Morton D Borg      Pediatrics       New York       5.0     5

Only two physicians out of the 958 in our dataset have a rating of less than 5. It seems clear from this that the Vitals website cannot be used to accurately gauge the quality of a physician.
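
This can be confirmed directly with a one-line check (assuming the cleaned healthReview frame from above):

# Number of physicians rated below the 5-star maximum
(healthReview['Rating'] < 5).sum()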

In [10]:
# Box plot for reviews variable
healthReview['Reviews'].plot(kind = 'box', ylim = (0,50))
[Box plot of the Reviews variable, with the y-axis limited to 0-50]

It appears that the majority of physicians in our dataset have fewer than 10 reviews. Even so, that leaves a remarkably large number of perfect 5-star ratings.
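
The share of lightly reviewed physicians can be computed directly (a quick check under the same assumptions):

# Fraction of physicians with fewer than 10 reviews
(healthReview['Reviews'] < 10).mean()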

Creating a bar graph for location, including the 10 locations with the highest number of doctors with reviews

In [11]:
# Preparing the data to visualize the number of physicians in each location
reviewLocation = pd.DataFrame(healthReview.groupby('Location')['Location'].count())
reviewLocation = reviewLocation.sort_values(['Location'], ascending = False)
reviewLocation.columns = ['Physicians Reviewed']
reviewLocation.head(10).plot(kind = 'bar')
[Bar chart of the number of reviewed physicians in the 10 most common locations]

We filtered our selection of doctors to those residing in the state of New York, but the resulting locations aren't completely clean. Flushing appears as its own location despite being part of Queens, and the Manhattan grouping (New York) includes some doctors in other New York City locations. We can see that Manhattan is the location with the largest number of physicians in our dataset, which is perhaps a little surprising given the relative populations of the boroughs. Overall, we would expect the highest number of reviews to be within NYC, and that is reflected in the dataset.
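
One way to tidy this up would be to collapse known neighborhoods into their boroughs before grouping. A minimal sketch, using a hand-built (and necessarily incomplete) hypothetical mapping:

# Hypothetical, incomplete neighborhood-to-borough mapping
borough_map = {'Flushing': 'Queens', 'Jamaica': 'Queens', 'New York': 'Manhattan'}
healthReview['Borough'] = healthReview['Location'].map(borough_map).fillna(healthReview['Location'])
healthReview.groupby('Borough')['Borough'].count().sort_values(ascending=False).head(10)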

Examining the most common specialties to receive reviews

In [12]:
# Bar graphing count of physicians for each specialty
healthReview.groupby('Specialty')['Reviews'].count().sort_values(ascending=False).head(10).plot(kind = 'bar')
[Bar chart of physician counts for the 10 most common specialties]

Internal medicine, pediatrics, and family medicine have the largest numbers of physicians. This is consistent with what we would expect, as these specialties are among those most commonly chosen by medical students. The one outlier seems to be surgery, another often-chosen specialty.

Conclusion

Unfortunately, the Vitals website does not seem to be an effective resource for making an informed choice of physician. Only two ratings out of 958 deviated from a perfect 5 stars. The proportion of physicians in each location and specialty at least makes some sense, which lends a little credibility to physician participation on the website. The sheer number of 5-star reviews might suggest some manipulation of the rating system by the physicians themselves. While rating systems such as Uber, Airbnb, and Amazon typically trend toward the upper end, they generally show some variation; Vitals has essentially all perfect scores with virtually no variation, which makes it ineffective as an analysis tool. It might be a question of website maturity or public awareness, but it seems that Vitals needs to develop more of a presence to compete with Healthgrades and ZocDoc.