Searching for Standards

During an external QA review on a project, one of the comments that came back was a list of the standards cited in our specifications, with notes on which ones were out of date or superseded. This is very useful to have, but hugely laborious to produce manually.

So why get a person to spend hours doing what a machine could do in seconds?

Taking a step back, none of this would be necessary if the spec-writing software updated these references automatically. We all know it doesn’t, so a second check is definitely useful.

So how do we approach this? We need to break the problem into distinct steps:

  1. Import a specification and extract all the standards referenced within
  2. Search for these standards against a database of current standards
  3. List the standards with the current versions, or flag if they have been superseded

So for this job, I’ll dig out my trusty Python toolkit again…

Starting at #1 – We need to import a PDF and extract the text. For this I used the Python module ‘pdfplumber’, which runs through each page and extracts all the text.
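As a rough illustration of this step, here is a minimal sketch of pulling the text out of every page with pdfplumber. The function names `join_page_text` and `extract_pdf_text` are my own for this example and are not necessarily what the full script uses.

```python
def join_page_text(pages):
    """Concatenate the text of each page, skipping pages with no extractable text."""
    chunks = []
    for page in pages:
        # extract_text() can come back empty (or None) for image-only pages
        text = page.extract_text()
        if text:
            chunks.append(text)
    return "\n".join(chunks)


def extract_pdf_text(pdf_path):
    """Open a PDF with pdfplumber and return the text of all its pages."""
    import pdfplumber  # third-party package: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return join_page_text(pdf.pages)
```

Scanned specifications that are only images will come back empty with this approach; they would need OCR first.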

Taking that text, we then need to search for standards. Helpfully, standards are usually written in a common format (e.g. BS7671, BSEN62305, DW144, EN54), which allows us to set up a regex search for that format. The pattern below looks for one of the known prefixes, an optional space, and then the digits that follow. It stops at the first non-digit character, so part numbers and years such as the ‘-5:2018’ in ‘BS ISO 8528-5:2018’ are not captured.

    import re

    regex = r"(BS|BS EN|EN|ISO|IEC|BSRIABG|DW) ?\d+"
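To see what this pattern does and does not pick up, here is a small self-contained check. The sample sentence is made up for illustration, and `find_standards` is a hypothetical helper name.

```python
import re

# Same pattern as in the post
STANDARD_PATTERN = re.compile(r"(BS|BS EN|EN|ISO|IEC|BSRIABG|DW) ?\d+")


def find_standards(text):
    """Return every match of the pattern, in order, duplicates included."""
    return [match.group() for match in STANDARD_PATTERN.finditer(text)]


# Made-up sentence, purely for illustration
sample = "Comply with BS 7671:2018, BS EN 62305-1, BSEN62305 and ISO8528-5."
print(find_standards(sample))
# ['BS 7671', 'BS EN 62305', 'EN62305', 'ISO8528']
```

Two things are worth noticing: the part number and year are dropped, and ‘BSEN62305’ written without a space is picked up as ‘EN62305’, because the pattern only knows ‘BS EN’ with a space.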

After that we can add the standards found to a list:

    # Collect every match (repeats included) in the order it appears
    matches = re.finditer(regex, input_text, re.MULTILINE)
    list_of_standards_in_text = []
    for match in matches:
        list_of_standards_in_text.append(match.group())

['BS7430', 'ISO8528', 'BS7698', 'EN60947', 'EN60529', 'EN61082', 'ISO8528', 'BS7671', 'BS7430', 'EN62305', 'BS7671', 'EN61000', 'BS7671', 'BS7430', 'BS7671', 'BS7430', 'BS7671', 'BS7671', 'EN13601',
 'EN61082', 'BS7430', 'BS7671', 'BS7430', 'BS7671', 'IEC60255', 'BS142', 'EN62271', 'EN62271', 'EN50522', 'EN62271', 'EN62271', 'IEC60466', 'EN60529', 'EN60076', 'EN61111', 'EN61914', 'EN61914',
 'BS5839', 'BS8519', 'EN61869', 'EN60044', 'EN60044', 'EN50522', 'EN62271', 'EN62271', 'EN50522', 'EN62271', 'EN62271', 'EN62271', 'EN62271', 'EN62271', 'EN60076', 'BS5467', 'BS5839', 'BS8434',
 'EN61000', 'BS7671', 'EN61914', 'BS5839', 'BS8519', 'BS7671', 'BS9999', 'BS8519', 'BS8519', 'EN60947', 'BS7671', 'EN50174', 'EN61643', 'EN62305', 'EN61643', 'EN62305', 'BS7671', 'BS7671', 'BS7671',
 'EN60439', 'EN60529', 'EN60439', 'EN60529', 'EN62262', 'EN60947', 'BS7629', 'BS6724', 'EN60898', 'EN60947', 'EN61009', 'EN50470', 'EN60445', 'BS7671', 'BS31', 'BS7671', 'BS8488', 'BS7671', 'EN62275',
 'EN50085', 'EN50085', 'BS4678', 'EN61386', 'EN61386', 'EN61386', 'EN61386', 'BS4607', 'BS7629', 'EN50525', 'EN50525', 'BS1363', 'EN60309', 'EN60309', 'EN60529', 'EN60529', 'EN1838', 'EN50172',
 'BS5266', 'BS7671', 'EN50085', 'EN50085', 'BS4678', 'EN61386', 'EN61386', 'EN61386', 'EN61386', 'BS4607', 'BS7211', 'BS7671', 'BS5266', 'BS7671', 'EN50171', 'BS5266', 'BS7671', 'EN1838', 'EN50172',
 'BS5266', 'BS7671', 'EN61537', 'BS7846', 'EN50173', 'EN50174', 'IEC11801', 'EN50173', 'EN61537', 'EN50288', 'EN60793', 'EN60793', 'BS5733', 'EN60669', 'EN50173', 'EN61643', 'BS6701', 'BS7671',
 'EN50174', 'EN50174', 'EN50174', 'BS7671', 'EN50174', 'EN50174', 'EN60825', 'BS7671', 'EN50346', 'EN61935', 'BS7594', 'EN61386', 'EN60529', 'EN60529', 'EN60268', 'BS7594', 'BS7671', 'BS7671',
 'BS7671', 'BS7594', 'EN60728', 'BS7671', 'EN61537', 'EN61537', 'BS5733', 'BS7671', 'BS5839', 'BS5839', 'BS5839', 'BS5839', 'BS5839', 'BS5839', 'EN61537', 'EN61386', 'EN61386', 'EN61386', 'EN61386',
 'BS4607', 'BS7629', 'EN54', 'EN54', 'EN54', 'EN54', 'BS5839', 'EN1155', 'EN54', 'EN54', 'BS5839', 'BS5839', 'BS7671', 'BS7671', 'BS8300', 'EN50085', 'EN50085', 'BS4678', 'EN54', 'BS7671',
 'BS7671', 'BS7671', 'BS5839', 'EN50085', 'EN50085', 'BS4678', 'BS7629', 'BS5839', 'EN54', 'EN54', 'EN54', 'BS7671', 'BS5839', 'BS7671', 'BS5839', 'BS5839', 'BS7671', 'BS5839', 'EN62305', 'EN62305',
 'EN62305', 'EN62305', 'EN62305', 'EN62561', 'EN62561', 'EN755', 'EN755', 'EN755', 'EN13601', 'EN62305', 'EN62305', 'EN62305', 'EN62305', 'EN62305', 'EN62305', 'EN62305', 'EN62305']

The list is then reduced to a shorter one with the duplicates removed:

['ISO8528', 'EN50346', 'BS5467', 'IEC60466', 'EN50085', 'EN61914', 'BS7846', 'EN60825', 'EN60947', 'BS4607', 'EN62305', 'EN1838', 'EN50171', 'EN61000', 'EN60044', 'EN61009', 'EN50525', 'EN61869',
 'EN50288', 'EN62275', 'IEC60255', 'EN61537', 'BS6724', 'BS4678', 'EN60309', 'BS5266', 'EN61935', 'BS31', 'EN54', 'EN1155', 'BS8488', 'EN50172', 'EN62561', 'EN755', 'EN13601', 'EN60669', 'BS6701',
 'EN62271', 'BS8519', 'BS7430', 'EN50174', 'BS8300', 'EN60439', 'EN50173', 'BS1363', 'BS5733', 'EN61111', 'IEC11801', 'BS7698', 'BS142', 'EN50470', 'BS7211', 'EN60268', 'EN60728', 'EN50522', 'EN61082',
 'BS7629', 'BS7594', 'EN61386', 'EN60898', 'BS5839', 'EN60793', 'EN60529', 'BS7671', 'BS9999', 'BS8434', 'EN62262', 'EN61643', 'EN60076', 'EN60445']
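One possible way to do this de-duplication is sketched below. It is not necessarily how the full script does it: `unique_standards` is a hypothetical helper, and stripping the spaces first is my own assumption so that ‘BS 7671’ and ‘BS7671’ count as the same standard.

```python
def unique_standards(found):
    """Remove duplicates while keeping the order of first appearance."""
    # Assumption: spaces are not significant, so 'BS 7671' == 'BS7671'
    normalised = (item.replace(" ", "") for item in found)
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(normalised))


print(unique_standards(["BS 7671", "EN54", "BS7671", "EN54", "ISO8528"]))
# ['BS7671', 'EN54', 'ISO8528']
```

Using `sorted(set(found))` instead would give an alphabetical list, which can be easier to scan by eye.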

#2 – Search against a current database. I don’t have an up-to-date database of current and superseded standards. It would be great to have, but it’s quite a big undertaking. So instead I had my first go at ‘web scraping’, which is where you pull the information from a website. In this case the website is the BSI Group’s (https://www.bsigroup.com/en-GB/), which lets you search for standards and shows additional information about their status.

After playing around with the website, you can see that the search URL is in an easy to use format:

https://shop.bsigroup.com/SearchResults/?q={standard_name}&pg=1&no=100&c=100&t=p

Here {standard_name} is where you insert your own search term, and ‘no=100’ sets how many results are displayed on a single page.
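Search terms such as ‘BS EN 62305’ contain spaces, so it is safer to URL-encode them rather than paste them straight into the string. A minimal sketch using only the standard library follows. `build_search_url` is a hypothetical helper, the parameter values are copied from the URL above, and the site may well have changed its URL scheme since.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://shop.bsigroup.com/SearchResults/"


def build_search_url(standard_name, results_per_page=100):
    """Build the search URL, encoding the search term safely."""
    params = {
        "q": standard_name,      # the search term
        "pg": 1,                 # first page of results
        "no": results_per_page,  # results shown per page
        "c": 100,                # copied as-is from the URL in the post
        "t": "p",                # copied as-is from the URL in the post
    }
    return f"{SEARCH_BASE}?{urlencode(params)}"


print(build_search_url("BS EN 62305"))
# https://shop.bsigroup.com/SearchResults/?q=BS+EN+62305&pg=1&no=100&c=100&t=p
```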

The Python module ‘Requests’ fetches a copy of the page, and ‘Beautiful Soup’ then parses it so you can find the appropriate class within the HTML code. Using the ‘inspect’ function in your web browser is an easy way to drill down to the correct levels.

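import requests
from bs4 import BeautifulSoup


def return_list_of_standards(standard_name):
    # Fetch the search results page for this standard
    URL = f"https://shop.bsigroup.com/SearchResults/?q={standard_name}&pg=1&no=100&c=100&t=p"
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id='MainFrame')
    if results is None:
        # The page did not have the expected layout, so there is nothing to parse
        return []

    # Each search hit sits in its own 'resultsInd' div
    standards_lists = results.find_all('div', class_='resultsInd')
    results_list = []

    for result in standards_lists:
        # The title and status elements are located here; the full text of the hit is what gets stored
        title_name = result.find('h2', class_='H2SearchResultsTitle')
        title_status = result.find('span', class_='text12Grey')
        results_list.append(result.text.strip())

    return results_list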

Once this has completed its first run through, it has extracted the matching standards, such as those in the output below.

#3 – When querying the BSI website, we are looking to get the information back in the following format:

Name: 
Title: 
Publish Date:
Status: 

Writing the output to a text file gives the following:

Summary of Standards:

            

=======================================

ISO8528


Name: BS ISO 8528-5:2018 - TC
Title: BS ISO 8528-5:2018 - TC. Tracked Changes. Reciprocating internal combustion engine driven alternating current generating sets. Generating sets
Publish Date: 26/02/2020 
Status: Current  

Name: BS ISO 8528-2:2018 - TC
Title: BS ISO 8528-2:2018 - TC. Tracked Changes. Reciprocating internal combustion engine driven alternating current generating sets. Engines
Publish Date: 26/02/2020 
Status: Current  

Name: BS ISO 8528-1:2018 - TC
Title: BS ISO 8528-1:2018 - TC. Tracked Changes. Reciprocating internal combustion engine driven alternating current generating sets. Application, ratings and performance
Publish Date: 26/02/2020 
Status: Current  

.......
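A layout like the one above could be produced along the following lines. This is only a sketch: the `StandardResult` class and the function names are assumptions of mine, not taken from the full script.

```python
from dataclasses import dataclass


@dataclass
class StandardResult:
    """One search hit (hypothetical container for this example)."""
    name: str
    title: str
    publish_date: str
    status: str


def format_result(result):
    """Render one hit in the Name / Title / Publish Date / Status layout."""
    return (
        f"Name: {result.name}\n"
        f"Title: {result.title}\n"
        f"Publish Date: {result.publish_date}\n"
        f"Status: {result.status}\n"
    )


def format_summary(results_by_standard):
    """Render all hits, grouped under the standard that was searched for."""
    lines = ["Summary of Standards:", ""]
    for standard, results in results_by_standard.items():
        lines.append("=" * 39)  # separator line between standards
        lines.append("")
        lines.append(standard)
        lines.append("")
        for result in results:
            lines.append(format_result(result))
    return "\n".join(lines)
```

The resulting string can then be written out with something like `pathlib.Path("summary.txt").write_text(summary)`, where the file name is just an example.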

And from here you can do a quick check to make sure you don’t have any out of date standards. Yes – manually searching using the list. Well, there is no substitute for a person doing the final check…
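A first pass of that check could be automated by reading the summary back in and listing anything whose status is not ‘Current’. The sketch below assumes the text layout shown above. The function names are hypothetical, the second sample entry is made up, and anything it flags still needs a person to look at it.

```python
FIELDS = ("Name", "Title", "Publish Date", "Status")


def parse_summary(text):
    """Turn 'Name: / Title: / Publish Date: / Status:' blocks into dicts."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        for field in FIELDS:
            prefix = field + ":"
            if stripped.startswith(prefix):
                if field == "Name":
                    records.append({})  # a Name line starts a new record
                if records:
                    records[-1][field] = stripped[len(prefix):].strip()
                break
    return records


def needs_review(records):
    """Return the records whose status is anything other than 'Current'."""
    return [r for r in records if r.get("Status", "").lower() != "current"]


sample = """
Name: BS ISO 8528-5:2018 - TC
Title: Tracked Changes. Generating sets
Publish Date: 26/02/2020
Status: Current

Name: EXAMPLE 0000:1999
Title: Made-up entry for illustration only
Publish Date: 01/01/1999
Status: Superseded
"""
for record in needs_review(parse_summary(sample)):
    print(record["Name"], "->", record["Status"])
# EXAMPLE 0000:1999 -> Superseded
```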

The full code is here; hopefully you find it useful:

https://github.com/craig3050/StandardsSearch