
regex – Extracting bundles of comic book series and issues from a string of text in Python


Each row of my data contains a single string that bundles several comic book series and their issue numbers into one long block.

Examples are the following:

Example 1

“Batman #323, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17”

Example 2

“Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, 185, 213, 245, 326”

Note that “#nn” should be kept as-is in the output. If it makes things easier, I can replace “#nn” with “#00”.

I have been trying to do this with a regular expression (regex) in Python. For instance, I have tried:

r"([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)"

The code I have written is as follows:

import re

def separate_comic_books(comic_books_str):
    series_issue_dict = {}

    # Define a regular expression pattern to extract series and issue information
    pattern = re.compile(r'([a-zA-Z\s\'-]+) #(\d+|\d+-\d+|\w+)')

    # Split the string into individual comic book entries
    comic_books_list = re.split(r',\s*', comic_books_str)

    # Iterate through the list of comic books
    for comic_book in comic_books_list:
        matches = pattern.findall(comic_book)
        print(matches)

        for match in matches:
            series = match[0].strip()
            issues = match[1].strip()

            # Split the issues if it's a range
            issues_list = [str(i) for i in range(int(issues.split('-')[0]), int(issues.split('-')[-1]) + 1)]

            # Add the comic book to the dictionary based on series
            if series in series_issue_dict:
                series_issue_dict[series].extend(issues_list)
            else:
                series_issue_dict[series] = issues_list

    # Create the final formatted string
    formatted_comic_books = []
    for series, issues in series_issue_dict.items():
        formatted_issues=", ".join([f"{series} #{issue}" for issue in sorted(issues)])
        formatted_comic_books.append(formatted_issues)

    return ', '.join(formatted_comic_books)

# Provided string of comic books
comic_books_str = "Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, 185, 213, 245, 326"

result = separate_comic_books(comic_books_str)
print(result)

However, I am getting the following results:

Example 1

"Batman #323, Amazing Spider-Man #13"

Example 2

ValueError: invalid literal for int() with base 10: 'nn'
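
Both symptoms can be reproduced in isolation. The snippet below is only a diagnostic sketch: it reuses the pattern from the code above, and the `reordered` pattern is an illustrative variant, not a complete fix. It suggests three separate causes: regex alternation is tried left to right, so `\d+` matches before `\d+-\d+` is ever attempted; entries such as "325" contain no "#", so the pattern never matches them; and `int()` cannot convert "nn".

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"([a-zA-Z\s'-]+) #(\d+|\d+-\d+|\w+)")

# Alternation is tried left to right, so \d+ wins and the range is cut short.
first_match = pattern.findall("Amazing Spider-Man #13-17")
print(first_match)  # [('Amazing Spider-Man', '13')]

# Illustrative variant: listing the range alternative first captures the whole range.
reordered = re.compile(r"([a-zA-Z\s'-]+) #(\d+-\d+|\d+|\w+)")
range_match = reordered.findall("Amazing Spider-Man #13-17")
print(range_match)  # [('Amazing Spider-Man', '13-17')]

# After splitting on commas, a bare entry has no '#', so nothing is found.
bare_match = pattern.findall("325")
print(bare_match)  # []

# 'nn' is captured by \w+, but int() cannot convert it.
try:
    int("nn")
    nn_converts = True
except ValueError:
    nn_converts = False
print(nn_converts)  # False
```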

Instead, I would like to get the following results:

Example 1

Batman #323, Batman #325, Batman #335, Batman #340, Batman #368, Batman #369, Batman #397, Batman #398, Batman #399, Batman #400, Amazing Spider-Man #13, Amazing Spider-Man #14, Amazing Spider-Man #15, Amazing Spider-Man #16, Amazing Spider-Man #17

Example 2

Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, Amazing Spider-Man 174, Amazing Spider-Man 185, Amazing Spider-Man 213, Amazing Spider-Man 245, Amazing Spider-Man 326

Is there a way to write Python code that does this?

Thank you so much!!
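
One possible approach, sketched here under a few assumptions rather than offered as a definitive answer: split the string on commas, parse each entry with a single pattern, and carry the most recent series name forward to entries that list only an issue number. The names `TOKEN` and `expand_comic_bundle` are hypothetical. The sketch assumes that every entry ends in a number, a range such as "368-369", or the literal "nn"; that a bare entry inherits the series and the "#" style of the last named entry; and that the output keeps the input order.

```python
import re

# One comma-separated entry: optional series name, optional '#', then an
# issue number, an issue range, or the literal placeholder 'nn'.
TOKEN = re.compile(r"(?:(?P<series>.+?)\s+)?(?P<hash>#?)(?P<issue>\d+(?:-\d+)?|nn)")


def expand_comic_bundle(text):
    """Expand a bundled listing into one entry per issue, in input order."""
    entries = []
    series, prefix = None, "#"
    for token in re.split(r"\s*,\s*", text.strip()):
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"Unrecognized entry: {token!r}")
        if match["series"]:
            # A named entry resets the carried-over series and '#' style.
            series, prefix = match["series"], match["hash"]
        elif series is None:
            raise ValueError(f"Issue {token!r} appears before any series name")
        issue = match["issue"]
        if "-" in issue:
            # Expand an inclusive range such as '397-400'.
            first, last = map(int, issue.split("-"))
            issues = [str(n) for n in range(first, last + 1)]
        else:
            issues = [issue]  # a plain number or 'nn', kept verbatim
        entries.extend(f"{series} {prefix}{n}" for n in issues)
    return ", ".join(entries)


example_1 = "Batman #323, 325, 335, 340, 368-369, 397-400, Amazing Spider-Man #13-17"
example_2 = ("Amazing Spider-Man #nn, Amazing Spider-Man Annual #10, "
             "Amazing Spider-Man 174, 185, 213, 245, 326")
print(expand_comic_bundle(example_1))
print(expand_comic_bundle(example_2))
```

Because "nn" is never passed to `int()`, it does not need to be replaced with "00" first. Entries that do not fit the assumed shape raise a `ValueError` instead of being silently dropped, which may or may not suit messier data.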


