I am working on Folio while also helping out with a few small projects to help my friends learn Python. The program is an unnecessarily complex, over-engineered playground for me to practise different things, like software design, patterns, etc. A lot of this is because I want to learn how to write good software, and also, frankly, because I am looking for work as a software developer, so I figured trying to implement industry practices would be a great way to learn new stuff. I also thought I would create a few Issues that my friends could look at and, if they wanted to, contribute. In my mind, this is a perfect blend of learning, teaching, and practising while also getting to hang out with people I like. As a result, I labelled those issues with Good First Issue.
About 4 minutes after creating the labels, someone forked my repository! At first I found it exciting - someone is forking my personal project! Then I was a bit concerned; the impostor syndrome kicked in and I thought “Oh fuck, someone is forking my personal project”. No more than 5 minutes after that, I had my first Pull Request. The circle is complete. Someone forked my repo and tried to contribute!
Of course, it wasn’t a real pull request.
Five minutes to do all the work I defined in the Issue seemed excessively fast. Again, I am not a professional developer yet, but even a 100x developer would surely take more than a few minutes to write a full suite of integration tests, run them, and check them. Hell, it probably takes about 5 minutes to finish writing, pushing to GitHub, and clicking through menus to make a PR[^1].
Anyway, by the time I started to look at the files changed, I already suspected this was AI slop, and I was not disappointed. I think at this point a lot of us are getting relatively proficient at noticing when things are written by an LLM[^2]. The single commit added 425 lines of code, along with the comments and docstrings we’ve come to expect from the likes of Claude, ChatGPT, et al.
My favourite part was where the test functions called methods that do not exist in my code, but are generally available in tutorials and other public spaces. For example, my code uses the Unit of Work pattern via a UnitOfWork object that gets called from a service layer[^3]. The submitted PR, instead, did the classic sqlite3 thing:
```python
def test_address_constraints_unique_address(fake_db):
    """Test that UNIQUE constraints are enforced directly by the database
    for the combination of (start, end, street, city, province, country, postal_code)."""
    conn = fake_db
    cursor = conn.cursor()
    # First successful insert
    cursor.execute(
        """
        INSERT INTO addresses (start, end, street, city, province, country, postal_code, notes)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        """,
        (
            "1900-01-01",
            "1950-01-01",
            "123 Some Str",
            "Vancouver",
            "BC",
            "Canada",
            "V6Y 0A0",
            None,
        ),
    )
    conn.commit()
```

On its own and at a glance, I guess this can be passed off as written by a person. I guess. If you are willing to give the benefit of the doubt.
But again, this was over 400 lines of code doing similar things. And, critically, it simply does things like the whole `cursor = conn.cursor()` and then `cursor.execute(...)`, which in my program will simply not work. It’s not that using a cursor object is inherently wrong - indeed, the Python documentation for sqlite3 shows examples of exactly that pattern - it’s that this code would immediately fail any and every test run against everything else I’ve written.
A human would immediately know this. A shitty AI-slop bot does not. An LLM assumes you are doing the same thing everyone does, all the time, in the most superficial way possible.
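For contrast, here is a minimal sketch of the shape my code expects: a service layer that talks to a UnitOfWork, never to a raw cursor. To be clear, this is an illustration, not the actual Folio code - the class bodies, the AddressRepository name, and the add method are hypothetical stand-ins - but it shows why test code built around conn.cursor() bypasses everything the project actually does.

```python
import sqlite3


class AddressRepository:
    """Hypothetical repository - a stand-in for the real thing, not Folio's API."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def add(self, street: str, city: str) -> None:
        self._conn.execute(
            "INSERT INTO addresses (street, city) VALUES (?, ?)", (street, city)
        )


class UnitOfWork:
    """Commits on success, rolls back on error; callers never touch a cursor."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def __enter__(self):
        self.addresses = AddressRepository(self._conn)
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._conn.commit()
        else:
            self._conn.rollback()


# The service layer (and therefore the tests) goes through the UnitOfWork,
# never through cursor.execute() on the connection directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (street TEXT, city TEXT)")
with UnitOfWork(conn) as uow:
    uow.addresses.add("123 Some Str", "Vancouver")
```

Any test that exercises the real code would go through something like that context manager; the generated tests never could.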
What I think happened
The problem, I think, starts with the age-old adage of “if you want to be a developer, contribute to FOSS projects!”. Probably good advice, mind you, but in a rather saturated market, with tons of people like me trying to carve out a space for ourselves, I think this hits the limit of what is reasonable. How many of those contributions to FOSS projects are for things the contributor actually cares about? How much of it is for the sake of posturing and including a LinkedIn reference that says “Look at me! I contributed to $FOSS_PROJECT!”?
Anyway, if you are looking to contribute to something and you’re just starting, going to GitHub and finding open issues with the Good First Issue label seems like a reasonable thing to do. A simple GitHub search for good first issue returns over 350 repositories. There are also countless pages for searching for and finding things marked Good First Issue, e.g. from a very simple search:
https://goodfirstissue.dev/
https://goodfirstissues.com/
https://forgoodfirstissue.github.com/
So what happens when I create an issue on my public repository and apply the label Good First Issue? Well, it gets thrown into those lists. In theory this is good - people can search for this kind of thing, contribute, and learn. But the reality now seems to be that this type of aggregation can be, and is, used by bots, which can then start to “contribute” to projects in the most inane way possible. It means they can make PRs on every project that presumably used the Good First Issue label as a way to onboard new contributors.
It means people who maintain those repositories also have to deal with LLM-generated crap. This is, I suppose, no different from all the complaints about LLMs and crawlers accessing every website over and over.
In any case, I think once I added the label to the issues on my project, the issues bubbled into those lists and searches. From there, they got picked up by this bot, which is configured to look for that specific label. From their repository[^4]:
```python
SEARCH_QUERY: str = 'is:issue is:open label:"good first issue"'
MAX_ISSUE_AGE_DAYS: int = 60
MAX_ISSUE_COMMENTS: int = 5
```

It then “processes” the repository by forking, cloning, and then adding fixes that lack any sort of human context. At least, I guess, the person who made this in the first place took enough care to add a friendly little warning to their PRs: “Please review carefully before merging.”
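Just to show how little machinery this kind of trawling needs, here is roughly what that query amounts to against GitHub’s public search API. This is my own sketch using the standard REST endpoint, not the bot’s actual code:

```python
import json
import urllib.parse
import urllib.request

# The same kind of query the bot's SEARCH_QUERY encodes, sent to the public
# GitHub search endpoint. Unauthenticated requests are heavily rate-limited,
# but that hardly matters at this scale.
query = 'is:issue is:open label:"good first issue"'
url = "https://api.github.com/search/issues?" + urllib.parse.urlencode(
    {"q": query, "per_page": 5}
)
req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})

with urllib.request.urlopen(req) as resp:
    results = json.load(resp)

for item in results.get("items", []):
    print(item["html_url"], "-", item["title"])
```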
So what?
This is a “user” that has had 285 contributions in the last year - all of them in the past 4 days or so. They forked over 160 repositories in the past 3 days.
I think my problem with this isn’t the creation of the bot in itself. I mean, if you wanted to learn how to use an LLM, create a project, use MCP, all of that - well, kudos to you! But for the love of God, let it run amok in your own repositories. The offensive part is that they simply let it operate on all of GitHub, and in particular on the Good First Issue label that was designed as an on-ramp to projects. I mean, I am now unwilling to apply the label to any open issue in my projects. I don’t want to interact with uninvited bot spam like this. It also, I suspect, doesn’t stop there: what happens when the `SEARCH_QUERY` is simply `is:issue is:open`? At that point I may as well make my projects private.
Perhaps I am exaggerating the problem; perhaps it’s not all doom and gloom, and the consequences are not all that bad. But it does feel like it cheapens and devalues both the time and work of open source projects and the effort of potential first-time contributors. Having to deal with this is yet another abuse of the open web, of interoperability, of having technology for humans.
And look, it’s not like I’m advocating for not using LLMs. I certainly use them, and fairly regularly. Sometimes I’m stuck on a problem for so long I just ask for suggestions (and of course I read and modify the code!), sometimes I use them simply to detect logical issues, sometimes I use them to search for things, sometimes I don’t feel like writing something and I ask for a ready-made function (for things I have done before). But they are a tool, a thing I use to accomplish other things, not to spam people and issues and promote garbage.
So for now, at least, my solution is that Good First Issues will not exist in my repositories. I will not mark anything with that label. I will make whatever I do less inviting, which sucks, but I guess we are already losing the battle on this anyway - just look at email, or web access in general.
Footnotes
[^1]: I suppose you could do all of this with the `gh` command-line tool, but still.

[^2]: Although as a topic for a different post, I think we need to coin a new term for the ad-hominem attack of “this is written by AI” which is prevalent in a lot of forums these days. It is obviously an issue, but I wonder how often describing a commenter/author as simply vomiting AI slop is a way to not engage with what they say.

[^3]: Like I said, excessively complex for what this is, but I wanted to learn about these patterns.

[^4]: This repository is also mostly made by an LLM.