@elrik

elrik@lemmy.world · 1 year

Even if it didn’t outright display the code you need to enter, my guess is that this and similar implementations hide further vulnerabilities, such as: the numbers aren’t generated with a cryptographically secure random number generator, the validation call doesn’t limit attempts so a simple brute force can quickly guess every possible number, or the number is sent to the client for validation.
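To make those three failure modes concrete, here is a minimal sketch of what avoiding them could look like on the server. The class name, the attempt limit, and the digit count are all assumptions for illustration, not anything taken from the implementation being discussed.

```python
import hmac
import secrets

MAX_ATTEMPTS = 5  # hypothetical limit; pick one that fits your threat model


class VerificationCode:
    """Server-side state for one issued code (hypothetical helper)."""

    def __init__(self, digits: int = 6) -> None:
        # secrets draws from the OS CSPRNG, unlike the random module.
        self._code = f"{secrets.randbelow(10 ** digits):0{digits}d}"
        self._attempts_left = MAX_ATTEMPTS

    def deliver(self) -> str:
        """Return the code for out-of-band delivery (e.g. email or SMS)."""
        return self._code

    def check(self, guess: str) -> bool:
        """Validate a guess; the code is unusable once attempts run out."""
        if self._attempts_left <= 0:
            return False
        self._attempts_left -= 1
        # Constant-time comparison; the code itself never leaves the server.
        ok = hmac.compare_digest(self._code.encode(), guess.encode())
        if ok:
            self._attempts_left = 0  # single use: a correct guess burns the code
        return ok
```

With only five attempts against a million possible six-digit codes, brute forcing a single issued code has roughly a 0.0005% chance of success.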

elrik@lemmy.world · 2 years

That’s correct, it is just plain text and it can easily be spoofed. You should never perform an auth check of any kind with the user agent.

In the above examples, it wouldn’t really matter if someone spoofed the header as there generally isn’t a benefit to the malicious agent.

Where some sites get into trouble though is if they have an implicit auth check using user agents. An example could be a paywalled recipe site. They want the recipe to be indexed by Google. If I spoof my user agent to be Googlebot, I’ll get to view the recipe content they want indexed, bypassing the paywall.

But, an example of a more reasonable use for checking user agent strings for bots might be regional redirects. If a new user comes to my site, maybe I want to redirect to a localized version at a different URL based on their country. However, I probably don’t want to do that if the agent is a bot, since the bot might be indexing a given URL from anywhere. If someone spoofed their user agent and they aren’t redirected, no big deal.
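A rough sketch of that regional redirect case, assuming a hypothetical country-to-path mapping and a hypothetical list of crawler tokens. The bot check trusts the header as-is, which is only acceptable here because a spoofed value gains the requester nothing.

```python
from typing import Optional

# Hypothetical substrings to look for; real crawlers document their own tokens.
BOT_TOKENS = ("googlebot", "bingbot", "duckduckbot")

# Hypothetical mapping from country code to a localized path.
LOCALIZED_PATHS = {"DE": "/de/", "FR": "/fr/"}


def claims_to_be_bot(user_agent: str) -> bool:
    """True if the header self-identifies as a known crawler (unverified)."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)


def regional_redirect(user_agent: str, country: str) -> Optional[str]:
    """Return a localized path to redirect to, or None to serve the URL as-is."""
    if claims_to_be_bot(user_agent):
        return None  # let crawlers index the URL they actually requested
    return LOCALIZED_PATHS.get(country)
```

Note that nothing here gates content: the worst a spoofed header achieves is skipping a redirect. The paywall example is different precisely because the same check would decide who gets to read the recipe.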

elrik@lemmy.world · 2 years

User agents are useful for checking if the request was made by a (legitimate self-identifying) bot, such as Googlebot.

They can also be used in some specific scenarios where you control the client and want to easily identify your client traffic in request logs.

Or maybe you offer a download on your site and you want to reorder your list to highlight the most likely correct binary for the platform in the user agent.
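For the download case, something like the following would do. It is a sketch only: the file names and the platform substrings are made up for illustration, and a wrong guess just means the list stays in its default order.

```python
from typing import List, Optional

# Hypothetical download list: (platform, filename).
DOWNLOADS = [
    ("linux", "app-linux.tar.gz"),
    ("mac", "app-macos.dmg"),
    ("windows", "app-windows.exe"),
]


def guess_platform(user_agent: str) -> Optional[str]:
    """Best-effort platform guess from the header; None if unsure."""
    ua = user_agent.lower()
    if "windows" in ua:
        return "windows"
    if "macintosh" in ua:
        return "mac"
    # Android user agents also mention Linux, so exclude them here.
    if "linux" in ua and "android" not in ua:
        return "linux"
    return None


def ordered_downloads(user_agent: str) -> List[str]:
    """Put the likely match first and keep every other option available."""
    platform = guess_platform(user_agent)
    # sorted() is stable, so non-matching entries keep their original order.
    ranked = sorted(DOWNLOADS, key=lambda item: item[0] != platform)
    return [filename for _, filename in ranked]
```

Every binary stays in the list, so a wrong or spoofed user agent only changes which one is shown first.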

There are plenty of reasonable uses for user agent that have nothing to do with feature detection.

elrik@lemmy.world · 2 years

JSON Problem Details

https://datatracker.ietf.org/doc/html/rfc9457

It has a specification, so a consumer of the API can immediately know what to expect.
It has a content type, so a client SDK can intelligently handle the response.
It supports commonly needed members which are a superset of all of the above JSON examples, including type for code and repeating the HTTP status code in the body if desired.
It is extensible if needed.
It has been defined since at least 2016 (originally as RFC 7807, which RFC 9457 obsoletes).
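As a rough illustration of the points above, here is a framework-agnostic sketch that builds a problem details response. The helper name and the example.com type URI are assumptions; the member names (type, title, status, detail) and the application/problem+json content type come from the RFC.

```python
import json
from typing import Any, Dict, Tuple

PROBLEM_JSON = "application/problem+json"


def problem_response(
    status: int,
    title: str,
    detail: str,
    type_uri: str = "about:blank",
    **extensions: Any,
) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, body) for a problem details error response."""
    # Extension members go first so they cannot overwrite the standard ones.
    body = {
        **extensions,
        "type": type_uri,
        "title": title,
        "status": status,  # repeats the HTTP status code in the body
        "detail": detail,
    }
    return status, {"Content-Type": PROBLEM_JSON}, json.dumps(body)


# Usage: a hypothetical "out of credit" error with one extension member.
status, headers, body = problem_response(
    403,
    "You do not have enough credit.",
    "Your current balance is 30, but that costs 50.",
    type_uri="https://example.com/probs/out-of-credit",
    balance=30,
)
```

A client can branch on the Content-Type header first and then on the type URI, without parsing free-form message strings.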

This specification’s aim is to define common error formats for applications that need one so that they aren’t required to define their own …

So why aren’t you using problem details?

elrik@lemmy.world · 2 years

This isn’t the evolution of C at all. It’s all just one language and you’re simply stuck in a lower dimension with a dimensionally compatible cross-section.

elrik@lemmy.world · 2 years

If the goal of training is to produce output that users “like” or engage with, then yes, upvoted content is higher quality. The definition of quality here will certainly depend on their goals.

My point is a bunch of spammed content intended to poison AI training is unlikely to gather upvotes, and so it could easily be filtered out if they’re also okay with discarding some human generated content that was not upvoted.

elrik@lemmy.world · 2 years

No, because the upvote ratio on posts and comments will be used to signal higher quality content.

It would take considerable effort and coordination to generate low-quality content, give it an upvote history that isn’t obviously suspicious, and do that for enough content that it actually matters to the training.

Even if you could accomplish that, you can’t backdate this activity, so they could simply filter out posts and comments after a recent date and still have an enormous amount of data to train on.
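A sketch of what such a filter might look like, purely as an illustration. The Post fields, the thresholds, and the cutoff handling are all assumptions on my part and not anything known about how a real training pipeline works.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Post:
    """Hypothetical record for a post or comment."""
    text: str
    upvotes: int
    downvotes: int
    created: datetime


def keep_for_training(
    post: Post,
    cutoff: datetime,
    min_ratio: float = 0.8,  # hypothetical threshold
    min_votes: int = 10,     # hypothetical threshold
) -> bool:
    """Keep only well-upvoted content created before the cutoff date."""
    total = post.upvotes + post.downvotes
    # Anything at or after the cutoff is dropped, since it can't be backdated.
    if post.created >= cutoff or total < min_votes:
        return False
    return post.upvotes / total >= min_ratio
```

This discards some genuine human content that never gathered votes, which is the trade-off mentioned above.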

elrik@lemmy.world · 2 years

I recently went through these exact pains trying to contribute to a project that exclusively ran through Discord and eventually had to give up when it was clear they would never enable issues in their GitHub repos for “reasons.”

It was impossible to discover the history behind anything. Even current information was lost within days, so aspects that had already been investigated and decided upon had to be rehashed.